* [PATCH 0/2 -mm] oom reaper v4
@ 2016-01-06 15:42 ` Michal Hocko
  0 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2016-01-06 15:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Tetsuo Handa, David Rientjes, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

Hi Andrew,
the number of -fix patches for v3 of the patch [1] has grown
quite a bit, so this is a drop-in replacement for
mm-oom-introduce-oom-reaper.patch
mm-oom-introduce-oom-reaper-fix.patch
mm-oom-introduce-oom-reaper-fix-fix.patch
mm-oom-introduce-oom-reaper-fix-fix-2.patch
mm-oom-introduce-oom-reaper-checkpatch-fixes.patch
mm-oom-introduce-oom-reaper-fix-3.patch
mm-oom-introduce-oom-reaper-fix-4.patch
mm-oom-introduce-oom-reaper-fix-4-fix.patch
mm-oom-introduce-oom-reaper-fix-5.patch
mm-oom-introduce-oom-reaper-fix-5-fix.patch
mm-oom-introduce-oom-reaper-fix-6.patch

I believe this should make further review easier. I have put an
additional patch on top which allows munlocking and unmapping anonymous
mappings as well. It is a separate patch to make bisection easier.

[1] http://lkml.kernel.org/r/1450204575-13052-1-git-send-email-mhocko%40kernel.org

Michal Hocko (2):
      mm, oom: introduce oom reaper
      oom reaper: handle anonymous mlocked pages

Diffstat says:
 include/linux/mm.h |   2 +
 mm/internal.h      |   5 ++
 mm/memory.c        |  17 +++---
 mm/oom_kill.c      | 162 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 175 insertions(+), 11 deletions(-)



^ permalink raw reply	[flat|nested] 56+ messages in thread


* [PATCH 1/2] mm, oom: introduce oom reaper
  2016-01-06 15:42 ` Michal Hocko
@ 2016-01-06 15:42   ` Michal Hocko
  -1 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2016-01-06 15:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Tetsuo Handa, David Rientjes, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
independently brought up by Oleg Nesterov.

The OOM killer currently kills only a single task in the hope that
the task will terminate in a reasonable time and free up its memory.
Such a task (oom victim) gets access to memory reserves via
mark_oom_victim to allow forward progress should there be a need for
additional memory during the exit path.

It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above and
the OOM victim might take an unbounded amount of time to exit because it
might be blocked in the uninterruptible state waiting on an event
(e.g. a lock) which is blocked by another task looping in the page
allocator.

This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped out memory
owned by the oom victim, under the assumption that such memory won't
be needed when its owner is killed and kicked from userspace anyway.
There is one notable exception to this, though: if the OOM victim was
in the process of coredumping, the result would be incomplete. This is
considered a reasonable constraint because the overall system health is
more important than debuggability of a particular application.

A kernel thread has been chosen because we need a reliable way of
invocation, so workqueue context is not appropriate because all the
workers might be busy (e.g. allocating memory). Kswapd, which sounds
like another good fit, is not appropriate either because it might get
blocked on locks during reclaim.

oom_reaper has to take mmap_sem on the target task for reading, so the
solution is not 100% reliable because the semaphore might be held or
blocked for write, but the probability is reduced considerably compared
to basically any lock blocking forward progress as described above. In
order to avoid blocking on the lock without any forward progress we use
only a trylock and retry 10 times with a short sleep in between.
Users of mmap_sem which need it for write should be carefully reviewed
to use _killable waiting as much as possible and to reduce allocation
requests done with the lock held to the absolute minimum to reduce the
risk even further.
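
The bounded trylock-and-retry policy described above can be sketched in
userspace C. This is only an illustration, not the kernel code: the
toy_rwlock type, the work callback and the backoff hook are hypothetical
stand-ins for mmap_sem, the actual unmapping and schedule_timeout_idle.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy reader/writer lock: -1 means write-locked, n >= 0 means n readers. */
struct toy_rwlock {
	atomic_int state;
};

/* Non-blocking read acquire, analogous in spirit to down_read_trylock(). */
static bool toy_read_trylock(struct toy_rwlock *l)
{
	int cur = atomic_load(&l->state);

	while (cur >= 0) {
		/* A failed exchange reloads cur with the current value. */
		if (atomic_compare_exchange_weak(&l->state, &cur, cur + 1))
			return true;
	}
	return false;	/* a writer holds the lock; never wait for it */
}

static void toy_read_unlock(struct toy_rwlock *l)
{
	atomic_fetch_sub(&l->state, 1);
}

/* Example work callback: counts how often the "reaping" actually ran. */
static void count_call(void *arg)
{
	(*(int *)arg)++;
}

/*
 * Run work(arg) under the read lock, trying at most max_attempts times.
 * backoff (may be NULL) is called after every failed attempt.
 * Returns true if the work ran, false if every attempt found a writer.
 */
static bool reap_with_retries(struct toy_rwlock *l, int max_attempts,
			      void (*work)(void *), void *arg,
			      void (*backoff)(void))
{
	assert(work != NULL);

	for (int attempt = 0; attempt < max_attempts; attempt++) {
		if (toy_read_trylock(l)) {
			work(arg);
			toy_read_unlock(l);
			return true;
		}
		if (backoff)
			backoff();
	}
	return false;
}
```

A caller would pass 10 as max_attempts and a short sleep as backoff to
mirror the policy in this patch. The point of the design is that the
reaper never sleeps on the lock itself, so a stuck writer costs it a
bounded amount of time.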

The API between the oom killer and the oom reaper is quite trivial.
wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only a
NULL->mm transition and oom_reaper clears this atomically once it is
done with the work. This means that only a single mm_struct can be
reaped at a time. As the operation is potentially disruptive we are
trying to limit it to the necessary minimum and the reaper blocks any
updates while it operates on an mm. mm_struct is pinned by mm_count to
allow parallel exit_mmap and a race is detected by
atomic_inc_not_zero(mm_users).
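
The single-slot handoff and the two reference counts can be modelled
with C11 atomics. This is a sketch under stated assumptions, not the
kernel implementation: toy_mm, pending_mm and the toy_* functions are
hypothetical names, and the actual reaping is elided.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for mm_struct with its two reference counts. */
struct toy_mm {
	atomic_int users;	/* like mm_users: users of the address space */
	atomic_int count;	/* like mm_count: pins the structure itself */
};

/* Single pending slot, playing the role of mm_to_reap. */
static _Atomic(struct toy_mm *) pending_mm;

/* Producer side: queue mm unless another one is already pending. */
static bool toy_wake_reaper(struct toy_mm *mm)
{
	struct toy_mm *expected = NULL;

	assert(mm != NULL);
	atomic_fetch_add(&mm->count, 1);	/* pin before publishing */
	if (atomic_compare_exchange_strong(&pending_mm, &expected, mm))
		return true;			/* NULL -> mm; reaper owns the pin */
	atomic_fetch_sub(&mm->count, 1);	/* slot busy: drop the pin again */
	return false;
}

/* Increment users only if it has not already dropped to zero. */
static bool toy_get_users_not_zero(struct toy_mm *mm)
{
	int cur = atomic_load(&mm->users);

	while (cur != 0) {
		/* A failed exchange reloads cur with the current value. */
		if (atomic_compare_exchange_weak(&mm->users, &cur, cur + 1))
			return true;
	}
	return false;
}

/*
 * Consumer side: handle the pending mm, if any. Returns true when the
 * address space was still in use, i.e. when reaping would have run.
 */
static bool toy_reaper_step(void)
{
	struct toy_mm *mm = atomic_load(&pending_mm);
	bool reaped = false;

	if (!mm)
		return false;
	if (toy_get_users_not_zero(mm)) {
		reaped = true;			/* real unmapping would go here */
		atomic_fetch_sub(&mm->users, 1);
	}
	atomic_fetch_sub(&mm->count, 1);	/* drop the producer's pin */
	atomic_store(&pending_mm, NULL);	/* reopen the slot */
	return reaped;
}
```

Pinning via count rather than users keeps the structure alive without
delaying the victim's own teardown, and the not-zero increment is what
lets the consumer notice that teardown has already started.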

Changes since v3
- many style/compile fixups by Andrew
- unmap_mapping_range_tree needs full initialization of zap_details
  to prevent missing unmaps and a follow-up BUG_ON during truncate,
  or misaccounting - Kirill/Andrew
- exclude mlocked pages because they need an explicit munlock - Kirill
- use subsys_initcall instead of module_init - Paul Gortmaker
Changes since v2
- fix mm_count reference leak reported by Tetsuo
- make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
- use wait_event_freezable rather than open coded wait loop - suggested
  by Tetsuo
Changes since v1
- fix the screwed up detail->check_swap_entries - Johannes
- do not use kthread_should_stop because that would need a cleanup
  and we do not have anybody to stop us - Tetsuo
- move wake_oom_reaper to oom_kill_process because we have to wait
  for all tasks sharing the same mm to get killed - Tetsuo
- do not reap mm structs which are shared with unkillable tasks - Tetsuo

Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h |   2 +
 mm/internal.h      |   5 ++
 mm/memory.c        |  17 +++---
 mm/oom_kill.c      | 157 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 170 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 25cdec395f2c..d1ce03569942 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1061,6 +1061,8 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
+	bool ignore_dirty;			/* Ignore dirty pages */
+	bool check_swap_entries;		/* Check also swap entries */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/internal.h b/mm/internal.h
index 4ae7b7c7462b..9006ce1960ff 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -41,6 +41,11 @@ extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
+void unmap_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end,
+			     struct zap_details *details);
+
 static inline void set_page_count(struct page *page, int v)
 {
 	atomic_set(&page->_count, v);
diff --git a/mm/memory.c b/mm/memory.c
index f5b8e8c9f4c3..f60c6d6aa633 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1104,6 +1104,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			if (!PageAnon(page)) {
 				if (pte_dirty(ptent)) {
+					/*
+					 * oom_reaper cannot tear down dirty
+					 * pages
+					 */
+					if (unlikely(details && details->ignore_dirty))
+						continue;
 					force_flush = 1;
 					set_page_dirty(page);
 				}
@@ -1122,8 +1128,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			}
 			continue;
 		}
-		/* If details->check_mapping, we leave swap entries. */
-		if (unlikely(details))
+		/* only check swap_entries if explicitly asked for in details */
+		if (unlikely(details && !details->check_swap_entries))
 			continue;
 
 		entry = pte_to_swp_entry(ptent);
@@ -1228,7 +1234,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 	return addr;
 }
 
-static void unmap_page_range(struct mmu_gather *tlb,
+void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
 			     struct zap_details *details)
@@ -1236,9 +1242,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
 	pgd_t *pgd;
 	unsigned long next;
 
-	if (details && !details->check_mapping)
-		details = NULL;
-
 	BUG_ON(addr >= end);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
@@ -2393,7 +2396,7 @@ static inline void unmap_mapping_range_tree(struct rb_root *root,
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows)
 {
-	struct zap_details details;
+	struct zap_details details = { };
 	pgoff_t hba = holebegin >> PAGE_SHIFT;
 	pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dc490c06941b..1ece40b94725 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,6 +35,11 @@
 #include <linux/freezer.h>
 #include <linux/ftrace.h>
 #include <linux/ratelimit.h>
+#include <linux/kthread.h>
+#include <linux/init.h>
+
+#include <asm/tlb.h>
+#include "internal.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/oom.h>
@@ -408,6 +413,141 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
 bool oom_killer_disabled __read_mostly;
 
+#ifdef CONFIG_MMU
+/*
+ * OOM Reaper kernel thread which tries to reap the memory used by the OOM
+ * victim (if that is possible) to help the OOM killer to move on.
+ */
+static struct task_struct *oom_reaper_th;
+static struct mm_struct *mm_to_reap;
+static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
+
+static bool __oom_reap_vmas(struct mm_struct *mm)
+{
+	struct mmu_gather tlb;
+	struct vm_area_struct *vma;
+	struct zap_details details = {.check_swap_entries = true,
+				      .ignore_dirty = true};
+	bool ret = true;
+
+	/* We might have raced with exit path */
+	if (!atomic_inc_not_zero(&mm->mm_users))
+		return true;
+
+	if (!down_read_trylock(&mm->mmap_sem)) {
+		ret = false;
+		goto out;
+	}
+
+	tlb_gather_mmu(&tlb, mm, 0, -1);
+	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+		if (is_vm_hugetlb_page(vma))
+			continue;
+
+		/*
+		 * mlocked VMAs require explicit munlocking before unmap.
+		 * Let's keep it simple here and skip such VMAs.
+		 */
+		if (vma->vm_flags & VM_LOCKED)
+			continue;
+
+		/*
+		 * Only anonymous pages have a good chance to be dropped
+		 * without additional steps which we cannot afford as we
+		 * are OOM already.
+		 *
+		 * We do not even care about fs backed pages because all
+		 * which are reclaimable have already been reclaimed and
+		 * we do not want to block exit_mmap by keeping mm ref
+		 * count elevated without a good reason.
+		 */
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
+					 &details);
+	}
+	tlb_finish_mmu(&tlb, 0, -1);
+	up_read(&mm->mmap_sem);
+out:
+	mmput(mm);
+	return ret;
+}
+
+static void oom_reap_vmas(struct mm_struct *mm)
+{
+	int attempts = 0;
+
+	/* Retry the down_read_trylock(mmap_sem) a few times */
+	while (attempts++ < 10 && !__oom_reap_vmas(mm))
+		schedule_timeout_idle(HZ/10);
+
+	/* Drop a reference taken by wake_oom_reaper */
+	mmdrop(mm);
+}
+
+static int oom_reaper(void *unused)
+{
+	while (true) {
+		struct mm_struct *mm;
+
+		wait_event_freezable(oom_reaper_wait,
+				     (mm = READ_ONCE(mm_to_reap)));
+		oom_reap_vmas(mm);
+		WRITE_ONCE(mm_to_reap, NULL);
+	}
+
+	return 0;
+}
+
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+	struct mm_struct *old_mm;
+
+	if (!oom_reaper_th)
+		return;
+
+	/*
+	 * Pin the given mm. Use mm_count instead of mm_users because
+	 * we do not want to delay the address space tear down.
+	 */
+	atomic_inc(&mm->mm_count);
+
+	/*
+	 * Make sure that only a single mm is ever queued for the reaper
+	 * because multiple are not necessary and the operation might be
+	 * disruptive so better reduce it to the bare minimum.
+	 */
+	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
+	if (!old_mm)
+		wake_up(&oom_reaper_wait);
+	else
+		mmdrop(mm);
+}
+
+static int __init oom_init(void)
+{
+	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+	if (IS_ERR(oom_reaper_th)) {
+		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
+				PTR_ERR(oom_reaper_th));
+		oom_reaper_th = NULL;
+	} else {
+		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+
+		/*
+		 * Make sure our oom reaper thread will get scheduled
+		 * ASAP and that it won't get preempted by malicious userspace.
+		 */
+		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
+	}
+	return 0;
+}
+subsys_initcall(oom_init);
+#else
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+}
+#endif
+
 /**
  * mark_oom_victim - mark the given task as OOM victim
  * @tsk: task to mark
@@ -517,6 +657,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
+	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -607,17 +748,25 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD))
-			continue;
 		if (is_global_init(p))
 			continue;
-		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		if (unlikely(p->flags & PF_KTHREAD) ||
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+			/*
+			 * We cannot use oom_reaper for the mm shared by this
+			 * process because it wouldn't get killed and so the
+			 * memory might be still used.
+			 */
+			can_oom_reap = false;
 			continue;
-
+		}
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
+	if (can_oom_reap)
+		wake_oom_reaper(mm);
+
 	mmdrop(mm);
 	put_task_struct(victim);
 }
-- 
2.6.4


^ permalink raw reply	[flat|nested] 56+ messages in thread


* [PATCH 2/2] oom reaper: handle anonymous mlocked pages
  2016-01-06 15:42 ` Michal Hocko
@ 2016-01-06 15:42   ` Michal Hocko
  -1 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2016-01-06 15:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Tetsuo Handa, David Rientjes, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__oom_reap_vmas currently skips over all mlocked vmas because they need
special treatment before they are unmapped. This is primarily done
for simplicity. There is no reason to skip over them for all mappings,
though, and so reduce the amount of reclaimed memory. Anonymous mappings
are not visible to any other process, so doing a munlock before unmap
is safe from the semantic point of view. munlock_vma_pages_all
is also safe to call from the oom reaper context because it
doesn't sit on any locks but mmap_sem (for read).
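
The resulting per-VMA decision can be summarized as a small pure
function. This is an illustrative userspace model only: toy_vma, the
TOY_VM_* flags and the reap_action values are hypothetical and merely
mirror the order of checks in __oom_reap_vmas after this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits standing in for VM_SHARED, VM_LOCKED and hugetlb. */
enum {
	TOY_VM_SHARED	= 1 << 0,
	TOY_VM_LOCKED	= 1 << 1,
	TOY_VM_HUGETLB	= 1 << 2,
};

struct toy_vma {
	unsigned int flags;
	bool anonymous;		/* what vma_is_anonymous() would report */
};

enum reap_action {
	REAP_SKIP,			/* leave the mapping alone */
	REAP_UNMAP,			/* unmap directly */
	REAP_MUNLOCK_THEN_UNMAP,	/* munlock first, then unmap */
};

/* Decide what the reaper would do with one VMA, in the patch's order. */
static enum reap_action classify_vma(const struct toy_vma *vma)
{
	bool need_munlock = false;

	assert(vma != NULL);

	if (vma->flags & TOY_VM_HUGETLB)
		return REAP_SKIP;

	if (vma->flags & TOY_VM_LOCKED) {
		/* Only anonymous mlocked mappings are safe to munlock here. */
		if (!vma->anonymous)
			return REAP_SKIP;
		need_munlock = true;
	}

	/* Anonymous or private mappings are the ones worth reaping. */
	if (vma->anonymous || !(vma->flags & TOY_VM_SHARED))
		return need_munlock ? REAP_MUNLOCK_THEN_UNMAP : REAP_UNMAP;

	return REAP_SKIP;
}
```

In this model the only behavioural change relative to patch 1 is the
anonymous mlocked case, which moves from REAP_SKIP to
REAP_MUNLOCK_THEN_UNMAP.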

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ece40b94725..913b68a44fd4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -445,11 +445,16 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
 			continue;
 
 		/*
-		 * mlocked VMAs require explicit munlocking before unmap.
-		 * Let's keep it simple here and skip such VMAs.
+		 * mlocked VMAs require explicit munlocking before unmap
+		 * and that is safe only for anonymous mappings because
+		 * nobody except for the victim will need them locked
 		 */
-		if (vma->vm_flags & VM_LOCKED)
-			continue;
+		if (vma->vm_flags & VM_LOCKED) {
+			if (vma_is_anonymous(vma))
+				munlock_vma_pages_all(vma);
+			else
+				continue;
+		}
 
 		/*
 		 * Only anonymous pages have a good chance to be dropped
-- 
2.6.4


^ permalink raw reply	[flat|nested] 56+ messages in thread


* Re: [PATCH 2/2] oom reaper: handle anonymous mlocked pages
  2016-01-06 15:42   ` Michal Hocko
@ 2016-01-07  8:14     ` Michal Hocko
  -1 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2016-01-07  8:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Tetsuo Handa, David Rientjes, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Wed 06-01-16 16:42:55, Michal Hocko wrote:
> Anonymous mappings
> are not visible by any other process so doing a munlock before unmap
> is safe to do from the semantic point of view.

I was too conservative here. I have completely forgotten about the lazy
mlock handling during try_to_unmap which would keep the page mlocked if
there is an mlocked vma mapping that page. So we can safely do what I
was proposing originally. I hope I am not missing anything now. Here is
the replacement patch
---
From 9aa92fc1c7f0f1c55d2efab0239dbb10a9dce001 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 6 Jan 2016 10:48:39 +0100
Subject: [PATCH] oom reaper: handle mlocked pages

__oom_reap_vmas currently skips over all mlocked vmas because they need
special treatment before they are unmapped. This is primarily done for
simplicity. There is no reason to skip over them and reduce the amount
of reclaimed memory. This is safe from the semantic point of view
because try_to_unmap_one during rmap walk would tell the reclaim
to cull the page back and mlock it again.

munlock_vma_pages_all is also safe to be called from the oom reaper
context because it doesn't sit on any locks but mmap_sem (for read).

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ece40b94725..0e4af31db96f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -445,13 +445,6 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
 			continue;
 
 		/*
-		 * mlocked VMAs require explicit munlocking before unmap.
-		 * Let's keep it simple here and skip such VMAs.
-		 */
-		if (vma->vm_flags & VM_LOCKED)
-			continue;
-
-		/*
 		 * Only anonymous pages have a good chance to be dropped
 		 * without additional steps which we cannot afford as we
 		 * are OOM already.
@@ -461,9 +454,12 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
 		 * we do not want to block exit_mmap by keeping mm ref
 		 * count elevated without a good reason.
 		 */
-		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
+			if (vma->vm_flags & VM_LOCKED)
+				munlock_vma_pages_all(vma);
 			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
 					 &details);
+		}
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
 	up_read(&mm->mmap_sem);
-- 
2.6.4

-- 
Michal Hocko
SUSE Labs
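
For readers following along, the per-VMA decision that results from the
replacement patch above can be modelled in userspace roughly as below.
This is only a sketch under stated assumptions: struct fake_vma, the
FAKE_VM_* values, enum reap_action and reap_decision() are invented for
illustration and are not the kernel definitions; the booleans stand in
for vma_is_anonymous() and is_vm_hugetlb_page().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's vm_flags bits. */
#define FAKE_VM_SHARED 0x1u
#define FAKE_VM_LOCKED 0x2u

struct fake_vma {
	unsigned int flags;
	bool anonymous;	/* models vma_is_anonymous() */
	bool hugetlb;	/* models is_vm_hugetlb_page() */
};

enum reap_action { REAP_SKIP, REAP_UNMAP, REAP_MUNLOCK_THEN_UNMAP };

/* Decide what the reaper loop would do with one VMA. */
static enum reap_action reap_decision(const struct fake_vma *vma)
{
	assert(vma != NULL);

	/* hugetlb mappings are never touched */
	if (vma->hugetlb)
		return REAP_SKIP;

	/* shared file-backed mappings are left alone */
	if (!vma->anonymous && (vma->flags & FAKE_VM_SHARED))
		return REAP_SKIP;

	/* mlocked VMAs are munlocked first, then unmapped like the rest */
	if (vma->flags & FAKE_VM_LOCKED)
		return REAP_MUNLOCK_THEN_UNMAP;

	return REAP_UNMAP;
}
```

In this model an mlocked anonymous or private mapping yields
REAP_MUNLOCK_THEN_UNMAP, where before the patch it would have been
skipped.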

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-01-06 15:42   ` Michal Hocko
@ 2016-01-07 11:23     ` Tetsuo Handa
  -1 siblings, 0 replies; 56+ messages in thread
From: Tetsuo Handa @ 2016-01-07 11:23 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: mgorman, rientjes, torvalds, oleg, hughd, andrea, riel, linux-mm,
	linux-kernel, mhocko

Michal Hocko wrote:
> @@ -607,17 +748,25 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			continue;
>  		if (same_thread_group(p, victim))
>  			continue;
> -		if (unlikely(p->flags & PF_KTHREAD))
> -			continue;
>  		if (is_global_init(p))
>  			continue;
> -		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> +		if (unlikely(p->flags & PF_KTHREAD) ||
> +		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> +			/*
> +			 * We cannot use oom_reaper for the mm shared by this
> +			 * process because it wouldn't get killed and so the
> +			 * memory might be still used.
> +			 */
> +			can_oom_reap = false;
>  			continue;
> -
> +		}
>  		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
>  	}
>  	rcu_read_unlock();

According to commit a2b829d95958da20 ("mm/oom_kill.c: avoid attempting
to kill init sharing same memory"), the patch below is needed to avoid
killing the init process with SIGSEGV.

----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9548dce..9832f3f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -784,9 +784,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
  			continue;
  		if (same_thread_group(p, victim))
  			continue;
-		if (is_global_init(p))
-			continue;
-		if (unlikely(p->flags & PF_KTHREAD) ||
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
  		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
  			/*
  			 * We cannot use oom_reaper for the mm shared by this
----------

----------
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>

static int child(void *unused)
{
	char *buf = NULL;
	unsigned long i;
	unsigned long size = 0;
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	for (i = 0; i < size; i += 4096)
		buf[i] = '\0'; /* Will cause OOM due to overcommit */
	return 0;
}

int main(int argc, char *argv[])
{
	char *cp = malloc(8192);
	if (cp && clone(child, cp + 8192, CLONE_VM, NULL) > 0)
		while (1) {
			sleep(1);
			write(1, cp, 1);
		}
	return 0;
}
----------
[    2.954212] init invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO)
[    2.959697] init cpuset=/ mems_allowed=0
[    2.961927] CPU: 0 PID: 98 Comm: init Not tainted 4.4.0-rc8-next-20160106+ #28
[    2.965738] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[    2.971239]  0000000000000000 0000000075c7a38e ffffffff812ab8c4 ffff88003bd6fd48
[    2.975461]  ffffffff8117eb58 0000000000000000 ffff88003bd6fd48 0000000000000000
[    2.979572]  ffffffff810c5630 0000000000000003 0000000000000202 0000000000000549
[    2.983525] Call Trace:
[    2.984813]  [<ffffffff812ab8c4>] ? dump_stack+0x40/0x5c
[    2.987497]  [<ffffffff8117eb58>] ? dump_header+0x58/0x1ed
[    2.990285]  [<ffffffff810c5630>] ? ktime_get+0x30/0x90
[    2.992963]  [<ffffffff810fd225>] ? delayacct_end+0x35/0x60
[    2.995884]  [<ffffffff81113dc3>] ? oom_kill_process+0x323/0x460
[    2.998944]  [<ffffffff81114060>] ? out_of_memory+0x110/0x480
[    3.001833]  [<ffffffff811197ad>] ? __alloc_pages_nodemask+0xbbd/0xd60
[    3.005400]  [<ffffffff8115d951>] ? alloc_pages_vma+0xb1/0x220
[    3.008391]  [<ffffffff811780ac>] ? mem_cgroup_commit_charge+0x7c/0xf0
[    3.011668]  [<ffffffff8113ce86>] ? handle_mm_fault+0x1036/0x1460
[    3.014782]  [<ffffffff81056c97>] ? __do_page_fault+0x177/0x430
[    3.017770]  [<ffffffff81056f7b>] ? do_page_fault+0x2b/0x70
[    3.020615]  [<ffffffff815a9198>] ? page_fault+0x28/0x30
[    3.023359] Mem-Info:
[    3.024575] active_anon:244334 inactive_anon:0 isolated_anon:0
[    3.024575]  active_file:0 inactive_file:0 isolated_file:0
[    3.024575]  unevictable:561 dirty:0 writeback:0 unstable:0
[    3.024575]  slab_reclaimable:94 slab_unreclaimable:2386
[    3.024575]  mapped:275 shmem:0 pagetables:477 bounce:0
[    3.024575]  free:1924 free_pcp:304 free_cma:0
[    3.040715] Node 0 DMA free:3936kB min:60kB low:72kB high:88kB active_anon:11260kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4kB slab_unreclaimable:64kB kernel_stack:0kB pagetables:564kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[    3.062251] lowmem_reserve[]: 0 969 969 969
[    3.064752] Node 0 DMA32 free:3760kB min:3812kB low:4764kB high:5716kB active_anon:966076kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:2244kB isolated(anon):0kB isolated(file):0kB present:1032064kB managed:994872kB mlocked:0kB dirty:0kB writeback:0kB mapped:1100kB shmem:0kB slab_reclaimable:372kB slab_unreclaimable:9480kB kernel_stack:2192kB pagetables:1344kB unstable:0kB bounce:0kB free_pcp:1216kB local_pcp:244kB free_cma:0kB writeback_tmp:0kB pages_scanned:2244 all_unreclaimable? yes
[    3.087299] lowmem_reserve[]: 0 0 0 0
[    3.089437] Node 0 DMA: 2*4kB (ME) 1*8kB (E) 3*16kB (UME) 3*32kB (UME) 3*64kB (UME) 2*128kB (ME) 3*256kB (UME) 3*512kB (UME) 1*1024kB (E) 0*2048kB 0*4096kB = 3936kB
[    3.098058] Node 0 DMA32: 4*4kB (UME) 4*8kB (UME) 2*16kB (UE) 1*32kB (M) 1*64kB (M) 2*128kB (UE) 1*256kB (E) 0*512kB 3*1024kB (UME) 0*2048kB 0*4096kB = 3760kB
[    3.106371] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[    3.110846] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[    3.115169] 561 total pagecache pages
[    3.117051] 0 pages in swap cache
[    3.118764] Swap cache stats: add 0, delete 0, find 0/0
[    3.121414] Free swap  = 0kB
[    3.122958] Total swap = 0kB
[    3.124468] 262013 pages RAM
[    3.125962] 0 pages HighMem/MovableOnly
[    3.127932] 9319 pages reserved
[    3.129597] 0 pages cma reserved
[    3.131258] 0 pages hwpoisoned
[    3.132836] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[    3.137232] [   98]     0    98   279607   244400     489       5        0             0 init
[    3.141664] Out of memory: Kill process 98 (init) score 940 or sacrifice child
[    3.145346] Killed process 98 (init) total-vm:1118428kB, anon-rss:977464kB, file-rss:136kB, shmem-rss:0kB
[    3.416105] init[1]: segfault at 0 ip           (null) sp 00007ffd484cf5f0 error 14 in init[400000+1000]
[    3.439074] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    3.439074]
[    3.450193] Kernel Offset: disabled
[    3.456259] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    3.456259]
----------

Guessing from commit 1e99bad0d9c12a4a ("oom: kill all threads sharing oom
killed task's mm"), the

	if (same_thread_group(p, victim))
		continue;

test is for avoiding "Kill process %d (%s) sharing same memory\n" on the
victim's mm, but that printk() was already removed. Thus, I think we have
nothing to do (or can remove it if we don't mind sending SIGKILL twice).
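
To make the effect of the proposed change easier to see, here is a small
userspace model of the loop over tasks sharing the victim's mm. It is a
sketch only: struct fake_task, kill_sharers() and the "killed" field are
hypothetical, FAKE_OOM_SCORE_ADJ_MIN merely stands in for
OOM_SCORE_ADJ_MIN, and setting "killed" replaces the real
do_send_sig_info(SIGKILL, ...) call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's OOM_SCORE_ADJ_MIN. */
#define FAKE_OOM_SCORE_ADJ_MIN (-1000)

struct fake_task {
	bool shares_victim_mm;	/* uses the same mm as the victim */
	bool same_thread_group;	/* models same_thread_group(p, victim) */
	bool is_kthread;	/* models p->flags & PF_KTHREAD */
	bool is_global_init;	/* models is_global_init(p) */
	int oom_score_adj;
	bool killed;		/* set where the kernel would send SIGKILL */
};

/*
 * Walk all tasks and "kill" those sharing the victim's mm.
 * Returns false if any sharer cannot be killed, in which case the
 * mm must not be handed to the reaper.
 */
static bool kill_sharers(struct fake_task *tasks, size_t n)
{
	bool can_oom_reap = true;

	assert(tasks != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		struct fake_task *p = &tasks[i];

		if (!p->shares_victim_mm || p->same_thread_group)
			continue;
		if (p->is_kthread || p->is_global_init ||
		    p->oom_score_adj == FAKE_OOM_SCORE_ADJ_MIN) {
			/* unkillable sharer: the memory may still be in use */
			can_oom_reap = false;
			continue;
		}
		p->killed = true;
	}
	return can_oom_reap;
}
```

With the global init among the sharers the function returns false, which
corresponds to not waking the reaper for that mm.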

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-01-07 11:23     ` Tetsuo Handa
@ 2016-01-07 12:30       ` Michal Hocko
  -1 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2016-01-07 12:30 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, mgorman, rientjes, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel

On Thu 07-01-16 20:23:04, Tetsuo Handa wrote:
[...]
> According to commit a2b829d95958da20 ("mm/oom_kill.c: avoid attempting
> to kill init sharing same memory"), below patch is needed for avoid
> killing init process with SIGSEGV.
> 
> ----------
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9548dce..9832f3f 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -784,9 +784,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>   			continue;
>   		if (same_thread_group(p, victim))
>   			continue;
> -		if (is_global_init(p))
> -			continue;
> -		if (unlikely(p->flags & PF_KTHREAD) ||
> +		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
>   		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
>   			/*
>   			 * We cannot use oom_reaper for the mm shared by this
[...]
> [    3.132836] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
> [    3.137232] [   98]     0    98   279607   244400     489       5        0             0 init
> [    3.141664] Out of memory: Kill process 98 (init) score 940 or sacrifice child
> [    3.145346] Killed process 98 (init) total-vm:1118428kB, anon-rss:977464kB, file-rss:136kB, shmem-rss:0kB
> [    3.416105] init[1]: segfault at 0 ip           (null) sp 00007ffd484cf5f0 error 14 in init[400000+1000]
> [    3.439074] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> [    3.439074]
> [    3.450193] Kernel Offset: disabled
> [    3.456259] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> [    3.456259]

Ouch. You are right. The reaper will tear down the shared mm and the
global init will blow up. Very well spotted! The system will blow up
later, I would guess, because killing the victim wouldn't release a lot
of memory which will be pinned by the global init. So a panic sounds
inevitable. The scenario is really insane but what you are proposing is
correct.

Updated patch below.
--- 
From 71c6f4135fe4a8d448d63d4904ba514787dea008 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 23 Nov 2015 18:20:57 +0100
Subject: [PATCH] mm, oom: introduce oom reaper

This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
independently brought up by Oleg Nesterov.

The OOM killer currently kills only a single task in the hope that the
task will terminate in a reasonable time and free up its memory.  Such
a task (oom victim) gets access to memory reserves via mark_oom_victim
to allow forward progress should there be a need for additional memory
during the exit path.

It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above and
the OOM victim might take an unbounded amount of time to exit because it
might be blocked in the uninterruptible state waiting for an event
(e.g. lock) which is blocked by another task looping in the page
allocator.

This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped out memory
owned by the oom victim under an assumption that such a memory won't
be needed when its owner is killed and kicked from the userspace anyway.
There is one notable exception to this, though, if the OOM victim was
in the process of coredumping the result would be incomplete. This is
considered a reasonable constraint because the overall system health is
more important than debuggability of a particular application.

A kernel thread has been chosen because we need a reliable way of
invocation so workqueue context is not appropriate because all the
workers might be busy (e.g. allocating memory). Kswapd which sounds
like another good fit is not appropriate as well because it might get
blocked on locks during reclaim as well.

oom_reaper has to take mmap_sem on the target task for reading so the
solution is not 100% because the semaphore might be held or blocked for
write but the probability is reduced considerably wrt. basically any
lock blocking forward progress as described above. To avoid blocking
on the lock without any forward progress we use only a trylock and
retry 10 times with a short sleep in between.
Users of mmap_sem which need it for write should be carefully reviewed
to use _killable waiting as much as possible and reduce allocations
requests done with the lock held to absolute minimum to reduce the risk
even further.
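
The bounded trylock-and-retry scheme described above can be sketched in
userspace as follows. This is an illustrative model, not the kernel
code: struct fake_mm, try_reap() and reap_with_retries() are made-up
names, the "busy_for" counter simulates a contended mmap_sem, and
incrementing "sleeps" stands in for schedule_timeout_idle(HZ/10).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_REAP_ATTEMPTS 10

/* Hypothetical model of an mm whose lock is contended for a while. */
struct fake_mm {
	int busy_for;	/* number of trylock attempts that will still fail */
	int sleeps;	/* how often the caller backed off */
	bool reaped;
};

/* Models __oom_reap_vmas(): false means the trylock failed. */
static bool try_reap(struct fake_mm *mm)
{
	if (mm->busy_for > 0) {
		mm->busy_for--;
		return false;
	}
	mm->reaped = true;
	return true;
}

/* Models oom_reap_vmas(): give up after a fixed number of attempts. */
static bool reap_with_retries(struct fake_mm *mm)
{
	int attempts = 0;

	assert(mm != NULL);
	while (attempts++ < MAX_REAP_ATTEMPTS) {
		if (try_reap(mm))
			return true;
		mm->sleeps++;	/* the kernel sleeps briefly here */
	}
	return false;
}
```

The point of the bound is that the reaper never waits on the lock
itself; if the lock stays contended for all ten attempts it simply gives
up on that mm.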

The API between oom killer and oom reaper is quite trivial. wake_oom_reaper
updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition
and oom_reaper clears this atomically once it is done with the work. This
means that only a single mm_struct can be reaped at a time. As the
operation is potentially disruptive we are trying to limit it to the
necessary minimum and the reaper blocks any updates while it operates on
an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a
race is detected by atomic_inc_not_zero(mm_users).
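
The single-slot handoff described above can be illustrated with C11
atomics in userspace. This is a sketch under assumptions rather than the
kernel implementation: struct fake_mm, slot_to_reap, queue_for_reaper()
and reaper_run_once() are hypothetical names, atomic_compare_exchange_strong
plays the role of cmpxchg, and the wake-up, the actual reaping and the
mm_users race check are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_mm {
	atomic_int mm_count;	/* pins the structure, like mm_count */
};

/* Models mm_to_reap: NULL means the reaper is idle. */
static _Atomic(struct fake_mm *) slot_to_reap;

/* Models wake_oom_reaper(): returns true if mm was queued. */
static bool queue_for_reaper(struct fake_mm *mm)
{
	struct fake_mm *expected = NULL;

	assert(mm != NULL);
	atomic_fetch_add(&mm->mm_count, 1);	/* pin before publishing */

	/* only a NULL -> mm transition is allowed */
	if (atomic_compare_exchange_strong(&slot_to_reap, &expected, mm))
		return true;	/* the kernel would wake the reaper here */

	atomic_fetch_sub(&mm->mm_count, 1);	/* slot busy: drop the pin */
	return false;
}

/* Models one iteration of the reaper loop; returns the mm it handled. */
static struct fake_mm *reaper_run_once(void)
{
	struct fake_mm *mm = atomic_load(&slot_to_reap);

	if (!mm)
		return NULL;
	/* reaping of the address space would happen here */
	atomic_fetch_sub(&mm->mm_count, 1);	/* drop the pin taken at queue time */
	atomic_store(&slot_to_reap, NULL);	/* reopen the slot */
	return mm;
}
```

A second queue_for_reaper() call made while the slot is occupied fails
and leaves the caller's reference count unchanged, which mirrors the
"only a single mm_struct at a time" behaviour.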

Changes since v3
- many style/compile fixups by Andrew
- unmap_mapping_range_tree needs full initialization of zap_details
  to prevent from missing unmaps and follow up BUG_ON during truncate
  resp. misaccounting - Kirill/Andrew
- exclude mlocked pages because they need an explicit munlock by Kirill
- use subsys_initcall instead of module_init - Paul Gortmaker
- do not tear down mm if it is shared with the global init because this
  could lead to SEGV and panic - Tetsuo
Changes since v2
- fix mm_count reference leak reported by Tetsuo
- make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
- use wait_event_freezable rather than open coded wait loop - suggested
  by Tetsuo
Changes since v1
- fix the screwed up detail->check_swap_entries - Johannes
- do not use kthread_should_stop because that would need a cleanup
  and we do not have anybody to stop us - Tetsuo
- move wake_oom_reaper to oom_kill_process because we have to wait
  for all tasks sharing the same mm to get killed - Tetsuo
- do not reap mm structs which are shared with unkillable tasks - Tetsuo

Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h |   2 +
 mm/internal.h      |   5 ++
 mm/memory.c        |  17 +++---
 mm/oom_kill.c      | 159 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 170 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 25cdec395f2c..d1ce03569942 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1061,6 +1061,8 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
+	bool ignore_dirty;			/* Ignore dirty pages */
+	bool check_swap_entries;		/* Check also swap entries */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/internal.h b/mm/internal.h
index 4ae7b7c7462b..9006ce1960ff 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -41,6 +41,11 @@ extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
+void unmap_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end,
+			     struct zap_details *details);
+
 static inline void set_page_count(struct page *page, int v)
 {
 	atomic_set(&page->_count, v);
diff --git a/mm/memory.c b/mm/memory.c
index f5b8e8c9f4c3..f60c6d6aa633 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1104,6 +1104,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			if (!PageAnon(page)) {
 				if (pte_dirty(ptent)) {
+					/*
+					 * oom_reaper cannot tear down dirty
+					 * pages
+					 */
+					if (unlikely(details && details->ignore_dirty))
+						continue;
 					force_flush = 1;
 					set_page_dirty(page);
 				}
@@ -1122,8 +1128,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			}
 			continue;
 		}
-		/* If details->check_mapping, we leave swap entries. */
-		if (unlikely(details))
+		/* only check swap_entries if explicitly asked for in details */
+		if (unlikely(details && !details->check_swap_entries))
 			continue;
 
 		entry = pte_to_swp_entry(ptent);
@@ -1228,7 +1234,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 	return addr;
 }
 
-static void unmap_page_range(struct mmu_gather *tlb,
+void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
 			     struct zap_details *details)
@@ -1236,9 +1242,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
 	pgd_t *pgd;
 	unsigned long next;
 
-	if (details && !details->check_mapping)
-		details = NULL;
-
 	BUG_ON(addr >= end);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
@@ -2393,7 +2396,7 @@ static inline void unmap_mapping_range_tree(struct rb_root *root,
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows)
 {
-	struct zap_details details;
+	struct zap_details details = { };
 	pgoff_t hba = holebegin >> PAGE_SHIFT;
 	pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dc490c06941b..95ce1602744b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,6 +35,11 @@
 #include <linux/freezer.h>
 #include <linux/ftrace.h>
 #include <linux/ratelimit.h>
+#include <linux/kthread.h>
+#include <linux/init.h>
+
+#include <asm/tlb.h>
+#include "internal.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/oom.h>
@@ -408,6 +413,141 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
 bool oom_killer_disabled __read_mostly;
 
+#ifdef CONFIG_MMU
+/*
+ * OOM Reaper kernel thread which tries to reap the memory used by the OOM
+ * victim (if that is possible) to help the OOM killer to move on.
+ */
+static struct task_struct *oom_reaper_th;
+static struct mm_struct *mm_to_reap;
+static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
+
+static bool __oom_reap_vmas(struct mm_struct *mm)
+{
+	struct mmu_gather tlb;
+	struct vm_area_struct *vma;
+	struct zap_details details = {.check_swap_entries = true,
+				      .ignore_dirty = true};
+	bool ret = true;
+
+	/* We might have raced with exit path */
+	if (!atomic_inc_not_zero(&mm->mm_users))
+		return true;
+
+	if (!down_read_trylock(&mm->mmap_sem)) {
+		ret = false;
+		goto out;
+	}
+
+	tlb_gather_mmu(&tlb, mm, 0, -1);
+	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+		if (is_vm_hugetlb_page(vma))
+			continue;
+
+		/*
+		 * mlocked VMAs require explicit munlocking before unmap.
+		 * Let's keep it simple here and skip such VMAs.
+		 */
+		if (vma->vm_flags & VM_LOCKED)
+			continue;
+
+		/*
+		 * Only anonymous pages have a good chance to be dropped
+		 * without additional steps which we cannot afford as we
+		 * are OOM already.
+		 *
+		 * We do not even care about fs backed pages because all
+		 * which are reclaimable have already been reclaimed and
+		 * we do not want to block exit_mmap by keeping mm ref
+		 * count elevated without a good reason.
+		 */
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
+					 &details);
+	}
+	tlb_finish_mmu(&tlb, 0, -1);
+	up_read(&mm->mmap_sem);
+out:
+	mmput(mm);
+	return ret;
+}
+
+static void oom_reap_vmas(struct mm_struct *mm)
+{
+	int attempts = 0;
+
+	/* Retry the down_read_trylock(mmap_sem) a few times */
+	while (attempts++ < 10 && !__oom_reap_vmas(mm))
+		schedule_timeout_idle(HZ/10);
+
+	/* Drop a reference taken by wake_oom_reaper */
+	mmdrop(mm);
+}
+
+static int oom_reaper(void *unused)
+{
+	while (true) {
+		struct mm_struct *mm;
+
+		wait_event_freezable(oom_reaper_wait,
+				     (mm = READ_ONCE(mm_to_reap)));
+		oom_reap_vmas(mm);
+		WRITE_ONCE(mm_to_reap, NULL);
+	}
+
+	return 0;
+}
+
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+	struct mm_struct *old_mm;
+
+	if (!oom_reaper_th)
+		return;
+
+	/*
+	 * Pin the given mm. Use mm_count instead of mm_users because
+	 * we do not want to delay the address space tear down.
+	 */
+	atomic_inc(&mm->mm_count);
+
+	/*
+	 * Make sure that only a single mm is ever queued for the reaper
+	 * because multiple are not necessary and the operation might be
+	 * disruptive so better reduce it to the bare minimum.
+	 */
+	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
+	if (!old_mm)
+		wake_up(&oom_reaper_wait);
+	else
+		mmdrop(mm);
+}
+
+static int __init oom_init(void)
+{
+	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+	if (IS_ERR(oom_reaper_th)) {
+		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
+				PTR_ERR(oom_reaper_th));
+		oom_reaper_th = NULL;
+	} else {
+		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+
+		/*
+		 * Make sure our oom reaper thread will get scheduled
+		 * ASAP and that it won't get preempted by malicious userspace.
+		 */
+		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
+	}
+	return 0;
+}
+subsys_initcall(oom_init)
+#else
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+}
+#endif
+
 /**
  * mark_oom_victim - mark the given task as OOM victim
  * @tsk: task to mark
@@ -517,6 +657,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
+	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -607,17 +748,23 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD))
-			continue;
-		if (is_global_init(p))
-			continue;
-		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+			/*
+			 * We cannot use oom_reaper for the mm shared by this
+			 * process because it wouldn't get killed and so the
+			 * memory might be still used.
+			 */
+			can_oom_reap = false;
 			continue;
-
+		}
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
+	if (can_oom_reap)
+		wake_oom_reaper(mm);
+
 	mmdrop(mm);
 	put_task_struct(victim);
 }
-- 
2.6.4

-- 
Michal Hocko
SUSE Labs
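
For readers following the queueing logic in wake_oom_reaper(), the
single-slot handoff can be sketched in userspace C11. This is only an
illustration and not the kernel code: struct victim, slot_to_reap,
queue_for_reaper() and reaper_take() are made-up names, and C11 atomics
stand in for the kernel's cmpxchg(), READ_ONCE() and WRITE_ONCE(). The
producer pins the object first, tries to claim the empty slot, and drops
its pin again if the slot is already occupied.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the pinned object (an mm in the patch). */
struct victim {
	atomic_int refcount;
};

/* The one and only queue slot, playing the role of mm_to_reap. */
static _Atomic(struct victim *) slot_to_reap;

/* Returns 1 if the victim was queued, 0 if the slot was already taken. */
static int queue_for_reaper(struct victim *v)
{
	struct victim *expected = NULL;

	assert(v != NULL);

	/* Pin before publishing so the consumer never sees an unpinned object. */
	atomic_fetch_add(&v->refcount, 1);

	/* Claim the slot only if it is empty (cmpxchg(&slot, NULL, v)). */
	if (atomic_compare_exchange_strong(&slot_to_reap, &expected, v))
		return 1;	/* the kernel code wakes the reaper thread here */

	/* Somebody else is queued already: undo our pin and give up. */
	atomic_fetch_sub(&v->refcount, 1);
	return 0;
}

/* Consumer side: process the queued victim, unpin it, then free the slot. */
static struct victim *reaper_take(void)
{
	struct victim *v = atomic_load(&slot_to_reap);

	if (v) {
		/* ... the actual reaping would happen here ... */
		atomic_fetch_sub(&v->refcount, 1);	/* drop the producer's pin */
		atomic_store(&slot_to_reap, NULL);	/* slot is free again */
	}
	return v;
}
```

A second victim arriving while the slot is busy is simply not queued,
which matches the "only a single mm is ever queued" comment in the patch.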

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
@ 2016-01-07 12:30       ` Michal Hocko
  0 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2016-01-07 12:30 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, mgorman, rientjes, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel

On Thu 07-01-16 20:23:04, Tetsuo Handa wrote:
[...]
> According to commit a2b829d95958da20 ("mm/oom_kill.c: avoid attempting
> to kill init sharing same memory"), the patch below is needed to avoid
> killing the init process with SIGSEGV.
> 
> ----------
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9548dce..9832f3f 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -784,9 +784,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>   			continue;
>   		if (same_thread_group(p, victim))
>   			continue;
> -		if (is_global_init(p))
> -			continue;
> -		if (unlikely(p->flags & PF_KTHREAD) ||
> +		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
>   		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
>   			/*
>   			 * We cannot use oom_reaper for the mm shared by this
[...]
> [    3.132836] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
> [    3.137232] [   98]     0    98   279607   244400     489       5        0             0 init
> [    3.141664] Out of memory: Kill process 98 (init) score 940 or sacrifice child
> [    3.145346] Killed process 98 (init) total-vm:1118428kB, anon-rss:977464kB, file-rss:136kB, shmem-rss:0kB
> [    3.416105] init[1]: segfault at 0 ip           (null) sp 00007ffd484cf5f0 error 14 in init[400000+1000]
> [    3.439074] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> [    3.439074]
> [    3.450193] Kernel Offset: disabled
> [    3.456259] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> [    3.456259]

Ouch. You are right. The reaper will tear down the shared mm and the
global init will blow up. Very well spotted! The system will blow up
later, I would guess, because killing the victim wouldn't release a lot
of memory which will be pinned by the global init. So a panic sounds
inevitable. The scenario is really insane but what you are proposing is
correct.

Updated patch below.
--- 
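
The rule discussed above (do not reap an mm that is shared with a
process which survives the kill) can be modelled in a few lines of
userspace C. This is a simplified sketch under stated assumptions:
struct proc, mm_id, survives_kill() and can_reap() are hypothetical
names invented for illustration, and only the three conditions named in
the patch (kernel thread, global init, OOM_SCORE_ADJ_MIN) are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define OOM_SCORE_ADJ_MIN (-1000)

/* Hypothetical, simplified process descriptor for illustration only. */
struct proc {
	int mm_id;		/* equal mm_id means a shared address space */
	bool is_kthread;
	bool is_global_init;
	int oom_score_adj;
};

/* A process that matches any of these is skipped when SIGKILL is sent. */
static bool survives_kill(const struct proc *p)
{
	return p->is_kthread || p->is_global_init ||
	       p->oom_score_adj == OOM_SCORE_ADJ_MIN;
}

/*
 * Returns true only if every process sharing the victim's address space
 * will be killed as well, so nobody is left to touch the reaped memory.
 */
static bool can_reap(const struct proc *procs, size_t n, int victim_mm)
{
	assert(procs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (procs[i].mm_id != victim_mm)
			continue;
		if (survives_kill(&procs[i]))
			return false;
	}
	return true;
}
```

With the global init sharing the victim's mm, can_reap() returns false,
which corresponds to the case from the report where the reaper must not
be woken.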

* [PATCH 3/2] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-01-06 15:42 ` Michal Hocko
@ 2016-01-11 12:42   ` Michal Hocko
  -1 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2016-01-11 12:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Mel Gorman, Tetsuo Handa, David Rientjes,
	Linus Torvalds, Oleg Nesterov, Hugh Dickins, Andrea Argangeli,
	Rik van Riel, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

When oom_reaper manages to unmap all the eligible vmas there shouldn't
be much freeable memory left held by the oom victim, so it
makes sense to clear the TIF_MEMDIE flag for the victim and allow the
OOM killer to select another task if necessary.

The lack of TIF_MEMDIE also means that the victim cannot access memory
reserves anymore, but that shouldn't be a problem because it would get
the access again if it needs to allocate and hits the OOM killer again,
due to the fatal_signal_pending or PF_EXITING check. We can safely
hide the task from the OOM killer because it is clearly not a good
candidate anymore as everything reclaimable has been torn down already.

This patch allows capping the time an OOM victim can keep TIF_MEMDIE,
and thus hold off further global OOM killer actions, provided the oom
reaper is able to take mmap_sem for the associated mm struct. This is
not guaranteed now, but further steps should make sure that waiting on
mmap_sem for write is killable, which will help to reduce such lock
contention. This is not done by this patch.
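
As a rough illustration of the cap mentioned above: the reaper makes at
most 10 attempts and sleeps HZ/10 jiffies after each failed one, so it
gives up after about one second. The userspace sketch below models only
that bounded retry loop. All names in it (try_reap_fn,
reap_with_retries(), struct fake_lock, try_fake()) are hypothetical, and
the sleep is left out. Note that in the patch TIF_MEMDIE is cleared only
when an attempt succeeds, so if every attempt fails the flag stays set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_REAP_ATTEMPTS 10	/* same bound as the retry loop in the patch */

/* Hypothetical callback: returns true once the lock was taken and work done. */
typedef bool (*try_reap_fn)(void *ctx);

/*
 * Calls try_reap() until it succeeds or the attempt budget is used up.
 * Returns the number of attempts made; *done tells whether one succeeded.
 */
static int reap_with_retries(try_reap_fn try_reap, void *ctx, bool *done)
{
	int attempts = 0;

	assert(try_reap != NULL && done != NULL);

	*done = false;
	while (attempts < MAX_REAP_ATTEMPTS) {
		attempts++;
		if (try_reap(ctx)) {
			*done = true;
			break;
		}
		/* the kernel code sleeps here: schedule_timeout_idle(HZ/10) */
	}
	return attempts;
}

/* Test helper: a lock that stays contended for a fixed number of tries. */
struct fake_lock {
	int busy_for;
};

static bool try_fake(void *ctx)
{
	struct fake_lock *l = ctx;

	if (l->busy_for > 0) {
		l->busy_for--;
		return false;	/* trylock failed, caller should retry */
	}
	return true;
}
```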

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
this has passed my basic testing but it definitely needs a deeper
review.  I have tested it by flooding the system with OOMs and delaying
exit_mm for TIF_MEMDIE tasks so that the oom reaper wins the race. I made
sure to delay after the mm was set to NULL to make sure that the oom reaper
sees a NULL mm from time to time to exercise this case as well. This
happened in roughly half of the instances.

 include/linux/oom.h |  2 +-
 kernel/exit.c       |  2 +-
 mm/oom_kill.c       | 72 ++++++++++++++++++++++++++++++++++-------------------
 3 files changed, 49 insertions(+), 27 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 03e6257321f0..45993b840ed6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -91,7 +91,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 
 extern bool out_of_memory(struct oom_control *oc);
 
-extern void exit_oom_victim(void);
+extern void exit_oom_victim(struct task_struct *tsk);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/kernel/exit.c b/kernel/exit.c
index ea95ee1b5ef7..4c114ba8a825 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -436,7 +436,7 @@ static void exit_mm(struct task_struct *tsk)
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim();
+		exit_oom_victim(tsk);
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 45e51ad2f7cf..abefeeb42504 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -419,21 +419,37 @@ bool oom_killer_disabled __read_mostly;
  * victim (if that is possible) to help the OOM killer to move on.
  */
 static struct task_struct *oom_reaper_th;
-static struct mm_struct *mm_to_reap;
+static struct task_struct *task_to_reap;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 
-static bool __oom_reap_vmas(struct mm_struct *mm)
+static bool __oom_reap_task(struct task_struct *tsk)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	struct task_struct *p;
 	struct zap_details details = {.check_swap_entries = true,
 				      .ignore_dirty = true};
 	bool ret = true;
 
-	/* We might have raced with exit path */
-	if (!atomic_inc_not_zero(&mm->mm_users))
+	/*
+	 * Make sure we find the associated mm_struct even when the particular
+	 * thread has already terminated and cleared its mm.
+	 * We might have raced with the exit path so consider our work done if
+	 * there is no mm.
+	 */
+	p = find_lock_task_mm(tsk);
+	if (!p)
 		return true;
 
+	mm = p->mm;
+	if (!atomic_inc_not_zero(&mm->mm_users)) {
+		task_unlock(p);
+		return true;
+	}
+
+	task_unlock(p);
+
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
 		goto out;
@@ -463,60 +479,66 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
 	up_read(&mm->mmap_sem);
+
+	/*
+	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
+	 * reasonably reclaimable memory anymore. OOM killer can continue
+	 * by selecting other victim if unmapping hasn't led to any
+	 * improvements. This also means that selecting this task doesn't
+	 * make any sense.
+	 */
+	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+	exit_oom_victim(tsk);
 out:
 	mmput(mm);
 	return ret;
 }
 
-static void oom_reap_vmas(struct mm_struct *mm)
+static void oom_reap_task(struct task_struct *tsk)
 {
 	int attempts = 0;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < 10 && !__oom_reap_vmas(mm))
+	while (attempts++ < 10 && !__oom_reap_task(tsk))
 		schedule_timeout_idle(HZ/10);
 
 	/* Drop a reference taken by wake_oom_reaper */
-	mmdrop(mm);
+	put_task_struct(tsk);
 }
 
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct mm_struct *mm;
+		struct task_struct *tsk;
 
 		wait_event_freezable(oom_reaper_wait,
-				     (mm = READ_ONCE(mm_to_reap)));
-		oom_reap_vmas(mm);
-		WRITE_ONCE(mm_to_reap, NULL);
+				     (tsk = READ_ONCE(task_to_reap)));
+		oom_reap_task(tsk);
+		WRITE_ONCE(task_to_reap, NULL);
 	}
 
 	return 0;
 }
 
-static void wake_oom_reaper(struct mm_struct *mm)
+static void wake_oom_reaper(struct task_struct *tsk)
 {
-	struct mm_struct *old_mm;
+	struct task_struct *old_tsk;
 
 	if (!oom_reaper_th)
 		return;
 
-	/*
-	 * Pin the given mm. Use mm_count instead of mm_users because
-	 * we do not want to delay the address space tear down.
-	 */
-	atomic_inc(&mm->mm_count);
+	get_task_struct(tsk);
 
 	/*
 	 * Make sure that only a single mm is ever queued for the reaper
 	 * because multiple are not necessary and the operation might be
 	 * disruptive so better reduce it to the bare minimum.
 	 */
-	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
-	if (!old_mm)
+	old_tsk = cmpxchg(&task_to_reap, NULL, tsk);
+	if (!old_tsk)
 		wake_up(&oom_reaper_wait);
 	else
-		mmdrop(mm);
+		put_task_struct(tsk);
 }
 
 static int __init oom_init(void)
@@ -539,7 +561,7 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static void wake_oom_reaper(struct mm_struct *mm)
+static void wake_oom_reaper(struct task_struct *tsk)
 {
 }
 #endif
@@ -570,9 +592,9 @@ void mark_oom_victim(struct task_struct *tsk)
 /**
  * exit_oom_victim - note the exit of an OOM victim
  */
-void exit_oom_victim(void)
+void exit_oom_victim(struct task_struct *tsk)
 {
-	clear_thread_flag(TIF_MEMDIE);
+	clear_tsk_thread_flag(tsk, TIF_MEMDIE);
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
@@ -759,7 +781,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	rcu_read_unlock();
 
 	if (can_oom_reap)
-		wake_oom_reaper(mm);
+		wake_oom_reaper(victim);
 
 	mmdrop(mm);
 	put_task_struct(victim);
-- 
2.6.4
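
The "raced with the exit path" check in __oom_reap_task() relies on
taking a reference only while the count is still non-zero. A minimal
userspace analogue of that pattern, written with C11 atomics, might look
like the sketch below. The names ref_get_unless_zero() and ref_put() are
invented for this example; the kernel uses atomic_inc_not_zero() on
mm->mm_users and mmput() for the corresponding steps.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference unless the count has already dropped to zero. */
static bool ref_get_unless_zero(atomic_int *users)
{
	int old = atomic_load(users);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(users, &old, old + 1))
			return true;
	}
	return false;	/* count hit zero: teardown has already begun */
}

/* Drop a reference; returns true when the caller dropped the last one. */
static bool ref_put(atomic_int *users)
{
	int old = atomic_fetch_sub(users, 1);

	assert(old > 0);	/* dropping below zero would be a bug */
	return old == 1;
}
```

Once the last user is gone, ref_get_unless_zero() keeps failing, which
is the case where the reaper treats its work as done.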

* Re: [PATCH 3/2] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-01-11 12:42   ` Michal Hocko
@ 2016-01-11 16:52     ` Johannes Weiner
  -1 siblings, 0 replies; 56+ messages in thread
From: Johannes Weiner @ 2016-01-11 16:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, David Rientjes,
	Linus Torvalds, Oleg Nesterov, Hugh Dickins, Andrea Argangeli,
	Rik van Riel, linux-mm, LKML, Michal Hocko

This patch already looks good to me. I just have one question:

On Mon, Jan 11, 2016 at 01:42:00PM +0100, Michal Hocko wrote:
> @@ -463,60 +479,66 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
>  	}
>  	tlb_finish_mmu(&tlb, 0, -1);
>  	up_read(&mm->mmap_sem);
> +
> +	/*
> +	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
> +	 * reasonably reclaimable memory anymore. OOM killer can continue
> +	 * by selecting other victim if unmapping hasn't led to any
> +	 * improvements. This also means that selecting this task doesn't
> +	 * make any sense.
> +	 */
> +	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
> +	exit_oom_victim(tsk);

When the OOM killer scans tasks and encounters a PF_EXITING one, it
force-selects that one regardless of the score. Is there a possibility
that the task might hang after it has set PF_EXITING? In that case the
OOM killer should be able to move on to the next task.

Frankly, I don't even know why we check for exiting tasks in the OOM
killer. We've tried direct reclaim at least 15 times by the time we
decide the system is OOM, there was plenty of time to exit and free
memory; and a task might exit voluntarily right after we issue a kill.
This is testing pure noise.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b8a4210..7dfb351 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -305,9 +305,6 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	if (oom_task_origin(task))
 		return OOM_SCAN_SELECT;
 
-	if (task_will_free_mem(task) && !is_sysrq_oom(oc))
-		return OOM_SCAN_ABORT;
-
 	return OOM_SCAN_OK;
 }
 

* Re: [PATCH 3/2] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-01-11 16:52     ` Johannes Weiner
@ 2016-01-11 17:46       ` Michal Hocko
  -1 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2016-01-11 17:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, David Rientjes,
	Linus Torvalds, Oleg Nesterov, Hugh Dickins, Andrea Argangeli,
	Rik van Riel, linux-mm, LKML

On Mon 11-01-16 11:52:14, Johannes Weiner wrote:
> This patch already looks good to me. I just have one question:

Thank you for the review!

> On Mon, Jan 11, 2016 at 01:42:00PM +0100, Michal Hocko wrote:
> > @@ -463,60 +479,66 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
> >  	}
> >  	tlb_finish_mmu(&tlb, 0, -1);
> >  	up_read(&mm->mmap_sem);
> > +
> > +	/*
> > +	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
> > +	 * reasonably reclaimable memory anymore. OOM killer can continue
> > +	 * by selecting other victim if unmapping hasn't led to any
> > +	 * improvements. This also means that selecting this task doesn't
> > +	 * make any sense.
> > +	 */
> > +	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
> > +	exit_oom_victim(tsk);
> 
> When the OOM killer scans tasks and encounters a PF_EXITING one, it
> force-selects that one regardless of the score.

True. For some reason I thought that oom_unkillable_task would skip
OOM_SCORE_ADJ_MIN tasks, as they should be hidden from the OOM killer
by definition. Instead we are handling them in oom_badness. Maybe we
should move that check, as it would better reflect the semantics.
dump_tasks wouldn't list the task anymore, but should it in the first
place? The task is clearly unkillable, so why should it add noise to
the logs?
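
To make the ordering problem concrete, here is a heavily simplified
userspace model of the scan. It is a sketch under stated assumptions and
not the real oom_scan_process_thread() or oom_badness(): struct task,
badness() and select_victim() are hypothetical, the score is just the
RSS, and the "exiting" flag stands in for task_will_free_mem(). It shows
that a task with OOM_SCORE_ADJ_MIN still stops the scan if the exiting
check runs before the score is consulted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define OOM_SCORE_ADJ_MIN (-1000)

#define NO_VICTIM	(-1)
#define SCAN_ABORTED	(-2)

/* Hypothetical, heavily simplified task descriptor. */
struct task {
	bool exiting;		/* stands in for task_will_free_mem() */
	int oom_score_adj;
	long rss_pages;
};

/* Simplified score: hidden tasks get 0, others are ranked by RSS. */
static long badness(const struct task *t)
{
	if (t->oom_score_adj == OOM_SCORE_ADJ_MIN)
		return 0;
	return t->rss_pages;
}

/* Returns the index of the chosen task, NO_VICTIM, or SCAN_ABORTED. */
static int select_victim(const struct task *tasks, size_t n)
{
	long best_points = 0;
	int best = NO_VICTIM;

	assert(tasks != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		/* The exiting check runs before the score is even computed. */
		if (tasks[i].exiting)
			return SCAN_ABORTED;

		long points = badness(&tasks[i]);
		if (points > best_points) {
			best_points = points;
			best = (int)i;
		}
	}
	return best;
}
```

Moving the OOM_SCORE_ADJ_MIN test ahead of the exiting check, or
dropping the exiting check as proposed in the quoted diff, would both
let the scan move on to the next task in this model.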

> Is there a possibility
> that the task might hang after it has set PF_EXITING? In that case the
> OOM killer should be able to move on to the next task.

I guess it can, because we are taking some locks after exit_signals, but I
haven't checked very closely.

> Frankly, I don't even know why we check for exiting tasks in the OOM
> killer. We've tried direct reclaim at least 15 times by the time we
> decide the system is OOM, there was plenty of time to exit and free
> memory; and a task might exit voluntarily right after we issue a kill.
> This is testing pure noise.

I guess the idea was to avoid killing another task if some task
is exiting and so should release its memory shortly. But as you
say, this is racy and the oom scanner doesn't know how long the
target task has been in this state without any change. So maybe this is
indeed no longer needed and the task_will_free_mem check in out_of_memory
is sufficient.
David?

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index b8a4210..7dfb351 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -305,9 +305,6 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  	if (oom_task_origin(task))
>  		return OOM_SCAN_SELECT;
>  
> -	if (task_will_free_mem(task) && !is_sysrq_oom(oc))
> -		return OOM_SCAN_ABORT;
> -
>  	return OOM_SCAN_OK;
>  }
>  

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH 3/2] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
@ 2016-01-11 17:46       ` Michal Hocko
  0 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2016-01-11 17:46 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, David Rientjes,
	Linus Torvalds, Oleg Nesterov, Hugh Dickins, Andrea Argangeli,
	Rik van Riel, linux-mm, LKML

On Mon 11-01-16 11:52:14, Johannes Weiner wrote:
> This patch looks already good to me. I just have one question:

Thank you for the review!

> On Mon, Jan 11, 2016 at 01:42:00PM +0100, Michal Hocko wrote:
> > @@ -463,60 +479,66 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
> >  	}
> >  	tlb_finish_mmu(&tlb, 0, -1);
> >  	up_read(&mm->mmap_sem);
> > +
> > +	/*
> > +	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
> > +	 * reasonably reclaimable memory anymore. OOM killer can continue
> > +	 * by selecting other victim if unmapping hasn't led to any
> > +	 * improvements. This also means that selecting this task doesn't
> > +	 * make any sense.
> > +	 */
> > +	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
> > +	exit_oom_victim(tsk);
> 
> When the OOM killer scans tasks and encounters a PF_EXITING one, it
> force-selects that one regardless of the score.

True. For some reason I thought that oom_unkillable_task would skip
OOM_SCORE_ADJ_MIN task as they should be hidden from the OOM killer
by definition. Instead we are handling them in oom_badness. Maybe we
should move that check as it would better reflect the semantic.
dump_tasks wouldn't list the task anymore but should it in the first
place? The task is clearly unkillable so why it should add the noise to
the logs.

> Is there a possibility
> that the task might hang after it has set PF_EXITING? In that case the
> OOM killer should be able to move on to the next task.

I guess we can because we are taking some locks after exit_signals but I
haven't checked very closely.

> Frankly, I don't even know why we check for exiting tasks in the OOM
> killer. We've tried direct reclaim at least 15 times by the time we
> decide the system is OOM, there was plenty of time to exit and free
> memory; and a task might exit voluntarily right after we issue a kill.
> This is testing pure noise.

I guess the idea was to prevent from killing another task if some task
is exiting and so it should release its memory shortly. But as you
say this is racy and the oom scanner doesn't know how long has the
target task been in this state without any change. So maybe this is
indeed no longer needed and task_will_free_mem check in out_of_memory is
sufficient.
David?

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index b8a4210..7dfb351 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -305,9 +305,6 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
>  	if (oom_task_origin(task))
>  		return OOM_SCAN_SELECT;
>  
> -	if (task_will_free_mem(task) && !is_sysrq_oom(oc))
> -		return OOM_SCAN_ABORT;
> -
>  	return OOM_SCAN_OK;
>  }
>  

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-01-06 15:42   ` Michal Hocko
@ 2016-01-11 22:54     ` Andrew Morton
  -1 siblings, 0 replies; 56+ messages in thread
From: Andrew Morton @ 2016-01-11 22:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Tetsuo Handa, David Rientjes, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

On Wed,  6 Jan 2016 16:42:54 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> - use subsys_initcall instead of module_init - Paul Gortmaker

That's pretty much the only change between what-i-have and
what-you-sent, so I'll just do this as a delta:


--- a/mm/oom_kill.c~mm-oom-introduce-oom-reaper-v4
+++ a/mm/oom_kill.c
@@ -32,12 +32,11 @@
 #include <linux/mempolicy.h>
 #include <linux/security.h>
 #include <linux/ptrace.h>
-#include <linux/delay.h>
 #include <linux/freezer.h>
 #include <linux/ftrace.h>
 #include <linux/ratelimit.h>
 #include <linux/kthread.h>
-#include <linux/module.h>
+#include <linux/init.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -542,7 +541,7 @@ static int __init oom_init(void)
 	}
 	return 0;
 }
-module_init(oom_init)
+subsys_initcall(oom_init)
 #else
 static void wake_oom_reaper(struct mm_struct *mm)
 {
_

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-01-11 22:54     ` Andrew Morton
@ 2016-01-12  8:16       ` Michal Hocko
  -1 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2016-01-12  8:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Tetsuo Handa, David Rientjes, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Mon 11-01-16 14:54:55, Andrew Morton wrote:
> On Wed,  6 Jan 2016 16:42:54 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > - use subsys_initcall instead of module_init - Paul Gortmaker
> 
> That's pretty much the only change between what-i-have and
> what-you-sent, so I'll just do this as a delta:

Yeah that should be the case, thanks for double checking!
 
> --- a/mm/oom_kill.c~mm-oom-introduce-oom-reaper-v4
> +++ a/mm/oom_kill.c
> @@ -32,12 +32,11 @@
>  #include <linux/mempolicy.h>
>  #include <linux/security.h>
>  #include <linux/ptrace.h>
> -#include <linux/delay.h>
>  #include <linux/freezer.h>
>  #include <linux/ftrace.h>
>  #include <linux/ratelimit.h>
>  #include <linux/kthread.h>
> -#include <linux/module.h>
> +#include <linux/init.h>
>  
>  #include <asm/tlb.h>
>  #include "internal.h"
> @@ -542,7 +541,7 @@ static int __init oom_init(void)
>  	}
>  	return 0;
>  }
> -module_init(oom_init)
> +subsys_initcall(oom_init)
>  #else
>  static void wake_oom_reaper(struct mm_struct *mm)
>  {
> _

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/2] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-01-11 12:42   ` Michal Hocko
@ 2016-01-18  4:35     ` Tetsuo Handa
  -1 siblings, 0 replies; 56+ messages in thread
From: Tetsuo Handa @ 2016-01-18  4:35 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: hannes, mgorman, rientjes, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel, mhocko

Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> When oom_reaper manages to unmap all the eligible vmas there shouldn't
> be much freeable memory held by the oom victim left anymore, so it
> makes sense to clear the TIF_MEMDIE flag for the victim and allow the
> OOM killer to select another task if necessary.
> 
> The lack of TIF_MEMDIE also means that the victim cannot access memory
> reserves anymore but that shouldn't be a problem because it would get
> the access again if it needs to allocate and hits the OOM killer again
> due to the fatal_signal_pending resp. PF_EXITING check. We can safely
> hide the task from the OOM killer because it is clearly not a good
> candidate anymore as everything reclaimable has been torn down already.
> 
> This patch makes it possible to cap the time an OOM victim can keep
> TIF_MEMDIE and thus hold off further global OOM killer actions, provided
> the oom reaper is able to take mmap_sem for the associated mm struct.
> This is not guaranteed now, but further steps should make sure that
> waiting for mmap_sem for write is killable, which will help to reduce
> such lock contention. This is not done by this patch.
> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
> 
> Hi,
> this has passed my basic testing but it definitely needs a deeper
> review.  I have tested it by flooding the system with OOMs and delaying
> exit_mm for TIF_MEMDIE tasks so that the oom reaper wins the race. I made
> sure to delay after the mm was set to NULL so that the oom reaper
> sees a NULL mm from time to time, to exercise this case as well. This
> happened in roughly half of the instances.
> 
>  include/linux/oom.h |  2 +-
>  kernel/exit.c       |  2 +-
>  mm/oom_kill.c       | 72 ++++++++++++++++++++++++++++++++++-------------------
>  3 files changed, 49 insertions(+), 27 deletions(-)

The patch attached at the bottom is my suggestion for making sure that we
won't be trapped in an OOM livelock when the OOM reaper does not reclaim
enough memory for the OOM victim to terminate. It also includes several
bugfixes which I think the current patch is missing.

I like the OOM reaper approach. But I don't like the current patch because
it ignores the unlikely cases described below. I have proposed two simple
patches for handling such corner cases.

  (P1) "[PATCH v2] mm,oom: exclude TIF_MEMDIE processes from candidates."
       http://lkml.kernel.org/r/201601081909.CDJ52685.HLFOFJFOQMVOtS@I-love.SAKURA.ne.jp

  (P2) "[PATCH] mm,oom: Re-enable OOM killer using timers."
       http://lkml.kernel.org/r/201601072026.JCJ95845.LHQOFOOSMFtVFJ@I-love.SAKURA.ne.jp
       (oomkiller_holdoff_timer and sysctl_oomkiller_holdoff_ms in this patch
       are not directly related to avoiding OOM livelock.)

If all changes that cover the unlikely cases are implemented, P1 and P2 will
become unnecessary.

(1) Make the OOM reaper available on CONFIG_MMU=n kernels.

    I don't know about MMU, but I assume we can handle these errors.

    slub.c:(.text+0x4184): undefined reference to `tlb_gather_mmu'
    slub.c:(.text+0x41bc): undefined reference to `unmap_page_range'
    slub.c:(.text+0x41d8): undefined reference to `tlb_finish_mmu'

(2) Do not boot the system if creating the OOM reaper thread fails.

    We already depend heavily on the OOM reaper.

    pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
                    PTR_ERR(oom_reaper_th));

(3) Eliminate locations that call mark_oom_victim() without
    putting the OOM victim task under the OOM reaper's monitoring.

    The OOM reaper needs to take action when the OOM victim task gets stuck
    because we (except me) do not want to use my sysctl-controlled
    timeout-based OOM victim selection.

    out_of_memory():
        if (current->mm &&
            (fatal_signal_pending(current) || task_will_free_mem(current))) {
                mark_oom_victim(current);
                return true;
        }

    oom_kill_process():
        task_lock(p);
        if (p->mm && task_will_free_mem(p)) {
                mark_oom_victim(p);
                task_unlock(p);
                put_task_struct(p);
                return;
        }
        task_unlock(p);

    mem_cgroup_out_of_memory():
        if (fatal_signal_pending(current) || task_will_free_mem(current)) {
                mark_oom_victim(current);
                goto unlock;
        }

    lowmem_scan():
        if (selected->mm)
                mark_oom_victim(selected);

(4) Don't select an OOM victim until mm_to_reap (or task_to_reap) becomes NULL.

    This is needed to make sure that every OOM victim is put under the
    OOM reaper's monitoring, so that the OOM reaper can take action
    before leaving oom_reap_vmas() (or oom_reap_task()).

    Since the OOM reaper sets mm_to_reap (or task_to_reap) back to NULL shortly
    (e.g. within a second if it retries 10 times at 0.1 second intervals),
    waiting should not become a problem.

(5) Decrease oom_score_adj value after the OOM reaper reclaimed memory.

    If __oom_reap_vmas(mm) (or __oom_reap_task(tsk)) succeeded, set oom_score_adj
    value of all tasks sharing the same mm to -1000 (by walking the process list)
    and clear TIF_MEMDIE.

    Changing only the OOM victim's oom_score_adj is not sufficient
    when there are other thread groups sharing the OOM victim's memory
    (i.e. clone(!CLONE_THREAD && CLONE_VM) case).

(6) Decrease oom_score_adj value even if the OOM reaper failed to reclaim memory.

    If __oom_reap_vmas(mm) (or __oom_reap_task(tsk)) failed 10 times, decrease
    the oom_score_adj value of all tasks sharing the same mm and clear
    TIF_MEMDIE. This is needed to prevent the OOM killer from selecting the
    same thread group forever.

    For example, set oom_score_adj to -999 if it is greater than -999, and
    to -1000 if it is already -999. This allows the OOM killer to try
    different OOM victims before retrying __oom_reap_vmas(mm)
    (or __oom_reap_task(tsk)) on this OOM victim, and then to trigger a kernel
    panic once all OOM victims have reached -1000.

    Making the mmap_sem write lock killable increases the chance that
    __oom_reap_vmas(mm) (or __oom_reap_task(tsk)) succeeds. But due to the
    changes in (3) and (4), there is no guarantee that TIF_MEMDIE is set on the
    thread which is looping in __alloc_pages_slowpath() with the mmap_sem held
    for writing. If the OOM killer were able to know which thread is looping in
    __alloc_pages_slowpath() with the mmap_sem held for writing (via a per
    task_struct variable), the OOM killer would set TIF_MEMDIE on that thread
    before randomly choosing one thread using find_lock_task_mm().

(7) Decrease oom_score_adj value even if the OOM reaper is not allowed to reclaim
    memory.

    This is the same as (6) except that it covers cases where the OOM victim's
    memory is used by some OOM-unkillable threads (i.e. can_oom_reap = false
    case).

    Calling wake_oom_reaper() with can_oom_reap added is the simplest way
    to wait for a short period (e.g. a second), then change the oom_score_adj
    value and clear TIF_MEMDIE.
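
As a rough illustration of the single-slot handoff that item (4) relies on,
here is a userspace C11 sketch modelled on the
cmpxchg(&task_to_reap, NULL, tsk) call that wake_oom_reaper() uses before the
patch below replaces it. struct task, queue_for_reaper(), reaper_current() and
reaper_done() are hypothetical names for this sketch, not kernel APIs; the
real code also takes a task reference and wakes the reaper thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct task_struct. */
struct task {
	int pid;
};

/* Single slot, mirroring the reaper's task_to_reap pointer. */
static _Atomic(struct task *) task_to_reap;

/*
 * Try to hand a victim to the reaper. This succeeds only while the slot
 * is empty; a second victim queued in the meantime is rejected.
 */
static bool queue_for_reaper(struct task *tsk)
{
	struct task *expected = NULL;

	assert(tsk != NULL);
	return atomic_compare_exchange_strong(&task_to_reap, &expected, tsk);
}

/* Reaper side: look at the queued victim without releasing the slot. */
static struct task *reaper_current(void)
{
	return atomic_load(&task_to_reap);
}

/* Reaper side: done with the victim, make the slot available again. */
static void reaper_done(void)
{
	atomic_store(&task_to_reap, NULL);
}
```

While a victim occupies the slot, a second queue_for_reaper() call fails,
which is why (4) proposes holding off further victim selection until the
reaper has cleared the slot.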

Since a kmallocwd-like approach (i.e. walking the process list) eliminates
the need for doing (3) and (4), I tried it (a patch is shown below). The
changes are larger than I initially thought, because clearing TIF_MEMDIE
needs a switch to avoid re-setting TIF_MEMDIE forever, and such a switch is
complicated.

  (a) PFA_OOM_NO_RECURSION is a switch to avoid re-setting TIF_MEMDIE forever
      when an OOM victim is chosen without taking ->oom_score_adj into account.

  (b) When an OOM victim is chosen with taking ->oom_score_adj into account,
      it is set to -999 when the OOM reaper was unable to reclaim victim's
      memory. It is set to -1000 when the OOM reaper was unable to reclaim
      victim's memory when it was already -999.

  (c) If ->oom_score_adj was set to -1000 when TIF_MEMDIE was cleared,
      we can consider such task OOM-killable because such task is either
      SIGKILL pending or already exiting. Thus, we should not try to test
      whether a task's memory is reapable at oom_kill_process().
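
The two-step demotion described in (b) can be sketched as a small pure
function. This is a standalone userspace sketch, not the kernel code:
demote_oom_score_adj() is a hypothetical helper name, and the constants are
assumed to mirror the -1000..1000 range of oom_score_adj.

```c
#include <assert.h>

/* Assumed to match the kernel's oom_score_adj limits. */
#define OOM_SCORE_ADJ_MIN (-1000)
#define OOM_SCORE_ADJ_MAX 1000

/*
 * Demote a victim whose memory could not be reaped.
 * First failure:  anything above -999 drops to -999 ("almost unkillable").
 * Second failure: -999 drops to -1000 ("completely unkillable").
 */
static short demote_oom_score_adj(short score)
{
	assert(score >= OOM_SCORE_ADJ_MIN && score <= OOM_SCORE_ADJ_MAX);
	return score > OOM_SCORE_ADJ_MIN + 1 ? OOM_SCORE_ADJ_MIN + 1
					     : OOM_SCORE_ADJ_MIN;
}
```

Applying the function twice to any starting value ends at -1000, so a task
whose memory cannot be reaped is tried at most once more before it is hidden
from victim selection.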

Do we prefer this direction over P1+P2 which do not clear TIF_MEMDIE?
----------------------------------------
Date: Mon, 18 Jan 2016 13:22:51 +0900
Subject: [PATCH 4/2] oom: change OOM reaper to walk the process list

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/sched.h |   4 +
 mm/memcontrol.c       |   8 +-
 mm/oom_kill.c         | 250 ++++++++++++++++++++++++++++++++++----------------
 3 files changed, 183 insertions(+), 79 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1ef541c..1a15c584 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2167,6 +2167,7 @@ static inline void memalloc_noio_restore(unsigned int flags)
 #define PFA_NO_NEW_PRIVS 0	/* May not gain new privileges. */
 #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */
 #define PFA_SPREAD_SLAB  2      /* Spread some slab caches over cpuset */
+#define PFA_OOM_NO_RECURSION 3  /* OOM-killing with OOM score ignored */
 
 
 #define TASK_PFA_TEST(name, func)					\
@@ -2190,6 +2191,9 @@ TASK_PFA_TEST(SPREAD_SLAB, spread_slab)
 TASK_PFA_SET(SPREAD_SLAB, spread_slab)
 TASK_PFA_CLEAR(SPREAD_SLAB, spread_slab)
 
+TASK_PFA_TEST(OOM_NO_RECURSION, oom_no_recursion)
+TASK_PFA_SET(OOM_NO_RECURSION, oom_no_recursion)
+
 /*
  * task->jobctl flags
  */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d75028d..134ddf7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1290,8 +1290,14 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
+	 *
+	 * But prepare for situations where failing to OOM-kill current task
+	 * caused unable to choose next OOM victim.
+	 * In that case, do regular OOM victim selection.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if ((fatal_signal_pending(current) || task_will_free_mem(current)) &&
+	    !task_oom_no_recursion(current)) {
+		task_set_oom_no_recursion(current);
 		mark_oom_victim(current);
 		goto unlock;
 	}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6ebc0351..d3a7cd8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -288,9 +288,16 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	/*
 	 * If task is allocating a lot of memory and has been marked to be
 	 * killed first if it triggers an oom, then select it.
+	 *
+	 * But prepare for situations where failing to OOM-kill this task
+	 * after the OOM reaper reaped this task's memory caused unable to
+	 * abort swapoff() or KSM's unmerge operation.
+	 * In that case, do regular OOM victim selection.
 	 */
-	if (oom_task_origin(task))
+	if (oom_task_origin(task) && !task_oom_no_recursion(task)) {
+		task_set_oom_no_recursion(task);
 		return OOM_SCAN_SELECT;
+	}
 
 	return OOM_SCAN_OK;
 }
@@ -416,37 +423,18 @@ bool oom_killer_disabled __read_mostly;
  * victim (if that is possible) to help the OOM killer to move on.
  */
 static struct task_struct *oom_reaper_th;
-static struct task_struct *task_to_reap;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 
-static bool __oom_reap_task(struct task_struct *tsk)
+static bool __oom_reap_vma(struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	struct mm_struct *mm;
+	struct task_struct *g;
 	struct task_struct *p;
 	struct zap_details details = {.check_swap_entries = true,
 				      .ignore_dirty = true};
 	bool ret = true;
 
-	/*
-	 * Make sure we find the associated mm_struct even when the particular
-	 * thread has already terminated and cleared its mm.
-	 * We might have race with exit path so consider our work done if there
-	 * is no mm.
-	 */
-	p = find_lock_task_mm(tsk);
-	if (!p)
-		return true;
-
-	mm = p->mm;
-	if (!atomic_inc_not_zero(&mm->mm_users)) {
-		task_unlock(p);
-		return true;
-	}
-
-	task_unlock(p);
-
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
 		goto out;
@@ -478,64 +466,169 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	up_read(&mm->mmap_sem);
 
 	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore. OOM killer can continue
-	 * by selecting other victim if unmapping hasn't led to any
-	 * improvements. This also means that selecting this task doesn't
-	 * make any sense.
+	 * If we successfully reaped a mm, mark all tasks using it as
+	 * OOM-unkillable and clear TIF_MEMDIE. This will help future
+	 * select_bad_process() try to select other OOM-killable tasks.
 	 */
-	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
-	exit_oom_victim(tsk);
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		if (mm != p->mm)
+			continue;
+		p->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+		exit_oom_victim(p);
+	}
+	rcu_read_unlock();
 out:
-	mmput(mm);
 	return ret;
 }
 
-static void oom_reap_task(struct task_struct *tsk)
+#define MAX_PIDS_TO_CHECK_LEN 16
+static struct pid *pids_to_check[MAX_PIDS_TO_CHECK_LEN];
+static int pids_to_check_len;
+
+static int gather_pids_to_check(void)
 {
-	int attempts = 0;
+	struct task_struct *g;
+	struct task_struct *p;
 
-	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < 10 && !__oom_reap_task(tsk))
-		schedule_timeout_idle(HZ/10);
+	if (!atomic_read(&oom_victims))
+		return 0;
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		if (!test_tsk_thread_flag(p, TIF_MEMDIE))
+			continue;
+		/*
+		 * Remember "struct pid" of TIF_MEMDIE tasks rather than
+		 * "struct task_struct". This will avoid needlessly deferring
+		 * final __put_task_struct() call when such tasks become
+		 * ready to terminate.
+		 */
+		pids_to_check[pids_to_check_len++] =
+			get_task_pid(p, PIDTYPE_PID);
+		if (pids_to_check_len == MAX_PIDS_TO_CHECK_LEN)
+			goto done;
+	}
+done:
+	rcu_read_unlock();
+	return pids_to_check_len;
+}
 
-	/* Drop a reference taken by wake_oom_reaper */
-	put_task_struct(tsk);
+static int reap_pids_to_check(void)
+{
+	int i;
+	int j;
+	struct pid *pid;
+	struct task_struct *g;
+	struct task_struct *p;
+	struct mm_struct *mm;
+	bool success;
+
+	for (i = 0; i < pids_to_check_len; i++) {
+		pid = pids_to_check[i];
+		rcu_read_lock();
+		p = pid_task(pid, PIDTYPE_PID);
+		mm = p ? READ_ONCE(p->mm) : NULL;
+		if (!mm) {
+			rcu_read_unlock();
+			goto done;
+		}
+		/*
+		 * Since it is possible that p voluntarily called do_exit() or
+		 * somebody other than the OOM killer sent SIGKILL on p, a mm
+		 * used by p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN is
+		 * reapable if p has pending SIGKILL or already reached
+		 * do_exit().
+		 *
+		 * On the other hand, it is possible that mark_oom_victim(p) is
+		 * called without sending SIGKILL to all OOM-killable tasks
+		 * using a mm used by p. In that case, the OOM reaper cannot
+		 * reap that mm unless p is the only task using that mm.
+		 *
+		 * Therefore, determine whether a mm is reapable by testing
+		 * whether all tasks using that mm are dying or already exiting
+		 * rather than depending on p->signal->oom_score_adj value
+		 * which is updated by the OOM reaper.
+		 */
+		for_each_process_thread(g, p) {
+			if (mm != READ_ONCE(p->mm) ||
+			    fatal_signal_pending(p) || (p->flags & PF_EXITING))
+				continue;
+			mm = NULL;
+			goto skip;
+		}
+		if (!atomic_inc_not_zero(&mm->mm_users))
+			mm = NULL;
+skip:
+		rcu_read_unlock();
+		if (!mm)
+			continue;
+		success = __oom_reap_vma(mm);
+		mmput(mm);
+		if (success) {
+done:
+			put_pid(pid);
+			pids_to_check_len--;
+			for (j = i; j < pids_to_check_len; j++)
+				pids_to_check[j] = pids_to_check[j + 1];
+			i--;
+		}
+	}
+	return pids_to_check_len;
+}
+
+static void release_pids_to_check(void)
+{
+	int i;
+	struct pid *pid;
+	struct task_struct *p;
+	short score;
+
+	for (i = 0; i < pids_to_check_len; i++) {
+		pid = pids_to_check[i];
+		/*
+		 * If we failed to reap a mm, mark that task using it as almost
+		 * OOM-unkillable and clear TIF_MEMDIE. This will help future
+		 * select_bad_process() try to select other OOM-killable tasks
+		 * before selecting that task again.
+		 *
+		 * But if that task got TIF_MEMDIE when that task is already
+		 * marked as almost OOM-unkillable, mark that task completely
+		 * OOM-unkillable. Otherwise, we cannot make progress when all
+		 * OOM-killable tasks became almost OOM-unkillable.
+		 */
+		rcu_read_lock();
+		p = pid_task(pid, PIDTYPE_PID);
+		if (p) {
+			score = p->signal->oom_score_adj;
+			p->signal->oom_score_adj =
+				score > OOM_SCORE_ADJ_MIN + 1 ?
+				OOM_SCORE_ADJ_MIN + 1 : OOM_SCORE_ADJ_MIN;
+			exit_oom_victim(p);
+		}
+		rcu_read_unlock();
+		put_pid(pid);
+	}
+	pids_to_check_len = 0;
 }
 
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct task_struct *tsk;
+		int i;
 
-		wait_event_freezable(oom_reaper_wait,
-				     (tsk = READ_ONCE(task_to_reap)));
-		oom_reap_task(tsk);
-		WRITE_ONCE(task_to_reap, NULL);
+		wait_event_freezable(oom_reaper_wait, gather_pids_to_check());
+		for (i = 0; reap_pids_to_check() && i < 10; i++)
+			schedule_timeout_idle(HZ / 10);
+		release_pids_to_check();
 	}
 
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+static void wake_oom_reaper(void)
 {
-	struct task_struct *old_tsk;
-
-	if (!oom_reaper_th)
-		return;
-
-	get_task_struct(tsk);
-
-	/*
-	 * Make sure that only a single mm is ever queued for the reaper
-	 * because multiple are not necessary and the operation might be
-	 * disruptive so better reduce it to the bare minimum.
-	 */
-	old_tsk = cmpxchg(&task_to_reap, NULL, tsk);
-	if (!old_tsk)
+	if (oom_reaper_th)
 		wake_up(&oom_reaper_wait);
-	else
-		put_task_struct(tsk);
 }
 
 static int __init oom_init(void)
@@ -558,7 +651,7 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static void wake_oom_reaper(struct task_struct *mm)
+static void wake_oom_reaper(void)
 {
 }
 #endif
@@ -584,6 +677,7 @@ void mark_oom_victim(struct task_struct *tsk)
 	 */
 	__thaw_task(tsk);
 	atomic_inc(&oom_victims);
+	wake_oom_reaper();
 }
 
 /**
@@ -591,9 +685,8 @@ void mark_oom_victim(struct task_struct *tsk)
  */
 void exit_oom_victim(struct task_struct *tsk)
 {
-	clear_tsk_thread_flag(tsk, TIF_MEMDIE);
-
-	if (!atomic_dec_return(&oom_victims))
+	if (test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE) &&
+	    !atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
 }
 
@@ -672,7 +765,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -740,7 +832,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * space under its control.
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
-	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -766,23 +857,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 		if (is_global_init(p))
 			continue;
 		if (unlikely(p->flags & PF_KTHREAD) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-			/*
-			 * We cannot use oom_reaper for the mm shared by this
-			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
-			 */
-			can_oom_reap = false;
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 			continue;
-		}
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
-
 	mmdrop(mm);
+	mark_oom_victim(victim);
 	put_task_struct(victim);
 }
 #undef K
@@ -858,9 +940,14 @@ bool out_of_memory(struct oom_control *oc)
 	 *
 	 * But don't select if current has already released its mm and cleared
 	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
+	 *
+	 * Also, prepare for situations where failing to OOM-kill current task
+	 * caused unable to choose next OOM victim.
+	 * In that case, do regular OOM victim selection.
 	 */
-	if (current->mm &&
+	if (current->mm && !task_oom_no_recursion(current) &&
 	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+		task_set_oom_no_recursion(current);
 		mark_oom_victim(current);
 		return true;
 	}
@@ -876,7 +963,14 @@ bool out_of_memory(struct oom_control *oc)
 
 	if (sysctl_oom_kill_allocating_task && current->mm &&
 	    !oom_unkillable_task(current, NULL, oc->nodemask) &&
-	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN &&
+	    !task_oom_no_recursion(current)) {
+		/*
+		 * But prepare for situations where failing to OOM-kill current
+		 * task caused unable to choose next OOM victim.
+		 * In that case, do regular OOM victim selection.
+		 */
+		task_set_oom_no_recursion(current);
 		get_task_struct(current);
 		oom_kill_process(oc, current, 0, totalpages, NULL,
 				 "Out of memory (oom_kill_allocating_task)");
-- 
1.8.3.1
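
One detail of the patch above is that exit_oom_victim() decrements
oom_victims only when it actually cleared TIF_MEMDIE, so calling it for every
task sharing an mm, or twice for the same task, cannot underflow the counter.
A minimal userspace C11 sketch of that guard follows; struct victim,
mark_victim() and exit_victim() are hypothetical stand-ins, and the
test-and-set guard in mark_victim() is an assumption about how the flag gets
set.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a task carrying the TIF_MEMDIE flag. */
struct victim {
	atomic_bool memdie;
};

/* Number of tasks currently holding the flag. */
static atomic_int oom_victims;

/* Set the flag; count the task only on a false -> true transition. */
static void mark_victim(struct victim *v)
{
	if (!atomic_exchange(&v->memdie, true))
		atomic_fetch_add(&oom_victims, 1);
}

/*
 * Clear the flag; decrement only if this call really cleared it.
 * Returns true when the last victim went away, which is the point where
 * the kernel version would wake up waiters.
 */
static bool exit_victim(struct victim *v)
{
	if (!atomic_exchange(&v->memdie, false))
		return false;	/* flag was already clear: leave the counter alone */
	return atomic_fetch_sub(&oom_victims, 1) == 1;
}
```

Without the test-and-clear, the loop over all tasks sharing the reaped mm
would decrement the counter once per task, including tasks that never had
the flag set.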

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/2] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
@ 2016-01-18  4:35     ` Tetsuo Handa
  0 siblings, 0 replies; 56+ messages in thread
From: Tetsuo Handa @ 2016-01-18  4:35 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: hannes, mgorman, rientjes, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel, mhocko

Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> When oom_reaper manages to unmap all the eligible vmas there shouldn't
> be much freeable memory held by the oom victim left anymore, so it
> makes sense to clear the TIF_MEMDIE flag for the victim and allow the
> OOM killer to select another task if necessary.
> 
> The lack of TIF_MEMDIE also means that the victim cannot access memory
> reserves anymore but that shouldn't be a problem because it would get
> the access again if it needs to allocate and hits the OOM killer again
> due to the fatal_signal_pending resp. PF_EXITING check. We can safely
> hide the task from the OOM killer because it is clearly not a good
> candidate anymore as everything reclaimable has been torn down already.
> 
> This patch makes it possible to cap the time an OOM victim can keep
> TIF_MEMDIE and thus hold off further global OOM killer actions, provided
> the oom reaper is able to take mmap_sem for the associated mm struct.
> This is not guaranteed now, but further steps should make sure that
> waiting for mmap_sem for write is killable, which will help to reduce
> such lock contention. This is not done by this patch.
> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
> 
> Hi,
> this has passed my basic testing but it definitely needs a deeper
> review.  I have tested it by flooding the system with OOMs and delaying
> exit_mm for TIF_MEMDIE tasks so that the oom reaper wins the race. I made
> sure to delay after the mm was set to NULL so that the oom reaper
> sees a NULL mm from time to time, to exercise this case as well. This
> happened in roughly half of the instances.
> 
>  include/linux/oom.h |  2 +-
>  kernel/exit.c       |  2 +-
>  mm/oom_kill.c       | 72 ++++++++++++++++++++++++++++++++++-------------------
>  3 files changed, 49 insertions(+), 27 deletions(-)

The patch attached at the bottom is my suggestion for making sure that we
won't be trapped in an OOM livelock when the OOM reaper does not reclaim
enough memory for the OOM victim to terminate. It also includes several
bugfixes which I think the current patch is missing.

I like the OOM reaper approach. But I don't like the current patch because
it ignores the unlikely cases described below. I have proposed two simple
patches for handling such corner cases.

  (P1) "[PATCH v2] mm,oom: exclude TIF_MEMDIE processes from candidates."
       http://lkml.kernel.org/r/201601081909.CDJ52685.HLFOFJFOQMVOtS@I-love.SAKURA.ne.jp

  (P2) "[PATCH] mm,oom: Re-enable OOM killer using timers."
       http://lkml.kernel.org/r/201601072026.JCJ95845.LHQOFOOSMFtVFJ@I-love.SAKURA.ne.jp
       (oomkiller_holdoff_timer and sysctl_oomkiller_holdoff_ms in this patch
       are not directly related to avoiding OOM livelock.)

If all changes that cover the unlikely cases are implemented, P1 and P2 will
become unnecessary.

(1) Make the OOM reaper available on CONFIG_MMU=n kernels.

    I don't know about MMU, but I assume we can handle these errors.

    slub.c:(.text+0x4184): undefined reference to `tlb_gather_mmu'
    slub.c:(.text+0x41bc): undefined reference to `unmap_page_range'
    slub.c:(.text+0x41d8): undefined reference to `tlb_finish_mmu'

(2) Do not boot the system if creating the OOM reaper thread fails.

    We already depend heavily on the OOM reaper.

    pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
                    PTR_ERR(oom_reaper_th));

(3) Eliminate locations that call mark_oom_victim() without
    making the OOM victim task under monitor of the OOM reaper.

    The OOM reaper needs to take actions when the OOM victim task got stuck
    because we (except me) do not want to use my sysctl-controlled timeout-
    based OOM victim selection.

    out_of_memory():
        if (current->mm &&
            (fatal_signal_pending(current) || task_will_free_mem(current))) {
                mark_oom_victim(current);
                return true;
        }

    oom_kill_process():
        task_lock(p);
        if (p->mm && task_will_free_mem(p)) {
                mark_oom_victim(p);
                task_unlock(p);
                put_task_struct(p);
                return;
        }
        task_unlock(p);

    mem_cgroup_out_of_memory():
        if (fatal_signal_pending(current) || task_will_free_mem(current)) {
                mark_oom_victim(current);
                goto unlock;
        }

    lowmem_scan():
        if (selected->mm)
                mark_oom_victim(selected);

(4) Don't select an OOM victim until mm_to_reap (or task_to_reap) becomes NULL.

    This is needed for making sure that any OOM victim is made under
    monitor of the OOM reaper in order to let the OOM reaper take action
    before leaving oom_reap_vmas() (or oom_reap_task()).

    Since the OOM reaper can do mm_to_reap (or task_to_reap) = NULL shortly
    (e.g. within a second if it retries for 10 times with 0.1 second interval),
    waiting should not become a problem.

(5) Decrease oom_score_adj value after the OOM reaper reclaimed memory.

    If __oom_reap_vmas(mm) (or __oom_reap_task(tsk)) succeeded, set oom_score_adj
    value of all tasks sharing the same mm to -1000 (by walking the process list)
    and clear TIF_MEMDIE.

    Changing only the OOM victim's oom_score_adj is not sufficient
    when there are other thread groups sharing the OOM victim's memory
    (i.e. clone(!CLONE_THREAD && CLONE_VM) case).

(6) Decrease oom_score_adj value even if the OOM reaper failed to reclaim memory.

    If __oom_reap_vmas(mm) (or __oom_reap_task(tsk)) failed for 10 times, decrease
    oom_score_adj value of all tasks sharing the same mm and clear TIF_MEMDIE.
    This is needed for preventing the OOM killer from selecting the same thread
    group forever.

    An example is, set oom_score_adj to -999 if oom_score_adj is greater than
    -999, set -1000 if oom_score_adj is already -999. This will allow the OOM
    killer try to choose different OOM victims before retrying __oom_reap_vmas(mm)
    (or __oom_reap_task(tsk)) of this OOM victim, then trigger kernel panic if
    all OOM victims got -1000.

    Changing mmap_sem lock killable increases possibility of __oom_reap_vmas(mm)
    (or __oom_reap_task(tsk)) to succeed. But due to the changes in (3) and (4),
    there is no guarantee that TIF_MEMDIE is set to the thread which is looping at
    __alloc_pages_slowpath() with the mmap_sem held for writing. If the OOM killer
    were able to know which thread is looping at __alloc_pages_slowpath() with the
    mmap_sem held for writing (via per task_struct variable), the OOM killer would
    set TIF_MEMDIE on that thread before randomly choosing one thread using
    find_lock_task_mm().

(7) Decrease oom_score_adj value even if the OOM reaper is not allowed to reclaim
    memory.

    This is the same as (6) except for cases where the OOM victim's memory is
    used by some OOM-unkillable threads (i.e. can_oom_reap = false case).

    Calling wake_oom_reaper() with can_oom_reap added is the simplest way to
    wait for a short period (e.g. a second), then change the oom_score_adj value
    and clear TIF_MEMDIE.
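
The two-step demotion described in (6) and (7) can be sketched in plain
userspace C as follows. This is only an illustration under stated assumptions,
not kernel code: demote_oom_score_adj() and OOM_SCORE_ADJ_ALMOST_MIN are
hypothetical names, and OOM_SCORE_ADJ_MIN is redefined here to mirror the
kernel's value of -1000.

```c
#include <assert.h>

/* Mirrors the kernel's OOM_SCORE_ADJ_MIN; redefined for this userspace sketch. */
#define OOM_SCORE_ADJ_MIN (-1000)
/* Hypothetical name for the "almost OOM-unkillable" value -999. */
#define OOM_SCORE_ADJ_ALMOST_MIN (OOM_SCORE_ADJ_MIN + 1)

/*
 * Hypothetical helper: compute the new oom_score_adj after the reaper
 * gave up on a victim. A score above -999 is clamped to -999; a score
 * that is already -999 (or lower) becomes -1000.
 */
short demote_oom_score_adj(short score)
{
	if (score > OOM_SCORE_ADJ_ALMOST_MIN)
		return OOM_SCORE_ADJ_ALMOST_MIN;
	return OOM_SCORE_ADJ_MIN;
}
```

With this rule a victim is deprioritized on the first failure and only becomes
fully OOM-unkillable on a second failure, so other candidates get tried first.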

Since a kmallocwd-like approach (i.e. walking the process list) will eliminate
the need for doing (3) and (4), I tried it (a patch is shown below). The
changes are larger than I initially thought, because clearing TIF_MEMDIE needs
a switch to avoid re-setting TIF_MEMDIE forever, and such a switch is
complicated.

  (a) PFA_OOM_NO_RECURSION is a switch to avoid re-setting TIF_MEMDIE forever
      when an OOM victim is chosen without taking ->oom_score_adj into account.

  (b) When an OOM victim is chosen taking ->oom_score_adj into account,
      it is set to -999 when the OOM reaper was unable to reclaim the victim's
      memory. It is set to -1000 when the OOM reaper was unable to reclaim the
      victim's memory and it was already -999.

  (c) If ->oom_score_adj was set to -1000 when TIF_MEMDIE was cleared,
      we can consider such a task OOM-killable because such a task either
      has SIGKILL pending or is already exiting. Thus, we should not try to
      test whether a task's memory is reapable at oom_kill_process().
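
The one-shot behaviour of the switch in (a) can be modelled in userspace C
roughly as below. This is a sketch under assumptions: struct task and
try_shortcut_selection() are hypothetical stand-ins, the dying field stands for
the fatal_signal_pending() || task_will_free_mem() test, and the flag models
PFA_OOM_NO_RECURSION, which is set once and never cleared.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the task_struct fields used in this sketch. */
struct task {
	bool oom_no_recursion;	/* models PFA_OOM_NO_RECURSION */
	bool dying;		/* models fatal_signal_pending() || task_will_free_mem() */
};

/*
 * Returns true if the task may be selected directly, bypassing the
 * regular score-based victim selection. The flag is set on first use,
 * so a second attempt on the same task returns false and the caller
 * falls back to regular OOM victim selection.
 */
bool try_shortcut_selection(struct task *t)
{
	if (!t->dying || t->oom_no_recursion)
		return false;
	t->oom_no_recursion = true;
	return true;
}
```

The point of the flag is that a task which was already picked through a
shortcut, and then had TIF_MEMDIE cleared by the reaper, cannot be picked
through the same shortcut again.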

Do we prefer this direction over P1+P2 which do not clear TIF_MEMDIE?
----------------------------------------
Date: Mon, 18 Jan 2016 13:22:51 +0900
Subject: [PATCH 4/2] oom: change OOM reaper to walk the process list

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/sched.h |   4 +
 mm/memcontrol.c       |   8 +-
 mm/oom_kill.c         | 250 ++++++++++++++++++++++++++++++++++----------------
 3 files changed, 183 insertions(+), 79 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1ef541c..1a15c584 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2167,6 +2167,7 @@ static inline void memalloc_noio_restore(unsigned int flags)
 #define PFA_NO_NEW_PRIVS 0	/* May not gain new privileges. */
 #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */
 #define PFA_SPREAD_SLAB  2      /* Spread some slab caches over cpuset */
+#define PFA_OOM_NO_RECURSION 3  /* OOM-killing with OOM score ignored */
 
 
 #define TASK_PFA_TEST(name, func)					\
@@ -2190,6 +2191,9 @@ TASK_PFA_TEST(SPREAD_SLAB, spread_slab)
 TASK_PFA_SET(SPREAD_SLAB, spread_slab)
 TASK_PFA_CLEAR(SPREAD_SLAB, spread_slab)
 
+TASK_PFA_TEST(OOM_NO_RECURSION, oom_no_recursion)
+TASK_PFA_SET(OOM_NO_RECURSION, oom_no_recursion)
+
 /*
  * task->jobctl flags
  */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d75028d..134ddf7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1290,8 +1290,14 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
+	 *
+	 * But prepare for situations where failing to OOM-kill current task
+	 * caused unable to choose next OOM victim.
+	 * In that case, do regular OOM victim selection.
 	 */
-	if (fatal_signal_pending(current) || task_will_free_mem(current)) {
+	if ((fatal_signal_pending(current) || task_will_free_mem(current)) &&
+	    !task_oom_no_recursion(current)) {
+		task_set_oom_no_recursion(current);
 		mark_oom_victim(current);
 		goto unlock;
 	}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6ebc0351..d3a7cd8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -288,9 +288,16 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	/*
 	 * If task is allocating a lot of memory and has been marked to be
 	 * killed first if it triggers an oom, then select it.
+	 *
+	 * But prepare for situations where failing to OOM-kill this task
+	 * after the OOM reaper reaped this task's memory caused unable to
+	 * abort swapoff() or KSM's unmerge operation.
+	 * In that case, do regular OOM victim selection.
 	 */
-	if (oom_task_origin(task))
+	if (oom_task_origin(task) && !task_oom_no_recursion(task)) {
+		task_set_oom_no_recursion(task);
 		return OOM_SCAN_SELECT;
+	}
 
 	return OOM_SCAN_OK;
 }
@@ -416,37 +423,18 @@ bool oom_killer_disabled __read_mostly;
  * victim (if that is possible) to help the OOM killer to move on.
  */
 static struct task_struct *oom_reaper_th;
-static struct task_struct *task_to_reap;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 
-static bool __oom_reap_task(struct task_struct *tsk)
+static bool __oom_reap_vma(struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	struct mm_struct *mm;
+	struct task_struct *g;
 	struct task_struct *p;
 	struct zap_details details = {.check_swap_entries = true,
 				      .ignore_dirty = true};
 	bool ret = true;
 
-	/*
-	 * Make sure we find the associated mm_struct even when the particular
-	 * thread has already terminated and cleared its mm.
-	 * We might have race with exit path so consider our work done if there
-	 * is no mm.
-	 */
-	p = find_lock_task_mm(tsk);
-	if (!p)
-		return true;
-
-	mm = p->mm;
-	if (!atomic_inc_not_zero(&mm->mm_users)) {
-		task_unlock(p);
-		return true;
-	}
-
-	task_unlock(p);
-
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
 		goto out;
@@ -478,64 +466,169 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	up_read(&mm->mmap_sem);
 
 	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore. OOM killer can continue
-	 * by selecting other victim if unmapping hasn't led to any
-	 * improvements. This also means that selecting this task doesn't
-	 * make any sense.
+	 * If we successfully reaped a mm, mark all tasks using it as
+	 * OOM-unkillable and clear TIF_MEMDIE. This will help future
+	 * select_bad_process() try to select other OOM-killable tasks.
 	 */
-	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
-	exit_oom_victim(tsk);
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		if (mm != p->mm)
+			continue;
+		p->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+		exit_oom_victim(p);
+	}
+	rcu_read_unlock();
 out:
-	mmput(mm);
 	return ret;
 }
 
-static void oom_reap_task(struct task_struct *tsk)
+#define MAX_PIDS_TO_CHECK_LEN 16
+static struct pid *pids_to_check[MAX_PIDS_TO_CHECK_LEN];
+static int pids_to_check_len;
+
+static int gather_pids_to_check(void)
 {
-	int attempts = 0;
+	struct task_struct *g;
+	struct task_struct *p;
 
-	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < 10 && !__oom_reap_task(tsk))
-		schedule_timeout_idle(HZ/10);
+	if (!atomic_read(&oom_victims))
+		return 0;
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		if (!test_tsk_thread_flag(p, TIF_MEMDIE))
+			continue;
+		/*
+		 * Remember "struct pid" of TIF_MEMDIE tasks rather than
+		 * "struct task_struct". This will avoid needlessly deferring
+		 * final __put_task_struct() call when such tasks become
+		 * ready to terminate.
+		 */
+		pids_to_check[pids_to_check_len++] =
+			get_task_pid(p, PIDTYPE_PID);
+		if (pids_to_check_len == MAX_PIDS_TO_CHECK_LEN)
+			goto done;
+	}
+done:
+	rcu_read_unlock();
+	return pids_to_check_len;
+}
 
-	/* Drop a reference taken by wake_oom_reaper */
-	put_task_struct(tsk);
+static int reap_pids_to_check(void)
+{
+	int i;
+	int j;
+	struct pid *pid;
+	struct task_struct *g;
+	struct task_struct *p;
+	struct mm_struct *mm;
+	bool success;
+
+	for (i = 0; i < pids_to_check_len; i++) {
+		pid = pids_to_check[i];
+		rcu_read_lock();
+		p = pid_task(pid, PIDTYPE_PID);
+		mm = p ? READ_ONCE(p->mm) : NULL;
+		if (!mm) {
+			rcu_read_unlock();
+			goto done;
+		}
+		/*
+		 * Since it is possible that p voluntarily called do_exit() or
+		 * somebody other than the OOM killer sent SIGKILL on p, a mm
+		 * used by p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN is
+		 * reapable if p has pending SIGKILL or already reached
+		 * do_exit().
+		 *
+		 * On the other hand, it is possible that mark_oom_victim(p) is
+		 * called without sending SIGKILL to all OOM-killable tasks
+		 * using a mm used by p. In that case, the OOM reaper cannot
+		 * reap that mm unless p is the only task using that mm.
+		 *
+		 * Therefore, determine whether a mm is reapable by testing
+		 * whether all tasks using that mm are dying or already exiting
+		 * rather than depending on p->signal->oom_score_adj value
+		 * which is updated by the OOM reaper.
+		 */
+		for_each_process_thread(g, p) {
+			if (mm != READ_ONCE(p->mm) ||
+			    fatal_signal_pending(p) || (p->flags & PF_EXITING))
+				continue;
+			mm = NULL;
+			goto skip;
+		}
+		if (!atomic_inc_not_zero(&mm->mm_users))
+			mm = NULL;
+skip:
+		rcu_read_unlock();
+		if (!mm)
+			continue;
+		success = __oom_reap_vma(mm);
+		mmput(mm);
+		if (success) {
+done:
+			put_pid(pid);
+			pids_to_check_len--;
+			for (j = i; j < pids_to_check_len; j++)
+				pids_to_check[j] = pids_to_check[j + 1];
+			i--;
+		}
+	}
+	return pids_to_check_len;
+}
+
+static void release_pids_to_check(void)
+{
+	int i;
+	struct pid *pid;
+	struct task_struct *p;
+	short score;
+
+	for (i = 0; i < pids_to_check_len; i++) {
+		pid = pids_to_check[i];
+		/*
+		 * If we failed to reap a mm, mark that task using it as almost
+		 * OOM-unkillable and clear TIF_MEMDIE. This will help future
+		 * select_bad_process() try to select other OOM-killable tasks
+		 * before selecting that task again.
+		 *
+		 * But if that task got TIF_MEMDIE when that task is already
+		 * marked as almost OOM-unkillable, mark that task completely
+		 * OOM-unkillable. Otherwise, we cannot make progress when all
+		 * OOM-killable tasks became almost OOM-unkillable.
+		 */
+		rcu_read_lock();
+		p = pid_task(pid, PIDTYPE_PID);
+		if (p) {
+			score = p->signal->oom_score_adj;
+			p->signal->oom_score_adj =
+				score > OOM_SCORE_ADJ_MIN + 1 ?
+				OOM_SCORE_ADJ_MIN + 1 : OOM_SCORE_ADJ_MIN;
+			exit_oom_victim(p);
+		}
+		rcu_read_unlock();
+		put_pid(pid);
+	}
+	pids_to_check_len = 0;
 }
 
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct task_struct *tsk;
+		int i;
 
-		wait_event_freezable(oom_reaper_wait,
-				     (tsk = READ_ONCE(task_to_reap)));
-		oom_reap_task(tsk);
-		WRITE_ONCE(task_to_reap, NULL);
+		wait_event_freezable(oom_reaper_wait, gather_pids_to_check());
+		for (i = 0; reap_pids_to_check() && i < 10; i++)
+			schedule_timeout_idle(HZ / 10);
+		release_pids_to_check();
 	}
 
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+static void wake_oom_reaper(void)
 {
-	struct task_struct *old_tsk;
-
-	if (!oom_reaper_th)
-		return;
-
-	get_task_struct(tsk);
-
-	/*
-	 * Make sure that only a single mm is ever queued for the reaper
-	 * because multiple are not necessary and the operation might be
-	 * disruptive so better reduce it to the bare minimum.
-	 */
-	old_tsk = cmpxchg(&task_to_reap, NULL, tsk);
-	if (!old_tsk)
+	if (oom_reaper_th)
 		wake_up(&oom_reaper_wait);
-	else
-		put_task_struct(tsk);
 }
 
 static int __init oom_init(void)
@@ -558,7 +651,7 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static void wake_oom_reaper(struct task_struct *mm)
+static void wake_oom_reaper(void)
 {
 }
 #endif
@@ -584,6 +677,7 @@ void mark_oom_victim(struct task_struct *tsk)
 	 */
 	__thaw_task(tsk);
 	atomic_inc(&oom_victims);
+	wake_oom_reaper();
 }
 
 /**
@@ -591,9 +685,8 @@ void mark_oom_victim(struct task_struct *tsk)
  */
 void exit_oom_victim(struct task_struct *tsk)
 {
-	clear_tsk_thread_flag(tsk, TIF_MEMDIE);
-
-	if (!atomic_dec_return(&oom_victims))
+	if (test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE) &&
+	    !atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
 }
 
@@ -672,7 +765,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -740,7 +832,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * space under its control.
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
-	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -766,23 +857,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 		if (is_global_init(p))
 			continue;
 		if (unlikely(p->flags & PF_KTHREAD) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-			/*
-			 * We cannot use oom_reaper for the mm shared by this
-			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
-			 */
-			can_oom_reap = false;
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 			continue;
-		}
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
-
 	mmdrop(mm);
+	mark_oom_victim(victim);
 	put_task_struct(victim);
 }
 #undef K
@@ -858,9 +940,14 @@ bool out_of_memory(struct oom_control *oc)
 	 *
 	 * But don't select if current has already released its mm and cleared
 	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
+	 *
+	 * Also, prepare for situations where failing to OOM-kill current task
+	 * caused unable to choose next OOM victim.
+	 * In that case, do regular OOM victim selection.
 	 */
-	if (current->mm &&
+	if (current->mm && !task_oom_no_recursion(current) &&
 	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
+		task_set_oom_no_recursion(current);
 		mark_oom_victim(current);
 		return true;
 	}
@@ -876,7 +963,14 @@ bool out_of_memory(struct oom_control *oc)
 
 	if (sysctl_oom_kill_allocating_task && current->mm &&
 	    !oom_unkillable_task(current, NULL, oc->nodemask) &&
-	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN &&
+	    !task_oom_no_recursion(current)) {
+		/*
+		 * But prepare for situations where failing to OOM-kill current
+		 * task caused unable to choose next OOM victim.
+		 * In that case, do regular OOM victim selection.
+		 */
+		task_set_oom_no_recursion(current);
 		get_task_struct(current);
 		oom_kill_process(oc, current, 0, totalpages, NULL,
 				 "Out of memory (oom_kill_allocating_task)");
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/2] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-01-18  4:35     ` Tetsuo Handa
@ 2016-01-18 10:22       ` Tetsuo Handa
  -1 siblings, 0 replies; 56+ messages in thread
From: Tetsuo Handa @ 2016-01-18 10:22 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: hannes, mgorman, rientjes, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel, mhocko

> Date: Mon, 18 Jan 2016 13:22:51 +0900
> Subject: [PATCH 4/2] oom: change OOM reaper to walk the process list
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  include/linux/sched.h |   4 +
>  mm/memcontrol.c       |   8 +-
>  mm/oom_kill.c         | 250 ++++++++++++++++++++++++++++++++++----------------
>  3 files changed, 183 insertions(+), 79 deletions(-)
> 
Oops. I meant to move mark_oom_victim() to after sending SIGKILL to other
processes sharing the same memory, but I can't move mark_oom_victim() to
after task_unlock().

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d3a7cd8..51cb936 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -832,6 +832,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	 * space under its control.
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -864,7 +865,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	rcu_read_unlock();
 
 	mmdrop(mm);
-	mark_oom_victim(victim);
 	put_task_struct(victim);
 }
 #undef K

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/2] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-01-18  4:35     ` Tetsuo Handa
@ 2016-01-26 16:38       ` Michal Hocko
  -1 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2016-01-26 16:38 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, hannes, mgorman, rientjes, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

On Mon 18-01-16 13:35:44, Tetsuo Handa wrote:
[...]
> (1) Make the OOM reaper available on CONFIG_MMU=n kernels.
> 
>     I don't know about MMU, but I assume we can handle these errors.

What is the use case for this on !MMU configurations? Why does it make
sense to add more code to such restricted environments? I haven't
heard of a single OOM report from that land.

>     slub.c:(.text+0x4184): undefined reference to `tlb_gather_mmu'
>     slub.c:(.text+0x41bc): undefined reference to `unmap_page_range'
>     slub.c:(.text+0x41d8): undefined reference to `tlb_finish_mmu'
> 
> (2) Do not boot the system if failed to create the OOM reaper thread.
> 
>     We are already heavily depending on the OOM reaper.

Hohmm, does this really bother you that much? This all happens really
early during the boot. If a single kernel thread creation fails that
early then we are screwed anyway and the OOM killer will not help a tiny
bit. The only place where the current benevolence matters is the test for
oom_reaper_th != NULL in wake_oom_reaper and I doubt it adds any
overhead. BUG_ON is suited for unrecoverable errors and we can clearly
live without oom_reaper.
 
>     pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
>                     PTR_ERR(oom_reaper_th));
> 
> (3) Eliminate locations that call mark_oom_victim() without
>     making the OOM victim task under monitor of the OOM reaper.
> 
>     The OOM reaper needs to take actions when the OOM victim task got stuck
>     because we (except me) do not want to use my sysctl-controlled timeout-
>     based OOM victim selection.

I do not think this is the correct way to approach the problem. I think we
should involve oom_reaper for those cases. I just want to do that in
incremental steps. Originally I had the oom_reaper invocation in
mark_oom_victim but that didn't work out (for reasons I do not remember
right now and would have to find in the archive).
[...]

> (4) Don't select an OOM victim until mm_to_reap (or task_to_reap) becomes NULL.

If we ever see a realistic case where the OOM killer hits at such a pace
that the oom reaper cannot cope with it then I would rather introduce a
queuing mechanism than add complex code to synchronize the two
contexts. They are currently synchronized via TIF_MEMDIE and that should
be sufficient until TIF_MEMDIE stops being the oom synchronization
point.
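
For reference, the single-slot handoff that a queue would replace can be
modelled in userspace with C11 atomics. This is a sketch, not the kernel
implementation: the patch under discussion uses cmpxchg(), READ_ONCE() and
WRITE_ONCE() on task_to_reap, while struct task and the three function names
below are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct task { int pid; };	/* hypothetical stand-in for task_struct */

/* Single slot shared by the OOM killer (producer) and the reaper (consumer). */
static _Atomic(struct task *) task_to_reap = NULL;

/* Producer: queue tsk only if the slot is empty, otherwise drop the request. */
bool queue_for_reaper(struct task *tsk)
{
	struct task *expected = NULL;

	return atomic_compare_exchange_strong(&task_to_reap, &expected, tsk);
}

/* Consumer: look at the queued task; the slot stays occupied while reaping. */
struct task *reaper_peek(void)
{
	return atomic_load(&task_to_reap);
}

/* Consumer: reaping finished, free the slot for the next victim. */
void reaper_done(void)
{
	atomic_store(&task_to_reap, NULL);
}
```

A second victim arriving while the slot is occupied is simply not queued, which
is the case a queuing mechanism would address.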

>     This is needed for making sure that any OOM victim is made under
>     monitor of the OOM reaper in order to let the OOM reaper take action
>     before leaving oom_reap_vmas() (or oom_reap_task()).
> 
>     Since the OOM reaper can do mm_to_reap (or task_to_reap) = NULL shortly
>     (e.g. within a second if it retries for 10 times with 0.1 second interval),
>     waiting should not become a problem.
> 
> (5) Decrease oom_score_adj value after the OOM reaper reclaimed memory.
> 
>     If __oom_reap_vmas(mm) (or __oom_reap_task(tsk)) succeeded, set oom_score_adj
>     value of all tasks sharing the same mm to -1000 (by walking the process list)
>     and clear TIF_MEMDIE.
> 
>     Changing only the OOM victim's oom_score_adj is not sufficient
>     when there are other thread groups sharing the OOM victim's memory
>     (i.e. clone(!CLONE_THREAD && CLONE_VM) case).
>
> (6) Decrease oom_score_adj value even if the OOM reaper failed to reclaim memory.
> 
>     If __oom_reap_vmas(mm) (or __oom_reap_task(tsk)) failed for 10 times, decrease
>     oom_score_adj value of all tasks sharing the same mm and clear TIF_MEMDIE.
>     This is needed for preventing the OOM killer from selecting the same thread
>     group forever.

I understand what you mean but I would consider this outside of the
scope of the patchset as I want to pursue it right now. I really want to
introduce a simple async OOM handling. Further steps can be built on top
but please let's not make it a huge monster right away. The same applies
to point 5. An mm shared between processes is too much of a borderline
case to focus on in the first submission.

>     An example is, set oom_score_adj to -999 if oom_score_adj is greater than
>     -999, set -1000 if oom_score_adj is already -999. This will allow the OOM
>     killer try to choose different OOM victims before retrying __oom_reap_vmas(mm)
>     (or __oom_reap_task(tsk)) of this OOM victim, then trigger kernel panic if
>     all OOM victims got -1000.
> 
>     Changing mmap_sem lock killable increases possibility of __oom_reap_vmas(mm)
>     (or __oom_reap_task(tsk)) to succeed. But due to the changes in (3) and (4),
>     there is no guarantee that TIF_MEMDIE is set to the thread which is looping at
>     __alloc_pages_slowpath() with the mmap_sem held for writing. If the OOM killer
>     were able to know which thread is looping at __alloc_pages_slowpath() with the
>     mmap_sem held for writing (via per task_struct variable), the OOM killer would
>     set TIF_MEMDIE on that thread before randomly choosing one thread using
>     find_lock_task_mm().

If the mmap_sem (for write) holder is looping in the allocator and the
process gets killed it will get access to memory reserves automatically,
so I am not sure what you mean here.

Thank you for your feedback. There are some improvements and additional
heuristics proposed and they might be really valuable in some cases but
I believe that none of the points you are raising are blockers for the
current code. My intention here is to push the initial version which
would handle the most probable cases and build more on top. I would
really prefer this doesn't grow into a hard to evaluate bloat from the
very beginning.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/2] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
@ 2016-01-26 16:38       ` Michal Hocko
  0 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2016-01-26 16:38 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, hannes, mgorman, rientjes, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

On Mon 18-01-16 13:35:44, Tetsuo Handa wrote:
[...]
> (1) Make the OOM reaper available on CONFIG_MMU=n kernels.
> 
>     I don't know about MMU, but I assume we can handle these errors.

What is the usecase for this on !MMU configurations? Why does it make
sense to add more code to such a restricted environments? I haven't
heard of a single OOM report from that land.

>     slub.c:(.text+0x4184): undefined reference to `tlb_gather_mmu'
>     slub.c:(.text+0x41bc): undefined reference to `unmap_page_range'
>     slub.c:(.text+0x41d8): undefined reference to `tlb_finish_mmu'
> 
> (2) Do not boot the system if failed to create the OOM reaper thread.
> 
>     We are already heavily depending on the OOM reaper.

Hohmm, does this really bother you that much? This all happens really
early during the boot. If a single kernel thread creation fails that
early then we are screwed anyway and OOM killer will not help a tiny
bit. The only place where the current benevolence matters is a test for
oom_reaper_th != NULL in wake_oom_reaper and I doubt it adds an
overhead. BUG_ON is suited for unrecoverable errors and we can clearly
live without oom_reaper.
 
>     pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
>                     PTR_ERR(oom_reaper_th));
> 
> (3) Eliminate locations that call mark_oom_victim() without
>     making the OOM victim task under monitor of the OOM reaper.
> 
>     The OOM reaper needs to take actions when the OOM victim task got stuck
>     because we (except me) do not want to use my sysctl-controlled timeout-
>     based OOM victim selection.

I do not think this is a correct way to approach the problem. I think we
should involve oom_reaper for those cases. I just want to do that in an
incremental steps. Originally I had the oom_reaper invocation in
mark_oom_victim but that didn't work out (for reasons I do not remember
right now and would have to find them in the archive).
[...]

> (4) Don't select an OOM victim until mm_to_reap (or task_to_reap) becomes NULL.

If we ever see a realistic case where the OOM killer hits in such a pace
that the oom reaper cannot cope with it then I would rather introduce a
queuing mechanism than add a complex code to synchronize the two
contexts. They are currently synchronized via TIF_MEMDIE and that should
be sufficient until the TIF_MEMDIE stops being the oom synchronization
point.

>     This is needed for making sure that any OOM victim is made under
>     monitor of the OOM reaper in order to let the OOM reaper take action
>     before leaving oom_reap_vmas() (or oom_reap_task()).
> 
>     Since the OOM reaper can do mm_to_reap (or task_to_reap) = NULL shortly
>     (e.g. within a second if it retries for 10 times with 0.1 second interval),
>     waiting should not become a problem.
> 
> (5) Decrease oom_score_adj value after the OOM reaper reclaimed memory.
> 
>     If __oom_reap_vmas(mm) (or __oom_reap_task(tsk)) succeeded, set oom_score_adj
>     value of all tasks sharing the same mm to -1000 (by walking the process list)
>     and clear TIF_MEMDIE.
> 
>     Changing only the OOM victim's oom_score_adj is not sufficient
>     when there are other thread groups sharing the OOM victim's memory
>     (i.e. clone(!CLONE_THREAD && CLONE_VM) case).
>
> (6) Decrease oom_score_adj value even if the OOM reaper failed to reclaim memory.
> 
>     If __oom_reap_vmas(mm) (or __oom_reap_task(tsk)) failed for 10 times, decrease
>     oom_score_adj value of all tasks sharing the same mm and clear TIF_MEMDIE.
>     This is needed for preventing the OOM killer from selecting the same thread
>     group forever.

I understand what you mean but I would consider this outside of the
scope of the patchset as I want to pursue it right now. I really want to
introduce simple async OOM handling. Further steps can be built on top
but please let's not make it a huge monster right away. The same applies
to point 5. An mm shared between processes is a borderline case, not
something to focus on in the first submission.

>     An example is, set oom_score_adj to -999 if oom_score_adj is greater than
>     -999, set -1000 if oom_score_adj is already -999. This will allow the OOM
>     killer try to choose different OOM victims before retrying __oom_reap_vmas(mm)
>     (or __oom_reap_task(tsk)) of this OOM victim, then trigger kernel panic if
>     all OOM victims got -1000.
> 
>     Changing mmap_sem lock killable increases possibility of __oom_reap_vmas(mm)
>     (or __oom_reap_task(tsk)) to succeed. But due to the changes in (3) and (4),
>     there is no guarantee that TIF_MEMDIE is set to the thread which is looping at
>     __alloc_pages_slowpath() with the mmap_sem held for writing. If the OOM killer
>     were able to know which thread is looping at __alloc_pages_slowpath() with the
>     mmap_sem held for writing (via per task_struct variable), the OOM killer would
>     set TIF_MEMDIE on that thread before randomly choosing one thread using
>     find_lock_task_mm().

If the mmap_sem (for write) holder is looping in the allocator and the
process gets killed it will get access to memory reserves automatically,
so I am not sure what you mean here.

Thank you for your feedback. There are some improvements and additional
heuristics proposed and they might be really valuable in some cases but
I believe that none of the points you are raising are blockers for the
current code. My intention here is to push the initial version which
would handle the most probable cases and build more on top. I would
really prefer this doesn't grow into hard-to-evaluate bloat from the
very beginning.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-01-06 15:42   ` Michal Hocko
@ 2016-01-28  1:28     ` David Rientjes
  -1 siblings, 0 replies; 56+ messages in thread
From: David Rientjes @ 2016-01-28  1:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

On Wed, 6 Jan 2016, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> independently brought up by Oleg Nesterov.
> 

Suggested-bys?

> The OOM killer currently allows to kill only a single task in a good
> hope that the task will terminate in a reasonable time and frees up its
> memory.  Such a task (oom victim) will get an access to memory reserves
> via mark_oom_victim to allow a forward progress should there be a need
> for additional memory during exit path.
> 
> It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
> construct workloads which break the core assumption mentioned above and
> the OOM victim might take unbounded amount of time to exit because it
> might be blocked in the uninterruptible state waiting for on an event
> (e.g. lock) which is blocked by another task looping in the page
> allocator.
> 

s/for on/for/

I think it would be good to note in either of the two paragraphs above 
that each victim is per-memcg hierarchy or system-wide and the oom reaper 
is used for memcg oom conditions as well.  Otherwise, there's no mention 
of the memcg usecase.

> This patch reduces the probability of such a lockup by introducing a
> specialized kernel thread (oom_reaper) which tries to reclaim additional
> memory by preemptively reaping the anonymous or swapped out memory
> owned by the oom victim under an assumption that such a memory won't
> be needed when its owner is killed and kicked from the userspace anyway.
> There is one notable exception to this, though, if the OOM victim was
> in the process of coredumping the result would be incomplete. This is
> considered a reasonable constrain because the overall system health is
> more important than debugability of a particular application.
> 
> A kernel thread has been chosen because we need a reliable way of
> invocation so workqueue context is not appropriate because all the
> workers might be busy (e.g. allocating memory). Kswapd which sounds
> like another good fit is not appropriate as well because it might get
> blocked on locks during reclaim as well.
> 

Very good points.  And I think this makes the case clear that oom_reaper 
is really a best-effort solution.

> oom_reaper has to take mmap_sem on the target task for reading so the
> solution is not 100% because the semaphore might be held or blocked for
> write but the probability is reduced considerably wrt. basically any
> lock blocking forward progress as described above. In order to prevent
> from blocking on the lock without any forward progress we are using only
> a trylock and retry 10 times with a short sleep in between.
> Users of mmap_sem which need it for write should be carefully reviewed
> to use _killable waiting as much as possible and reduce allocations
> requests done with the lock held to absolute minimum to reduce the risk
> even further.
> 
> The API between oom killer and oom reaper is quite trivial. wake_oom_reaper
> updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition
> and oom_reaper clear this atomically once it is done with the work. This
> means that only a single mm_struct can be reaped at the time. As the
> operation is potentially disruptive we are trying to limit it to the
> ncessary minimum and the reaper blocks any updates while it operates on
> an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a
> race is detected by atomic_inc_not_zero(mm_users).
> 
> Changes since v3
> - many style/compile fixups by Andrew
> - unmap_mapping_range_tree needs full initialization of zap_details
>   to prevent from missing unmaps and follow up BUG_ON during truncate
>   resp. misaccounting - Kirill/Andrew
> - exclude mlocked pages because they need an explicit munlock by Kirill
> - use subsys_initcall instead of module_init - Paul Gortmaker
> Changes since v2
> - fix mm_count refernce leak reported by Tetsuo
> - make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
> - use wait_event_freezable rather than open coded wait loop - suggested
>   by Tetsuo
> Changes since v1
> - fix the screwed up detail->check_swap_entries - Johannes
> - do not use kthread_should_stop because that would need a cleanup
>   and we do not have anybody to stop us - Tetsuo
> - move wake_oom_reaper to oom_kill_process because we have to wait
>   for all tasks sharing the same mm to get killed - Tetsuo
> - do not reap mm structs which are shared with unkillable tasks - Tetsuo
> 
> Acked-by: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/mm.h |   2 +
>  mm/internal.h      |   5 ++
>  mm/memory.c        |  17 +++---
>  mm/oom_kill.c      | 157 +++++++++++++++++++++++++++++++++++++++++++++++++++--
>  4 files changed, 170 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 25cdec395f2c..d1ce03569942 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1061,6 +1061,8 @@ struct zap_details {
>  	struct address_space *check_mapping;	/* Check page->mapping if set */
>  	pgoff_t	first_index;			/* Lowest page->index to unmap */
>  	pgoff_t last_index;			/* Highest page->index to unmap */
> +	bool ignore_dirty;			/* Ignore dirty pages */
> +	bool check_swap_entries;		/* Check also swap entries */
>  };
>  
>  struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> diff --git a/mm/internal.h b/mm/internal.h
> index 4ae7b7c7462b..9006ce1960ff 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -41,6 +41,11 @@ extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>  		unsigned long floor, unsigned long ceiling);
>  
> +void unmap_page_range(struct mmu_gather *tlb,
> +			     struct vm_area_struct *vma,
> +			     unsigned long addr, unsigned long end,
> +			     struct zap_details *details);
> +
>  static inline void set_page_count(struct page *page, int v)
>  {
>  	atomic_set(&page->_count, v);
> diff --git a/mm/memory.c b/mm/memory.c
> index f5b8e8c9f4c3..f60c6d6aa633 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1104,6 +1104,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  
>  			if (!PageAnon(page)) {
>  				if (pte_dirty(ptent)) {
> +					/*
> +					 * oom_reaper cannot tear down dirty
> +					 * pages
> +					 */
> +					if (unlikely(details && details->ignore_dirty))
> +						continue;
>  					force_flush = 1;
>  					set_page_dirty(page);
>  				}
> @@ -1122,8 +1128,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  			}
>  			continue;
>  		}
> -		/* If details->check_mapping, we leave swap entries. */
> -		if (unlikely(details))
> +		/* only check swap_entries if explicitly asked for in details */
> +		if (unlikely(details && !details->check_swap_entries))
>  			continue;
>  
>  		entry = pte_to_swp_entry(ptent);
> @@ -1228,7 +1234,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
>  	return addr;
>  }
>  
> -static void unmap_page_range(struct mmu_gather *tlb,
> +void unmap_page_range(struct mmu_gather *tlb,
>  			     struct vm_area_struct *vma,
>  			     unsigned long addr, unsigned long end,
>  			     struct zap_details *details)
> @@ -1236,9 +1242,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
>  	pgd_t *pgd;
>  	unsigned long next;
>  
> -	if (details && !details->check_mapping)
> -		details = NULL;
> -
>  	BUG_ON(addr >= end);
>  	tlb_start_vma(tlb, vma);
>  	pgd = pgd_offset(vma->vm_mm, addr);
> @@ -2393,7 +2396,7 @@ static inline void unmap_mapping_range_tree(struct rb_root *root,
>  void unmap_mapping_range(struct address_space *mapping,
>  		loff_t const holebegin, loff_t const holelen, int even_cows)
>  {
> -	struct zap_details details;
> +	struct zap_details details = { };
>  	pgoff_t hba = holebegin >> PAGE_SHIFT;
>  	pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index dc490c06941b..1ece40b94725 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -35,6 +35,11 @@
>  #include <linux/freezer.h>
>  #include <linux/ftrace.h>
>  #include <linux/ratelimit.h>
> +#include <linux/kthread.h>
> +#include <linux/init.h>
> +
> +#include <asm/tlb.h>
> +#include "internal.h"
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/oom.h>
> @@ -408,6 +413,141 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
>  
>  bool oom_killer_disabled __read_mostly;
>  
> +#ifdef CONFIG_MMU
> +/*
> + * OOM Reaper kernel thread which tries to reap the memory used by the OOM
> + * victim (if that is possible) to help the OOM killer to move on.
> + */
> +static struct task_struct *oom_reaper_th;
> +static struct mm_struct *mm_to_reap;
> +static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
> +
> +static bool __oom_reap_vmas(struct mm_struct *mm)
> +{
> +	struct mmu_gather tlb;
> +	struct vm_area_struct *vma;
> +	struct zap_details details = {.check_swap_entries = true,
> +				      .ignore_dirty = true};
> +	bool ret = true;
> +
> +	/* We might have raced with exit path */
> +	if (!atomic_inc_not_zero(&mm->mm_users))
> +		return true;
> +
> +	if (!down_read_trylock(&mm->mmap_sem)) {
> +		ret = false;
> +		goto out;
> +	}
> +
> +	tlb_gather_mmu(&tlb, mm, 0, -1);
> +	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> +		if (is_vm_hugetlb_page(vma))
> +			continue;
> +
> +		/*
> +		 * mlocked VMAs require explicit munlocking before unmap.
> +		 * Let's keep it simple here and skip such VMAs.
> +		 */
> +		if (vma->vm_flags & VM_LOCKED)
> +			continue;

Shouldn't there be VM_PFNMAP handling here?

I'm wondering why zap_page_range() for vma->vm_start to vma->vm_end wasn't 
used here for simplicity?  It appears as though what you're doing is an 
MADV_DONTNEED over the length of all anonymous vmas that aren't shared, so 
why not have such an implementation in a single place so any changes don't 
have to be made in two different spots for things such as VM_PFNMAP?
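
To make the MADV_DONTNEED comparison concrete, here is a small userspace
sketch of what that operation does to a private anonymous mapping on
Linux: the dirtied pages are dropped and the next access faults in fresh
zero pages. The helper name byte_after_dontneed() is made up for this
example, and it says nothing about how the reaper itself should be
implemented.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS and madvise() on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Map @len bytes of private anonymous memory, dirty it, drop it with
 * MADV_DONTNEED and report what the first byte reads back as afterwards.
 * Returns -1 if mmap() or madvise() fails.
 */
static int byte_after_dontneed(size_t len)
{
	unsigned char *p;
	int value;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xaa, len);		/* fault the pages in and dirty them */
	assert(p[0] == 0xaa);

	if (madvise(p, len, MADV_DONTNEED) != 0) {
		munmap(p, len);
		return -1;
	}

	value = p[0];			/* old contents are gone at this point */
	munmap(p, len);
	return value;
}
```

On Linux, calling byte_after_dontneed(1 << 16) should return 0 rather
than 0xaa, which is the same "contents are lost" property that makes this
kind of unmapping acceptable only for a task that is being killed anyway.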

> +
> +		/*
> +		 * Only anonymous pages have a good chance to be dropped
> +		 * without additional steps which we cannot afford as we
> +		 * are OOM already.
> +		 *
> +		 * We do not even care about fs backed pages because all
> +		 * which are reclaimable have already been reclaimed and
> +		 * we do not want to block exit_mmap by keeping mm ref
> +		 * count elevated without a good reason.
> +		 */
> +		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> +			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
> +					 &details);
> +	}
> +	tlb_finish_mmu(&tlb, 0, -1);
> +	up_read(&mm->mmap_sem);
> +out:
> +	mmput(mm);
> +	return ret;
> +}
> +
> +static void oom_reap_vmas(struct mm_struct *mm)
> +{
> +	int attempts = 0;
> +
> +	/* Retry the down_read_trylock(mmap_sem) a few times */
> +	while (attempts++ < 10 && !__oom_reap_vmas(mm))
> +		schedule_timeout_idle(HZ/10);
> +
> +	/* Drop a reference taken by wake_oom_reaper */
> +	mmdrop(mm);
> +}
> +
> +static int oom_reaper(void *unused)
> +{
> +	while (true) {
> +		struct mm_struct *mm;
> +
> +		wait_event_freezable(oom_reaper_wait,
> +				     (mm = READ_ONCE(mm_to_reap)));
> +		oom_reap_vmas(mm);
> +		WRITE_ONCE(mm_to_reap, NULL);
> +	}
> +
> +	return 0;
> +}
> +
> +static void wake_oom_reaper(struct mm_struct *mm)
> +{
> +	struct mm_struct *old_mm;
> +
> +	if (!oom_reaper_th)
> +		return;
> +
> +	/*
> +	 * Pin the given mm. Use mm_count instead of mm_users because
> +	 * we do not want to delay the address space tear down.
> +	 */
> +	atomic_inc(&mm->mm_count);
> +
> +	/*
> +	 * Make sure that only a single mm is ever queued for the reaper
> +	 * because multiple are not necessary and the operation might be
> +	 * disruptive so better reduce it to the bare minimum.
> +	 */
> +	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
> +	if (!old_mm)
> +		wake_up(&oom_reaper_wait);
> +	else
> +		mmdrop(mm);

This behavior is probably the only really significant concern I have about 
the patch: we just drop the mm and don't try any reaping if there is 
already reaping in progress.

We don't always have control over the amount of memory that can be reaped 
from the victim, either because of oom kill prioritization through 
/proc/pid/oom_score_adj or because the memory of the victim is not 
eligible.

I'm imagining a scenario where the oom reaper has raced with a follow-up 
oom kill before mm_to_reap has been set to NULL so there's no subsequent 
reaping.  It's also possible that oom reaping of the first victim actually 
freed little memory.

Would it really be difficult to queue mm's to reap from?  If memory has 
already been freed before the reaper can get to it, the 
find_lock_task_mm() should just fail and we're done.  I'm not sure why 
this is being limited to a single mm system-wide.
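
As a rough illustration of the kind of queueing I have in mind, here is a
userspace model using C11 atomics: a lock-free push on the killer side and
a "take everything" exchange on the reaper side, so no request is dropped
when the reaper is busy. All names (struct reap_node, reap_list,
queue_for_reaper(), take_all_queued()) are hypothetical, and this is a
sketch of the idea rather than a proposed kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical queue node. A real version would presumably be embedded in
 * the structure being reaped instead of carrying an id.
 */
struct reap_node {
	struct reap_node *next;
	int id;				/* stands in for the mm to reap */
};

static _Atomic(struct reap_node *) reap_list;

/* Killer side: push @node onto the list; a busy reaper loses no request. */
static void queue_for_reaper(struct reap_node *node)
{
	struct reap_node *head = atomic_load(&reap_list);

	assert(node != NULL);
	do {
		node->next = head;	/* head is refreshed when the CAS fails */
	} while (!atomic_compare_exchange_weak(&reap_list, &head, node));
}

/* Reaper side: detach the whole pending list in one step. */
static struct reap_node *take_all_queued(void)
{
	return atomic_exchange(&reap_list, NULL);
}
```

Note that the detached list comes back newest-first, so a reaper that
cares about ordering would have to reverse it before walking it.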

> +}
> +
> +static int __init oom_init(void)
> +{
> +	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> +	if (IS_ERR(oom_reaper_th)) {
> +		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
> +				PTR_ERR(oom_reaper_th));
> +		oom_reaper_th = NULL;
> +	} else {
> +		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
> +
> +		/*
> +		 * Make sure our oom reaper thread will get scheduled when
> +		 * ASAP and that it won't get preempted by malicious userspace.
> +		 */
> +		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);

Eeek, have you really shown this is necessary?  I would imagine that we would 
want to limit high priority processes system-wide and that we wouldn't 
want to be interfered with by memcg oom conditions that trigger the oom 
reaper, for example.

> +	}
> +	return 0;
> +}
> +subsys_initcall(oom_init)
> +#else
> +static void wake_oom_reaper(struct mm_struct *mm)
> +{
> +}
> +#endif
> +
>  /**
>   * mark_oom_victim - mark the given task as OOM victim
>   * @tsk: task to mark
> @@ -517,6 +657,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	unsigned int victim_points = 0;
>  	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>  					      DEFAULT_RATELIMIT_BURST);
> +	bool can_oom_reap = true;
>  
>  	/*
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
> @@ -607,17 +748,25 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			continue;
>  		if (same_thread_group(p, victim))
>  			continue;
> -		if (unlikely(p->flags & PF_KTHREAD))
> -			continue;
>  		if (is_global_init(p))
>  			continue;
> -		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> +		if (unlikely(p->flags & PF_KTHREAD) ||
> +		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> +			/*
> +			 * We cannot use oom_reaper for the mm shared by this
> +			 * process because it wouldn't get killed and so the
> +			 * memory might be still used.
> +			 */
> +			can_oom_reap = false;
>  			continue;
> -
> +		}
>  		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);

Is it possible to just do wake_oom_reaper(mm) here and eliminate 
can_oom_reap with a little bit of moving around?

>  	}
>  	rcu_read_unlock();
>  
> +	if (can_oom_reap)
> +		wake_oom_reaper(mm);
> +
>  	mmdrop(mm);
>  	put_task_struct(victim);
>  }

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/2] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-01-26 16:38       ` Michal Hocko
@ 2016-01-28 11:24         ` Tetsuo Handa
  -1 siblings, 0 replies; 56+ messages in thread
From: Tetsuo Handa @ 2016-01-28 11:24 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, hannes, mgorman, rientjes, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

Michal Hocko wrote:
> On Mon 18-01-16 13:35:44, Tetsuo Handa wrote:
> [...]
> > (1) Make the OOM reaper available on CONFIG_MMU=n kernels.
> > 
> >     I don't know about MMU, but I assume we can handle these errors.
> 
> What is the usecase for this on !MMU configurations? Why does it make
> sense to add more code to such a restricted environments? I haven't
> heard of a single OOM report from that land.
> 

To guarantee that the OOM reaper always takes care of OOM victims. What
I'm asking for is a guarantee, not best effort. I'm fed up with responding
to unexpected corner cases.

If you agree to delegate the duty of handling such corner cases to a
guaranteed last resort
( http://lkml.kernel.org/r/201601222259.GJB90663.MLOJtFFOQFVHSO@I-love.SAKURA.ne.jp ),
I will stop pointing out such corner cases in the OOM reaper.

> >     slub.c:(.text+0x4184): undefined reference to `tlb_gather_mmu'
> >     slub.c:(.text+0x41bc): undefined reference to `unmap_page_range'
> >     slub.c:(.text+0x41d8): undefined reference to `tlb_finish_mmu'
> > 
> > (2) Do not boot the system if failed to create the OOM reaper thread.
> > 
> >     We are already heavily depending on the OOM reaper.
> 
> Hohmm, does this really bother you that much? This all happens really
> early during the boot. If a single kernel thread creation fails that
> early then we are screwed anyway and OOM killer will not help a tiny
> bit. The only place where the current benevolence matters is a test for
> oom_reaper_th != NULL in wake_oom_reaper and I doubt it adds an
> overhead. BUG_ON is suited for unrecoverable errors and we can clearly
> live without oom_reaper.
>  

The OOM reaper is the only chain you are trying to add, but we can't make a
guarantee without a reliable chain for reclaiming memory (unless we delegate
the duty of handling such corner cases to a guaranteed last resort).

By the way, BUG_ON() also makes mm/oom_kill.o a bit smaller. ;-)

  pr_err(): 18624 bytes
  BUG_ON(): 18432 bytes

> >     pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
> >                     PTR_ERR(oom_reaper_th));
> > 
> > (3) Eliminate locations that call mark_oom_victim() without
> >     making the OOM victim task under monitor of the OOM reaper.
> > 
> >     The OOM reaper needs to take actions when the OOM victim task got stuck
> >     because we (except me) do not want to use my sysctl-controlled timeout-
> >     based OOM victim selection.
> 
> I do not think this is a correct way to approach the problem. I think we
> should involve oom_reaper for those cases. I just want to do that in an
> incremental steps. Originally I had the oom_reaper invocation in
> mark_oom_victim but that didn't work out (for reasons I do not remember
> right now and would have to find them in the archive).
> [...]

The archive is http://lkml.kernel.org/r/201511270024.DFJ57385.OFtJQSMOFFLOHV@I-love.SAKURA.ne.jp .
These threads might be sharing memory with OOM-unkillable threads or not-yet-SIGKILL-pending
threads. In that case, the OOM reaper must not reap memory (which means we disconnect
the chain for reclaiming memory). I showed you a polling version ("[PATCH 4/2] oom: change
OOM reaper to walk the process list") for handling those cases.

> 
> > (4) Don't select an OOM victim until mm_to_reap (or task_to_reap) becomes NULL.
> 
> If we ever see a realistic case where the OOM killer hits in such a pace
> that the oom reaper cannot cope with it then I would rather introduce a
> queuing mechanism than add a complex code to synchronize the two
> contexts. They are currently synchronized via TIF_MEMDIE and that should
> be sufficient until the TIF_MEMDIE stops being the oom synchronization
> point.
> 

I think that the TIF_MEMDIE is not the OOM synchronization point for the OOM
reaper because the OOM killer sets TIF_MEMDIE on only one thread.

Say, there is a process with three threads (T1, T2, T3). We assume that, when
the OOM killer sends SIGKILL on this process and sets TIF_MEMDIE on T1, T1 is
about to call do_exit() and T2 is waiting at mutex_lock() and T3 is looping
inside the allocator with mmap_sem held for write.

When T1 got TIF_MEMDIE, wake_oom_reaper() is called and task_to_reap becomes T1.
But it is possible that T3 is sleeping at schedule_timeout_uninterruptible(1) in
__alloc_pages_may_oom().

The OOM reaper tries to reap T1's memory, but the first down_read_trylock(&mm->mmap_sem)
attempt fails because T3 is sleeping for a jiffy with mmap_sem held for write. Then the
OOM reaper sleeps for 0.1 second at schedule_timeout_idle(HZ/10) in oom_reap_task().

While the OOM reaper is sleeping, T1 exits and clears TIF_MEMDIE, which can result
in choosing T2 as next OOM victim and calling wake_oom_reaper() before task_to_reap
becomes NULL. Now, the chain for reclaiming memory got disconnected.

T3 will get TIF_MEMDIE because T3 will eventually succeed at mutex_trylock(&oom_lock),
get TIF_MEMDIE by calling out_of_memory(), complete the current memory allocation
request and call up_write(&mm->mmap_sem). But where is the guarantee that T3 can do that
before the OOM reaper gives up waiting for T3? If the OOM reaper fails to reap that
process's memory, task_to_reap becomes NULL while T2 already has TIF_MEMDIE without being
under the control of the OOM reaper, and the OOM killer will not choose the next OOM victim.

Where is the guarantee that T2 can make forward progress after T3 successfully exited
and cleared TIF_MEMDIE? Since the OOM reaper is not monitoring T2 which has TIF_MEMDIE,
we can hit OOM livelock.

Polling or queuing is needed for handling such corner cases.
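
The sequence above can be replayed with a small user-space model of the single
task_to_reap slot. This is only a sketch under stated assumptions:
struct task_model, wake_oom_reaper_model(), reaper_gives_up_model() and
replay_lost_victim() are hypothetical names made up for illustration, T3 is
represented only by the comment about the failed trylock, and the real code
uses cmpxchg() and a kernel thread rather than plain assignments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; only the TIF_MEMDIE bit is modelled. */
struct task_model {
	const char *name;
	int tif_memdie;
};

static struct task_model *task_to_reap;	/* single slot, as in the patch */

/* Returns 1 if the victim is now monitored by the reaper, 0 if it was dropped. */
static int wake_oom_reaper_model(struct task_model *tsk)
{
	if (task_to_reap)
		return 0;		/* slot busy: models the cmpxchg() failing */
	task_to_reap = tsk;
	return 1;
}

/* Models the reaper running out of retries and clearing the slot. */
static void reaper_gives_up_model(void)
{
	task_to_reap = NULL;
}

/* Replays T1/T2; returns 1 if T2 ends up with TIF_MEMDIE but unmonitored. */
static int replay_lost_victim(void)
{
	struct task_model t1 = { "T1", 0 };
	struct task_model t2 = { "T2", 0 };
	int t2_monitored;

	task_to_reap = NULL;
	t1.tif_memdie = 1;
	wake_oom_reaper_model(&t1);	/* reaper starts working on T1 */
	/* the reaper's trylock fails (T3 holds mmap_sem), so it sleeps ... */
	t1.tif_memdie = 0;		/* ... and meanwhile T1 exits */
	t2.tif_memdie = 1;		/* the OOM killer picks T2 next */
	t2_monitored = wake_oom_reaper_model(&t2);
	reaper_gives_up_model();	/* retries for T1 are exhausted */
	return t2.tif_memdie && !t2_monitored && task_to_reap == NULL;
}
```

In this model replay_lost_victim() returns 1: T2 holds TIF_MEMDIE, the slot is
empty again, and nothing is left that would ever look at T2.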

> >     This is needed for making sure that any OOM victim is made under
> >     monitor of the OOM reaper in order to let the OOM reaper take action
> >     before leaving oom_reap_vmas() (or oom_reap_task()).
> > 
> >     Since the OOM reaper can do mm_to_reap (or task_to_reap) = NULL shortly
> >     (e.g. within a second if it retries for 10 times with 0.1 second interval),
> >     waiting should not become a problem.
> > 
> > (5) Decrease oom_score_adj value after the OOM reaper reclaimed memory.
> > 
> >     If __oom_reap_vmas(mm) (or __oom_reap_task(tsk)) succeeded, set oom_score_adj
> >     value of all tasks sharing the same mm to -1000 (by walking the process list)
> >     and clear TIF_MEMDIE.
> > 

I guess I need to drop the "and clear TIF_MEMDIE" part, because clearing TIF_MEMDIE needs
a switch to avoid re-setting TIF_MEMDIE forever, and such a switch is complicated.

> >     Changing only the OOM victim's oom_score_adj is not sufficient
> >     when there are other thread groups sharing the OOM victim's memory
> >     (i.e. clone(!CLONE_THREAD && CLONE_VM) case).
> >
> > (6) Decrease oom_score_adj value even if the OOM reaper failed to reclaim memory.
> > 
> >     If __oom_reap_vmas(mm) (or __oom_reap_task(tsk)) failed for 10 times, decrease
> >     oom_score_adj value of all tasks sharing the same mm and clear TIF_MEMDIE.
> >     This is needed for preventing the OOM killer from selecting the same thread
> >     group forever.
> 
> I understand what you mean but I would consider this outside of the
> scope of the patchset as I want to pursue it right now. I really want to
> introduce a simple async OOM handling. Further steps can be built on top
> but please let's not make it a huge monster right away. The same applies
> to the point 5. mm shared between processes is a border line to focus on
> it in the first submission.
> 

I really do not want you to pursue the OOM reaper without providing a guarantee
that the chain for reclaiming memory never gets disconnected.

> >     An example is, set oom_score_adj to -999 if oom_score_adj is greater than
> >     -999, set -1000 if oom_score_adj is already -999. This will allow the OOM
> >     killer try to choose different OOM victims before retrying __oom_reap_vmas(mm)
> >     (or __oom_reap_task(tsk)) of this OOM victim, then trigger kernel panic if
> >     all OOM victims got -1000.
> > 
> >     Changing mmap_sem lock killable increases possibility of __oom_reap_vmas(mm)
> >     (or __oom_reap_task(tsk)) to succeed. But due to the changes in (3) and (4),
> >     there is no guarantee that TIF_MEMDIE is set to the thread which is looping at
> >     __alloc_pages_slowpath() with the mmap_sem held for writing. If the OOM killer
> >     were able to know which thread is looping at __alloc_pages_slowpath() with the
> >     mmap_sem held for writing (via per task_struct variable), the OOM killer would
> >     set TIF_MEMDIE on that thread before randomly choosing one thread using
> >     find_lock_task_mm().
> 
> If mmap_sem (for write) holder is looping in the allocator and the
> process gets killed it will get access to memory reserves automatically,
> so I am not sure what do you mean here.
> 

Please stop assuming that "dying tasks get access to memory reserves
automatically (by getting TIF_MEMDIE)". It is broken.

Dying tasks might be doing !__GFP_FS allocation requests. Even if we set
TIF_MEMDIE to all threads which should terminate (in case they are doing
!__GFP_FS allocation requests), there is no guarantee that they will not
wait for locks in unkillable state after their current memory allocation
request completes (e.g. getname() followed by mutex_lock() shown at
http://lkml.kernel.org/r/201509290118.BCJ43256.tSFFFMOLHVOJOQ@I-love.SAKURA.ne.jp ).

Dying tasks cannot access memory reserves unless they are doing memory
allocation. Therefore, we cannot wait for dying tasks forever, even if
the OOM reaper can reclaim some memory.

Please do not disconnect the chain for reclaiming memory.

> Thank you for your feedback. There are some improvements and additional
> heuristics proposed and they might be really valuable in some cases but
> I believe that none of the points you are rising are blockers for the
> current code. My intention here is to push the initial version which
> would handle the most probable cases and build more on top. I would
> really prefer this doesn't grow into a hard to evaluate bloat from the
> early beginning.

I like the OOM reaper approach but I can't agree to merging the OOM reaper
without providing a guaranteed last resort at the same time. If you do want
to start with the OOM reaper as simple as possible (without being bothered by
a lot of possible corner cases), please pursue a guaranteed last resort
at the same time.

We can remove the guaranteed last resort after we have made the OOM reaper (and
other preferable approaches) reliable enough through incremental development.

Even if all OOM-killable tasks get OOM-killed, that is better than being unable to
log in via ssh due to OOM livelock, and better than being tortured by unhandled
corner cases.

If you don't like the guaranteed last resort, I can live with kmallocwd.
Without a means to understand what is happening, we will simply pretend
there is no bug because we are not receiving reports from users. Most users
are not as skilled as you at reporting problems related to memory allocation.
Would you please stop torturing administrators and technical staff
at support centers with unexplained hangups right now?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-01-28  1:28     ` David Rientjes
@ 2016-01-28 21:42       ` Michal Hocko
  -1 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2016-01-28 21:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Wed 27-01-16 17:28:10, David Rientjes wrote:
> On Wed, 6 Jan 2016, Michal Hocko wrote:
> 
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> > independently brought up by Oleg Nesterov.
> > 
> 
> Suggested-bys?

Sure, why not.
 
> > The OOM killer currently allows to kill only a single task in a good
> > hope that the task will terminate in a reasonable time and frees up its
> > memory.  Such a task (oom victim) will get an access to memory reserves
> > via mark_oom_victim to allow a forward progress should there be a need
> > for additional memory during exit path.
> > 
> > It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
> > construct workloads which break the core assumption mentioned above and
> > the OOM victim might take unbounded amount of time to exit because it
> > might be blocked in the uninterruptible state waiting for on an event
> > (e.g. lock) which is blocked by another task looping in the page
> > allocator.
> > 
> 
> s/for on/for/

fixed
 
> I think it would be good to note in either of the two paragraphs above 
> that each victim is per-memcg hierarchy or system-wide and the oom reaper 
> is used for memcg oom conditions as well.  Otherwise, there's no mention 
> of the memcg usecase.

I didn't mention the memcg usecase because it doesn't suffer from the
deadlock issue: the memcg OOM is invoked from a lockless context. I
think mentioning it would just make the wording more confusing.

[...]
> > +static bool __oom_reap_vmas(struct mm_struct *mm)
> > +{
> > +	struct mmu_gather tlb;
> > +	struct vm_area_struct *vma;
> > +	struct zap_details details = {.check_swap_entries = true,
> > +				      .ignore_dirty = true};
> > +	bool ret = true;
> > +
> > +	/* We might have raced with exit path */
> > +	if (!atomic_inc_not_zero(&mm->mm_users))
> > +		return true;
> > +
> > +	if (!down_read_trylock(&mm->mmap_sem)) {
> > +		ret = false;
> > +		goto out;
> > +	}
> > +
> > +	tlb_gather_mmu(&tlb, mm, 0, -1);
> > +	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > +		if (is_vm_hugetlb_page(vma))
> > +			continue;
> > +
> > +		/*
> > +		 * mlocked VMAs require explicit munlocking before unmap.
> > +		 * Let's keep it simple here and skip such VMAs.
> > +		 */
> > +		if (vma->vm_flags & VM_LOCKED)
> > +			continue;
> 
> Shouldn't there be VM_PFNMAP handling here?

What would be the reason to exclude them?

> I'm wondering why zap_page_range() for vma->vm_start to vma->vm_end wasn't 
> used here for simplicity?

I didn't use zap_page_range because I wanted to have full control over
what gets torn down and how. E.g. it is much easier to skip over
hugetlb pages than to rely on i_mmap_lock_write, which might block
and get the whole oom_reaper stuck.

[...]
> > +static void wake_oom_reaper(struct mm_struct *mm)
> > +{
> > +	struct mm_struct *old_mm;
> > +
> > +	if (!oom_reaper_th)
> > +		return;
> > +
> > +	/*
> > +	 * Pin the given mm. Use mm_count instead of mm_users because
> > +	 * we do not want to delay the address space tear down.
> > +	 */
> > +	atomic_inc(&mm->mm_count);
> > +
> > +	/*
> > +	 * Make sure that only a single mm is ever queued for the reaper
> > +	 * because multiple are not necessary and the operation might be
> > +	 * disruptive so better reduce it to the bare minimum.
> > +	 */
> > +	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
> > +	if (!old_mm)
> > +		wake_up(&oom_reaper_wait);
> > +	else
> > +		mmdrop(mm);
> 
> This behavior is probably the only really significant concern I have about 
> the patch: we just drop the mm and don't try any reaping if there is 
> already reaping in progress.

This is based on the assumption that the OOM killer will not select another
task to kill until the previous one drops its TIF_MEMDIE. Should this
change in the future we will have to come up with a queuing mechanism. I
didn't want to do it right away, to keep the change as simple as
possible.

> We don't always have control over the amount of memory that can be reaped 
> from the victim, either because of oom kill prioritization through 
> /proc/pid/oom_score_adj or because the memory of the victim is not 
> eligible.
> 
> I'm imagining a scenario where the oom reaper has raced with a follow-up 
> oom kill before mm_to_reap has been set to NULL so there's no subsequent 
> reaping.  It's also possible that oom reaping of the first victim actually 
> freed little memory.
> 
> Would it really be difficult to queue mm's to reap from?  If memory has 
> already been freed before the reaper can get to it, the 
> find_lock_task_mm() should just fail and we're done.  I'm not sure why 
> this is being limited to a single mm system-wide.

It is not that complicated but I believe we can implement it on top once
we see this is really needed. So unless this is a strong requirement I
would rather go with a simpler way.

> > +}
> > +
> > +static int __init oom_init(void)
> > +{
> > +	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> > +	if (IS_ERR(oom_reaper_th)) {
> > +		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
> > +				PTR_ERR(oom_reaper_th));
> > +		oom_reaper_th = NULL;
> > +	} else {
> > +		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
> > +
> > +		/*
> > +		 * Make sure our oom reaper thread will get scheduled when
> > +		 * ASAP and that it won't get preempted by malicious userspace.
> > +		 */
> > +		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
> 
> Eeek, do you really show this is necessary?  I would imagine that we would 
> want to limit high priority processes system-wide and that we wouldn't 
> want to be interferred with by memcg oom conditions that trigger the oom 
> reaper, for example.

The idea was that we do not want to allow a high priority userspace to
preempt this important operation. I do understand your concern about the
memcg oom interference but I find it more important that oom_reaper is
runnable when needed. I guess that memcg oom heavy loads can change the
priority from userspace if necessary?

[...]
> > @@ -607,17 +748,25 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  			continue;
> >  		if (same_thread_group(p, victim))
> >  			continue;
> > -		if (unlikely(p->flags & PF_KTHREAD))
> > -			continue;
> >  		if (is_global_init(p))
> >  			continue;
> > -		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> > +		if (unlikely(p->flags & PF_KTHREAD) ||
> > +		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> > +			/*
> > +			 * We cannot use oom_reaper for the mm shared by this
> > +			 * process because it wouldn't get killed and so the
> > +			 * memory might be still used.
> > +			 */
> > +			can_oom_reap = false;
> >  			continue;
> > -
> > +		}
> >  		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> 
> Is it possible to just do wake_oom_reaper(mm) here and eliminate 
> can_oom_reap with a little bit of moving around?

I am not sure what you mean. We have to check all processes before
we can tell that reaping is safe. Care to elaborate some more? I am all
for making the code easier to follow and understand.
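
To make the point above concrete, here is a minimal userspace sketch
(plain C, not kernel code) of the scan-then-decide pattern. The names
fake_task and can_reap_after_scan are hypothetical stand-ins for the
loop in oom_kill_process(). It only shows that a single unkillable
sharer found late in the scan vetoes reaping, so the decision can only
be made after the loop has finished.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few task properties the loop inspects. */
struct fake_task {
	int mm_id;		/* which address space the task uses */
	bool is_kthread;	/* models PF_KTHREAD */
	bool oom_disabled;	/* models oom_score_adj == OOM_SCORE_ADJ_MIN */
};

/*
 * Walk every task and return true only if all users of victim_mm can be
 * killed. The answer is not known until the last task has been checked.
 */
static bool can_reap_after_scan(const struct fake_task *tasks, size_t n,
				int victim_mm)
{
	bool can_oom_reap = true;

	for (size_t i = 0; i < n; i++) {
		if (tasks[i].mm_id != victim_mm)
			continue;
		if (tasks[i].is_kthread || tasks[i].oom_disabled) {
			/* This sharer survives, so the mm may still be used. */
			can_oom_reap = false;
			continue;
		}
		/* The real code would send SIGKILL to this task here. */
	}
	return can_oom_reap;
}
```

Waking the reaper from inside the loop would act on a partial answer:
the first sharer may look killable while a later one is not.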

> 
> >  	}
> >  	rcu_read_unlock();
> >  
> > +	if (can_oom_reap)
> > +		wake_oom_reaper(mm);
> > +
> >  	mmdrop(mm);
> >  	put_task_struct(victim);
> >  }

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/2] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-01-28 11:24         ` Tetsuo Handa
@ 2016-01-28 21:51           ` Michal Hocko
  -1 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2016-01-28 21:51 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, hannes, mgorman, rientjes, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

On Thu 28-01-16 20:24:36, Tetsuo Handa wrote:
[...]
> I like the OOM reaper approach but I can't agree on merging the OOM reaper
> without providing a guaranteed last resort at the same time. If you do want
> to start the OOM reaper as simple as possible (without being bothered by
> a lot of possible corner cases), please pursue a guaranteed last resort
> at the same time.

I am getting tired of this level of argumentation. oom_reaper in its
current form is a step forward. I have acknowledged there are possible
improvements doable on top but I do not see them as necessary for the core
part being merged. I am not trying to rush this in because I am very
well aware of how subtle and complex all the interactions might be.
So please stop your "we must have it all at once" attitude. This is
nothing we have to rush in. We are not talking about a regression which
has to be absolutely fixed in a few days.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/2] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-01-28 21:51           ` Michal Hocko
@ 2016-01-28 22:26             ` Tetsuo Handa
  -1 siblings, 0 replies; 56+ messages in thread
From: Tetsuo Handa @ 2016-01-28 22:26 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, hannes, mgorman, rientjes, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

Michal Hocko wrote:
> On Thu 28-01-16 20:24:36, Tetsuo Handa wrote:
> [...]
> > I like the OOM reaper approach but I can't agree on merging the OOM reaper
> > without providing a guaranteed last resort at the same time. If you do want
> > to start the OOM reaper as simple as possible (without being bothered by
> > a lot of possible corner cases), please pursue a guaranteed last resort
> > at the same time.
> 
> I am getting tired of this level of argumentation. oom_reaper in its
> current form is a step forward. I have acknowledged there are possible
> improvements doable on top but I do not see them as necessary for the core
> part being merged. I am not trying to rush this in because I am very
> well aware of how subtle and complex all the interactions might be.
> So please stop your "we must have it all at once" attitude. This is
> nothing we have to rush in. We are not talking about a regression which
> has to be absolutely fixed in a few days.

I'm not asking you to merge a perfect version of oom_reaper from the
beginning. I know it is too difficult. Instead, I'm asking you to allow
using timeout-based approaches (shown below) as a temporary workaround
because there are environments which cannot wait for oom_reaper to become
reliable enough. Would you please reply to the thread which proposed a
guaranteed last resort (shown below)?

Tetsuo Handa wrote:
> I consider phases for managing system-wide OOM events as follows.
> 
>   (1) Design and use a system with appropriate memory capacity in mind.
> 
>   (2) When (1) failed, the OOM killer is invoked. The OOM killer selects
>       an OOM victim and allows that victim access to memory reserves by
>       setting TIF_MEMDIE on it.
> 
>   (3) When (2) did not solve the OOM condition, start allowing all tasks
>       access to memory reserves by your approach.
> 
>   (4) When (3) did not solve the OOM condition, start selecting more OOM
>       victims by my approach.
> 
>   (5) When (4) did not solve the OOM condition, trigger the kernel panic.
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/2] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-01-11 12:42   ` Michal Hocko
@ 2016-01-28 22:33     ` Michal Hocko
  -1 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2016-01-28 22:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Mel Gorman, Tetsuo Handa, David Rientjes,
	Linus Torvalds, Oleg Nesterov, Hugh Dickins, Andrea Argangeli,
	Rik van Riel, linux-mm, LKML

I have missed one important point. exit_oom_victim has to do
test_and_clear to make sure we do not race now with this patch. So we
need to fold the following into the patch
---
From 296166b049ceb8b3199019a24aebab032998ebb0 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 28 Jan 2016 23:27:26 +0100
Subject: [PATCH] 
 oom-clear-tif_memdie-after-oom_reaper-managed-to-unmap-the-address-space-fix

Now that exit_oom_victim might be called on a remote task from
__oom_reap_task, we have to check and clear the flag atomically;
otherwise we might race and underflow oom_victims or wake up
waiters too early.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7209e517adf2..8f5488345c42 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -603,7 +603,8 @@ void mark_oom_victim(struct task_struct *tsk)
  */
 void exit_oom_victim(struct task_struct *tsk)
 {
-	clear_tsk_thread_flag(tsk, TIF_MEMDIE);
+	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
+		return;
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
-- 
2.7.0.rc3

-- 
Michal Hocko
SUSE Labs
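
For readers who want to see why the test-and-clear matters, here is a
minimal userspace sketch using C11 atomics. It is not the kernel code:
fake_victim, mark_victim and exit_victim are hypothetical names, and an
atomic_bool stands in for TIF_MEMDIE. The assumption illustrated is that
two contexts (the exiting task and the reaper) may both try to clear the
flag, and only the one that actually clears it may drop the counter.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int oom_victims;	/* number of tasks currently marked */

/* Hypothetical stand-in for a task; memdie models TIF_MEMDIE. */
struct fake_victim {
	atomic_bool memdie;
};

static void mark_victim(struct fake_victim *t)
{
	/* Count the task only on the false -> true transition. */
	if (!atomic_exchange(&t->memdie, true))
		atomic_fetch_add(&oom_victims, 1);
}

/* Returns true if this call was the one that cleared the flag. */
static bool exit_victim(struct fake_victim *t)
{
	/* Atomic test-and-clear: a second caller sees false and backs off. */
	if (!atomic_exchange(&t->memdie, false))
		return false;

	atomic_fetch_sub(&oom_victims, 1);
	return true;
}
```

With a plain clear followed by an unconditional decrement, two callers
would both decrement and the counter would go negative.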

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/2] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-01-28 22:26             ` Tetsuo Handa
@ 2016-01-28 22:36               ` Michal Hocko
  -1 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2016-01-28 22:36 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, hannes, mgorman, rientjes, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

On Fri 29-01-16 07:26:39, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 28-01-16 20:24:36, Tetsuo Handa wrote:
> > [...]
> > > I like the OOM reaper approach but I can't agree on merging the OOM reaper
> > > without providing a guaranteed last resort at the same time. If you do want
> > > to start the OOM reaper as simple as possible (without being bothered by
> > > a lot of possible corner cases), please pursue a guaranteed last resort
> > > at the same time.
> > 
> > I am getting tired of this level of argumentation. oom_reaper in its
> > current form is a step forward. I have acknowledged there are possible
> > improvements doable on top but I do not see them as necessary for the core
> > part being merged. I am not trying to rush this in because I am very
> > well aware of how subtle and complex all the interactions might be.
> > So please stop your "we must have it all at once" attitude. This is
> > nothing we have to rush in. We are not talking about a regression which
> > has to be absolutely fixed in a few days.
> 
> I'm not asking you to merge a perfect version of oom_reaper from the
> beginning. I know it is too difficult. Instead, I'm asking you to allow
> using timeout-based approaches (shown below) as a temporary workaround
> because there are environments which cannot wait for oom_reaper to become
> reliable enough. Would you please reply to the thread which proposed a
> guaranteed last resort (shown below)?

I really fail to see why you have to bring that up in this particular
thread or in any other oom related discussion. I didn't get to read
through that discussion and form my opinion yet.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-01-28 21:42       ` Michal Hocko
@ 2016-02-02  3:02         ` David Rientjes
  -1 siblings, 0 replies; 56+ messages in thread
From: David Rientjes @ 2016-02-02  3:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Thu, 28 Jan 2016, Michal Hocko wrote:

> [...]
> > > +static bool __oom_reap_vmas(struct mm_struct *mm)
> > > +{
> > > +	struct mmu_gather tlb;
> > > +	struct vm_area_struct *vma;
> > > +	struct zap_details details = {.check_swap_entries = true,
> > > +				      .ignore_dirty = true};
> > > +	bool ret = true;
> > > +
> > > +	/* We might have raced with exit path */
> > > +	if (!atomic_inc_not_zero(&mm->mm_users))
> > > +		return true;
> > > +
> > > +	if (!down_read_trylock(&mm->mmap_sem)) {
> > > +		ret = false;
> > > +		goto out;
> > > +	}
> > > +
> > > +	tlb_gather_mmu(&tlb, mm, 0, -1);
> > > +	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > > +		if (is_vm_hugetlb_page(vma))
> > > +			continue;
> > > +
> > > +		/*
> > > +		 * mlocked VMAs require explicit munlocking before unmap.
> > > +		 * Let's keep it simple here and skip such VMAs.
> > > +		 */
> > > +		if (vma->vm_flags & VM_LOCKED)
> > > +			continue;
> > 
> > Shouldn't there be VM_PFNMAP handling here?
> 
> What would be the reason to exclude them?
> 

Not exclude them, but I would have expected untrack_pfn().

> > I'm wondering why zap_page_range() for vma->vm_start to vma->vm_end wasn't 
> > used here for simplicity?
> 
> I didn't use zap_page_range because I wanted to have full control over
> what gets torn down and how. E.g. it is much easier to skip over
> hugetlb pages than to rely on i_mmap_lock_write, which might be blocked
> and leave the whole oom_reaper stuck.
> 

Let me be clear that I think the implementation is fine, minus the missing 
handling for VM_PFNMAP.  However, I think this implementation is better 
placed into mm/memory.c to do the iteration, selection criteria, and then 
unmap_page_range().  I don't think we should be exposing 
unmap_page_range() globally, but rather add a new function to do the 
iteration in mm/memory.c with the others.

> [...]
> > > +static void wake_oom_reaper(struct mm_struct *mm)
> > > +{
> > > +	struct mm_struct *old_mm;
> > > +
> > > +	if (!oom_reaper_th)
> > > +		return;
> > > +
> > > +	/*
> > > +	 * Pin the given mm. Use mm_count instead of mm_users because
> > > +	 * we do not want to delay the address space tear down.
> > > +	 */
> > > +	atomic_inc(&mm->mm_count);
> > > +
> > > +	/*
> > > +	 * Make sure that only a single mm is ever queued for the reaper
> > > +	 * because multiple are not necessary and the operation might be
> > > +	 * disruptive so better reduce it to the bare minimum.
> > > +	 */
> > > +	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
> > > +	if (!old_mm)
> > > +		wake_up(&oom_reaper_wait);
> > > +	else
> > > +		mmdrop(mm);
> > 
> > This behavior is probably the only really significant concern I have about 
> > the patch: we just drop the mm and don't try any reaping if there is 
> > already reaping in progress.
> 
> This is based on the assumption that OOM killer will not select another
> task to kill until the previous one drops its TIF_MEMDIE. Should this
> change in the future we will have to come up with a queuing mechanism. I
> didn't want to do it right away to make the change as simple as
> possible.
> 

The problem is that this is racy and quite easy to trigger: imagine if 
__oom_reap_vmas() finds mm->mm_users == 0, because the memory of the 
victim has been freed, and then another system-wide oom condition occurs 
before the oom reaper's mm_to_reap has been set to NULL.  No 
synchronization prevents that from happening (not sure what the reference 
to TIF_MEMDIE is about).

In this case, the oom reaper has ignored the next victim and doesn't do 
anything; the simple race has prevented it from zapping memory and does 
not reduce the livelock probability.

This can be solved either by queueing mm's to reap or involving the oom 
reaper into the oom killer synchronization itself.
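
The race described above can be sketched in userspace with C11 atomics.
This is only an illustration under stated assumptions: fake_mm,
wake_reaper_single_slot and reaper_done are hypothetical names, and the
compare-exchange models the cmpxchg(&mm_to_reap, NULL, mm) from the
patch. The slot stays occupied until the reaper clears it, so a second
victim arriving in that window is dropped, even if the first mm had
nothing left to reap.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mm_struct. */
struct fake_mm {
	int id;
};

/* Single slot, as in the patch; NULL (the static default) means idle. */
static _Atomic(struct fake_mm *) mm_to_reap;

/*
 * Returns true if mm was handed to the reaper, false if the slot was
 * still busy and the request was dropped without any reaping.
 */
static bool wake_reaper_single_slot(struct fake_mm *mm)
{
	struct fake_mm *expected = NULL;

	return atomic_compare_exchange_strong(&mm_to_reap, &expected, mm);
}

/* The reaper clears the slot only once it is done with the current mm. */
static void reaper_done(void)
{
	atomic_store(&mm_to_reap, NULL);
}
```

A queue of pending mm's, or making the OOM killer wait for the slot to
clear, would both close the window; the sketch takes no position on
which is preferable.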

> > > +static int __init oom_init(void)
> > > +{
> > > +	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> > > +	if (IS_ERR(oom_reaper_th)) {
> > > +		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
> > > +				PTR_ERR(oom_reaper_th));
> > > +		oom_reaper_th = NULL;
> > > +	} else {
> > > +		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
> > > +
> > > +		/*
> > > +		 * Make sure our oom reaper thread will get scheduled ASAP
> > > +		 * and that it won't get preempted by malicious userspace.
> > > +		 */
> > > +		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
> > 
> > Eeek, do you really show this is necessary?  I would imagine that we would 
> > want to limit high priority processes system-wide and that we wouldn't 
> > want to be interfered with by memcg oom conditions that trigger the oom 
> > reaper, for example.
> 
> The idea was that we do not want to allow a high priority userspace to
> preempt this important operation. I do understand your concern about the
> memcg oom interference but I find it more important that oom_reaper is
> runnable when needed. I guess that memcg oom heavy loads can change the
> priority from userspace if necessary?
> 

I'm baffled by any reference to "memcg oom heavy loads", I don't 
understand this paragraph, sorry.  If a memcg is oom, we shouldn't be
disrupting the global runqueue by running oom_reaper at a high priority.  
The disruption itself is not only in the first wakeup but also in how long the 
reaper can run and when it is rescheduled: for a lot of memory this is 
potentially long.  The reaper is best-effort, as the changelog indicates, 
and we shouldn't have a reliance on this high priority: oom kill exiting 
can't possibly be expected to be immediate.  This high priority should be 
removed so memcg oom conditions are isolated and don't affect other loads.

"Memcg oom heavy loads" cannot always be determined and the suggested fix 
cannot possibly be to adjust the priority of a global resource.  ??

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-02-02  3:02         ` David Rientjes
@ 2016-02-02  8:57           ` Michal Hocko
  -1 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2016-02-02  8:57 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Mon 01-02-16 19:02:06, David Rientjes wrote:
> On Thu, 28 Jan 2016, Michal Hocko wrote:
> 
> > [...]
> > > > +static bool __oom_reap_vmas(struct mm_struct *mm)
> > > > +{
> > > > +	struct mmu_gather tlb;
> > > > +	struct vm_area_struct *vma;
> > > > +	struct zap_details details = {.check_swap_entries = true,
> > > > +				      .ignore_dirty = true};
> > > > +	bool ret = true;
> > > > +
> > > > +	/* We might have raced with exit path */
> > > > +	if (!atomic_inc_not_zero(&mm->mm_users))
> > > > +		return true;
> > > > +
> > > > +	if (!down_read_trylock(&mm->mmap_sem)) {
> > > > +		ret = false;
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	tlb_gather_mmu(&tlb, mm, 0, -1);
> > > > +	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > > > +		if (is_vm_hugetlb_page(vma))
> > > > +			continue;
> > > > +
> > > > +		/*
> > > > +		 * mlocked VMAs require explicit munlocking before unmap.
> > > > +		 * Let's keep it simple here and skip such VMAs.
> > > > +		 */
> > > > +		if (vma->vm_flags & VM_LOCKED)
> > > > +			continue;
> > > 
> > > Shouldn't there be VM_PFNMAP handling here?
> > 
> > What would be the reason to exclude them?
> > 
> 
> Not exclude them, but I would have expected untrack_pfn().

My understanding is that vm_normal_page will do the right thing for
those mappings - especially for CoW VM_PFNMAP which are normal pages
AFAIU. Wrt. untrack_pfn, I was relying on the victim eventually
entering exit_mmap and doing the remaining housekeeping. Maybe I am
missing something but untrack_pfn shouldn't lead to releasing a
considerable amount of memory. So is this really necessary, or can we
wait for exit_mmap?

> > > I'm wondering why zap_page_range() for vma->vm_start to vma->vm_end wasn't 
> > > used here for simplicity?
> > 
> > I didn't use zap_page_range because I wanted to have a full control over
> > what and how gets torn down. E.g. it is much more easier to skip over
> > hugetlb pages than relying on i_mmap_lock_write which might be blocked
> > and the whole oom_reaper will get stuck.
> > 
> 
> Let me be clear that I think the implementation is fine, minus the missing 
> handling for VM_PFNMAP.  However, I think this implementation is better 
> placed into mm/memory.c to do the iteration, selection criteria, and then 
> unmap_page_range().  I don't think we should be exposing 
> unmap_page_range() globally, but rather add a new function to do the 
> iteration in mm/memory.c with the others.

I do not have any objections to moving the code but I felt this is a
single purpose thingy which doesn't need a wider exposure. The exclusion
criteria are tightly coupled to what the oom reaper is allowed to do. In
other words such a function wouldn't be reusable for say MADV_DONTNEED
because it has different criteria. Having all the selection criteria
close to __oom_reap_task on the other hand makes it easier to evaluate
their relevance. So I am not really convinced. I can move it if you feel
strongly about that, though.
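
As an illustration of how the exclusion criteria sit right next to the
reaping loop, here is a minimal user-space sketch in C. It is a model
only, not kernel code: struct model_vma, the MODEL_VM_* bits,
model_can_reap() and model_reap() are hypothetical names, and the flag
values are made up for the example. It mirrors just the two skip rules
from the quoted __oom_reap_vmas() hunk (hugetlb and VM_LOCKED).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag bits for this model only; not the kernel's values. */
enum model_vm_flags {
	MODEL_VM_LOCKED  = 1u << 0,	/* stands in for VM_LOCKED */
	MODEL_VM_HUGETLB = 1u << 1,	/* stands in for is_vm_hugetlb_page() */
};

/* Hypothetical, stripped-down stand-in for struct vm_area_struct. */
struct model_vma {
	unsigned int flags;
	size_t pages;			/* pages mapped by this area */
	struct model_vma *next;
};

/* The reaper-specific selection criteria, kept in one place. */
static bool model_can_reap(const struct model_vma *vma)
{
	assert(vma != NULL);
	if (vma->flags & MODEL_VM_HUGETLB)
		return false;		/* skipped, as in the quoted hunk */
	if (vma->flags & MODEL_VM_LOCKED)
		return false;		/* would need munlock first */
	return true;
}

/* Walk the list and count the pages the model would tear down. */
static size_t model_reap(const struct model_vma *head)
{
	size_t reaped = 0;

	for (const struct model_vma *vma = head; vma; vma = vma->next)
		if (model_can_reap(vma))
			reaped += vma->pages;
	return reaped;
}
```

A caller with different rules (say, an MADV_DONTNEED-like path) would
need its own predicate in place of model_can_reap(), which is the
coupling described above.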

> > [...]
> > > > +static void wake_oom_reaper(struct mm_struct *mm)
> > > > +{
> > > > +	struct mm_struct *old_mm;
> > > > +
> > > > +	if (!oom_reaper_th)
> > > > +		return;
> > > > +
> > > > +	/*
> > > > +	 * Pin the given mm. Use mm_count instead of mm_users because
> > > > +	 * we do not want to delay the address space tear down.
> > > > +	 */
> > > > +	atomic_inc(&mm->mm_count);
> > > > +
> > > > +	/*
> > > > +	 * Make sure that only a single mm is ever queued for the reaper
> > > > +	 * because multiple are not necessary and the operation might be
> > > > +	 * disruptive so better reduce it to the bare minimum.
> > > > +	 */
> > > > +	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
> > > > +	if (!old_mm)
> > > > +		wake_up(&oom_reaper_wait);
> > > > +	else
> > > > +		mmdrop(mm);
> > > 
> > > This behavior is probably the only really significant concern I have about 
> > > the patch: we just drop the mm and don't try any reaping if there is 
> > > already reaping in progress.
> > 
> > This is based on the assumption that OOM killer will not select another
> > task to kill until the previous one drops its TIF_MEMDIE. Should this
> > change in the future we will have to come up with a queuing mechanism. I
> > didn't want to do it right away to make the change as simple as
> > possible.
> > 
> 
> The problem is that this is racy and quite easy to trigger: imagine if 
> __oom_reap_vmas() finds mm->mm_users == 0, because the memory of the 
> victim has been freed, and then another system-wide oom condition occurs 
> before the oom reaper's mm_to_reap has been set to NULL.

Yes, I realize this is potentially racy. I just didn't consider the race
important enough to justify task queuing in the first submission. Tetsuo
was pushing for this already and I tried to push back for the sake of
simplicity. But ohh well... I will queue up a patch to do this
on top. I plan to repost the full patchset shortly.

> No synchronization prevents that from happening (not sure what the
> reference to TIF_MEMDIE is about).

Now that I am reading my response again I see how it could be
misleading. I was referring to the possibility of choosing multiple oom
victims, which was discussed recently. I didn't mean that TIF_MEMDIE
provides the oom reaper vs. exit exclusion.

> In this case, the oom reaper has ignored the next victim and doesn't do 
> anything; the simple race has prevented it from zapping memory and does 
> not reduce the livelock probability.
> 
> This can be solved either by queueing mm's to reap or involving the oom 
> reaper into the oom killer synchronization itself.

As we have already discussed, calling the oom reaper from the direct
OOM context is really tricky. I will go with queuing.
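
To make the difference between the two schemes concrete, here is a
small user-space sketch in C11. It is only a model under stated
assumptions: struct model_mm, slot_submit(), slot_take(),
queue_submit() and queue_take() are hypothetical names, the slot
imitates the cmpxchg(&mm_to_reap, NULL, mm) pattern from the quoted
hunk, and the queue is a plain single-threaded FIFO (a real version
would need locking and mm reference counting, which are left out).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an mm handed to the reaper. */
struct model_mm {
	int id;
	struct model_mm *next;
};

/* Single-slot scheme: at most one mm can be pending at a time. */
static _Atomic(struct model_mm *) slot;

/* Returns false when the slot is busy, i.e. the victim is dropped. */
static bool slot_submit(struct model_mm *mm)
{
	struct model_mm *expected = NULL;

	assert(mm != NULL);
	return atomic_compare_exchange_strong(&slot, &expected, mm);
}

/* Reaper side: take whatever is pending and free the slot. */
static struct model_mm *slot_take(void)
{
	return atomic_exchange(&slot, (struct model_mm *)NULL);
}

/* Queue scheme: every submitted mm is kept until the reaper gets to it. */
struct model_queue {
	struct model_mm *head, *tail;
};

static void queue_submit(struct model_queue *q, struct model_mm *mm)
{
	mm->next = NULL;
	if (q->tail)
		q->tail->next = mm;
	else
		q->head = mm;
	q->tail = mm;
}

static struct model_mm *queue_take(struct model_queue *q)
{
	struct model_mm *mm = q->head;

	if (mm) {
		q->head = mm->next;
		if (!q->head)
			q->tail = NULL;
	}
	return mm;
}
```

With the slot, a second victim submitted before the reaper clears the
slot is simply lost; with the queue it is reaped later, in order.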
 
> > > > +static int __init oom_init(void)
> > > > +{
> > > > +	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> > > > +	if (IS_ERR(oom_reaper_th)) {
> > > > +		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
> > > > +				PTR_ERR(oom_reaper_th));
> > > > +		oom_reaper_th = NULL;
> > > > +	} else {
> > > > +		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
> > > > +
> > > > +		/*
> > > > +		 * Make sure our oom reaper thread will get scheduled when
> > > > +		 * ASAP and that it won't get preempted by malicious userspace.
> > > > +		 */
> > > > +		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
> > > 
> > > Eeek, do you really show this is necessary?  I would imagine that we would 
> > > want to limit high priority processes system-wide and that we wouldn't 
> > > want to be interferred with by memcg oom conditions that trigger the oom 
> > > reaper, for example.
> > 
> > The idea was that we do not want to allow a high priority userspace to
> > preempt this important operation. I do understand your concern about the
> > memcg oom interference but I find it more important that oom_reaper is
> > runnable when needed. I guess that memcg oom heavy loads can change the
> > priority from userspace if necessary?
> > 
> 
> I'm baffled by any reference to "memcg oom heavy loads", I don't 
> understand this paragraph, sorry.  If a memcg is oom, we shouldn't be
> disrupting the global runqueue by running oom_reaper at a high priority.  
> The disruption itself is not only in first wakeup but also in how long the 
> reaper can run and when it is rescheduled: for a lot of memory this is 
> potentially long.  The reaper is best-effort, as the changelog indicates, 
> and we shouldn't have a reliance on this high priority: oom kill exiting 
> can't possibly be expected to be immediate.  This high priority should be 
> removed so memcg oom conditions are isolated and don't affect other loads.

If this is a concern then I would be tempted to simply disable oom
reaper for memcg oom altogether. For me it is much more important that
the reaper, even though a best effort, is guaranteed to schedule if
something goes terribly wrong on the machine.

Is this acceptable?

Thanks
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-02-02  8:57           ` Michal Hocko
@ 2016-02-02 11:48             ` Tetsuo Handa
  -1 siblings, 0 replies; 56+ messages in thread
From: Tetsuo Handa @ 2016-02-02 11:48 UTC (permalink / raw)
  To: mhocko, rientjes
  Cc: akpm, mgorman, penguin-kernel, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

Michal Hocko wrote:
> > In this case, the oom reaper has ignored the next victim and doesn't do 
> > anything; the simple race has prevented it from zapping memory and does 
> > not reduce the livelock probability.
> > 
> > This can be solved either by queueing mm's to reap or involving the oom 
> > reaper into the oom killer synchronization itself.
> 
> as we have already discussed previously oom reaper is really tricky to
> be called from the direct OOM context. I will go with queuing. 
>  

OK. But it is not easy to build a reliable OOM-reap queuing chain. I think
that a dedicated kernel thread which does both the OOM-kill operation and
the OOM-reap operation will be needed. That will also handle the "sleeping
for too long with oom_lock held after sending SIGKILL" problem.

> > I'm baffled by any reference to "memcg oom heavy loads", I don't 
> > understand this paragraph, sorry.  If a memcg is oom, we shouldn't be
> > disrupting the global runqueue by running oom_reaper at a high priority.  
> > The disruption itself is not only in first wakeup but also in how long the 
> > reaper can run and when it is rescheduled: for a lot of memory this is 
> > potentially long.  The reaper is best-effort, as the changelog indicates, 
> > and we shouldn't have a reliance on this high priority: oom kill exiting 
> > can't possibly be expected to be immediate.  This high priority should be 
> > removed so memcg oom conditions are isolated and don't affect other loads.
> 
> If this is a concern then I would be tempted to simply disable oom
> reaper for memcg oom altogether. For me it is much more important that
> the reaper, even though a best effort, is guaranteed to schedule if
> something goes terribly wrong on the machine.

I think that if something goes terribly wrong on the machine, a guarantee for
scheduling the reaper will not help unless we build a reliable queuing chain.
Building a reliable queuing chain will break some of the assumptions provided
by the current behavior. For me, a guarantee for scheduling the next OOM-kill
operation (with globally opening some or all of the memory reserves) before
building a reliable queuing chain is much more important.

>                       But ohh well... I will queue up a patch to do this
> on top. I plan to repost the full patchset shortly.

Maybe we all agree with introducing the OOM reaper without queuing, but I do
want to see a guarantee for scheduling the next OOM-kill operation before
trying to build a reliable queuing chain.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-02-02  8:57           ` Michal Hocko
@ 2016-02-02 22:51             ` David Rientjes
  -1 siblings, 0 replies; 56+ messages in thread
From: David Rientjes @ 2016-02-02 22:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Tue, 2 Feb 2016, Michal Hocko wrote:

> > Not exclude them, but I would have expected untrack_pfn().
> 
> My understanding is that vm_normal_page will do the right thing for
> those mappings - especially for CoW VM_PFNMAP which are normal pages
> AFAIU. Wrt. to untrack_pfn I was relying that the victim will eventually
> enter exit_mmap and do the remaining house keepining. Maybe I am missing
> something but untrack_pfn shouldn't lead to releasing a considerable
> amount of memory. So is this really necessary or we can wait for
> exit_mmap?
> 

I think if you move the code to mm/memory.c that you may find a greater 
opportunity to share code with the implementations there and this will 
take care of itself :)  I'm concerned about this also from a 
maintainability standpoint where a future patch might modify one 
implementation while forgetting about the other.  I think there's a great 
opportunity here for a really clean and shiny interface that doesn't 
introduce any more complexity.

> > The problem is that this is racy and quite easy to trigger: imagine if 
> > __oom_reap_vmas() finds mm->mm_users == 0, because the memory of the 
> > victim has been freed, and then another system-wide oom condition occurs 
> > before the oom reaper's mm_to_reap has been set to NULL.
> 
> Yes I realize this is potentially racy. I just didn't consider the race
> important enough to justify task queuing in the first submission. Tetsuo
> was pushing for this already and I tried to push back for simplicity in
> the first submission. But ohh well... I will queue up a patch to do this
> on top. I plan to repost the full patchset shortly.
> 

Ok, thanks!  It should probably be dropped from -mm in the interim until 
it has some acked-by's, but I think those will come pretty quickly once 
it's refreshed if all of this is handled.

> > In this case, the oom reaper has ignored the next victim and doesn't do 
> > anything; the simple race has prevented it from zapping memory and does 
> > not reduce the livelock probability.
> > 
> > This can be solved either by queueing mm's to reap or involving the oom 
> > reaper into the oom killer synchronization itself.
> 
> as we have already discussed previously oom reaper is really tricky to
> be called from the direct OOM context. I will go with queuing. 
>  

Hmm, I wasn't referring to oom context: it would be possible without 
queueing with an mm_to_reap_lock (or cmpxchg) in the oom reaper and when 
the final mmput() is done.  Set it when the mm is ready for reaping, clear 
it when the mm is being destroyed, and test it before calling the oom 
killer.  I think we'd want to defer the oom killer until potential reaping 
could be done anyway and I don't anticipate an issue where oom_reaper 
fails to schedule.
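
A rough user-space sketch of that set/clear/test idea, in C11, might
look like the following. Everything here is an assumption made for
illustration: reap_pending, mark_ready_for_reaping(),
mark_mm_destroyed() and oom_kill_allowed() are hypothetical names, and
the model ignores which mm the flag belongs to as well as all the
locking around the real oom killer.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Set while an mm is waiting to be reaped; cleared when it is destroyed. */
static atomic_bool reap_pending;

/*
 * Called when a victim's mm becomes ready for reaping.
 * Returns false if another mm is already pending.
 */
static bool mark_ready_for_reaping(void)
{
	bool expected = false;

	return atomic_compare_exchange_strong(&reap_pending, &expected, true);
}

/* Called from the model's "final mmput()" path. */
static void mark_mm_destroyed(void)
{
	atomic_store(&reap_pending, false);
}

/* Tested before invoking the oom killer: defer while reaping may help. */
static bool oom_kill_allowed(void)
{
	return !atomic_load(&reap_pending);
}
```

The point of the test in oom_kill_allowed() is that no new victim gets
selected, and hence none gets ignored by the reaper, while a previous
one is still pending.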

> > I'm baffled by any reference to "memcg oom heavy loads", I don't 
> > understand this paragraph, sorry.  If a memcg is oom, we shouldn't be
> > disrupting the global runqueue by running oom_reaper at a high priority.  
> > The disruption itself is not only in first wakeup but also in how long the 
> > reaper can run and when it is rescheduled: for a lot of memory this is 
> > potentially long.  The reaper is best-effort, as the changelog indicates, 
> > and we shouldn't have a reliance on this high priority: oom kill exiting 
> > can't possibly be expected to be immediate.  This high priority should be 
> > removed so memcg oom conditions are isolated and don't affect other loads.
> 
> If this is a concern then I would be tempted to simply disable oom
> reaper for memcg oom altogether. For me it is much more important that
> the reaper, even though a best effort, is guaranteed to schedule if
> something goes terribly wrong on the machine.
> 

I don't believe the higher priority guarantees it is able to schedule any 
more than it was guaranteed to schedule before.  It will run, but it won't 
preempt other innocent processes in disjoint memcgs or cpusets.  It's not 
only a memcg issue, but it also impacts disjoint cpuset mems and mempolicy 
nodemasks.  I think it would be disappointing to leave those out.  I think 
the higher priority should simply be removed in terms of fairness.

Other than these issues, I don't see any reason why a refreshed series 
wouldn't be immediately acked.  Thanks very much for continuing to work on 
this!

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-02-02 11:48             ` Tetsuo Handa
@ 2016-02-02 22:55               ` David Rientjes
  -1 siblings, 0 replies; 56+ messages in thread
From: David Rientjes @ 2016-02-02 22:55 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, akpm, mgorman, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel

On Tue, 2 Feb 2016, Tetsuo Handa wrote:

> Maybe we all agree with introducing OOM reaper without queuing, but I do
> want to see a guarantee for scheduling for next OOM-kill operation before
> trying to build a reliable queuing chain.
> 

The race can be fixed in two ways which I've already enumerated, but the 
scheduling issue is tangential: the oom_reaper kthread is going to run; 
increasing its priority will only interfere with other innocent processes 
that are not attached to the oom memcg hierarchy, have disjoint cpuset 
mems, or are happily allocating from mempolicy nodes with free memory.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-02-02 22:51             ` David Rientjes
@ 2016-02-03 10:31               ` Tetsuo Handa
  -1 siblings, 0 replies; 56+ messages in thread
From: Tetsuo Handa @ 2016-02-03 10:31 UTC (permalink / raw)
  To: rientjes, mhocko, hannes
  Cc: akpm, mgorman, torvalds, oleg, hughd, andrea, riel, linux-mm,
	linux-kernel

David Rientjes wrote:
> On Tue, 2 Feb 2016, Michal Hocko wrote:
> > > I'm baffled by any reference to "memcg oom heavy loads", I don't 
> > > understand this paragraph, sorry.  If a memcg is oom, we shouldn't be
> > > disrupting the global runqueue by running oom_reaper at a high priority.  
> > > The disruption itself is not only in first wakeup but also in how long the 
> > > reaper can run and when it is rescheduled: for a lot of memory this is 
> > > potentially long.  The reaper is best-effort, as the changelog indicates, 
> > > and we shouldn't have a reliance on this high priority: oom kill exiting 
> > > can't possibly be expected to be immediate.  This high priority should be 
> > > removed so memcg oom conditions are isolated and don't affect other loads.
> > 
> > If this is a concern then I would be tempted to simply disable oom
> > reaper for memcg oom altogether. For me it is much more important that
> > the reaper, even though a best effort, is guaranteed to schedule if
> > something goes terribly wrong on the machine.
> > 
> 
> I don't believe the higher priority guarantees it is able to schedule any 
> more than it was guaranteed to schedule before.  It will run, but it won't 
> preempt other innocent processes in disjoint memcgs or cpusets.  It's not 
> only a memcg issue, but it also impacts disjoint cpuset mems and mempolicy 
> nodemasks.  I think it would be disappointing to leave those out.  I think 
> the higher priority should simply be removed in terms of fairness.
> 
> Other than these issues, I don't see any reason why a refreshed series 
> wouldn't be immediately acked.  Thanks very much for continuing to work on 
> this!
> 

Excuse me, but I have come to think that we should try to wake up the OOM reaper at

    if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
        if (!is_sysrq_oom(oc))
            return OOM_SCAN_ABORT;
    }

in oom_scan_process_thread() rather than at oom_kill_process() or at
mark_oom_victim(). Waking up the OOM reaper there makes it try to reap
task->mm and eventually give up, which in turn naturally allows the
OOM killer to choose the next OOM victim. The key point is PATCH 2/5 shown
below. What do you think?
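
The decision the scan loop ends up making for a TIF_MEMDIE task, once patches 2/5, 4/5 and 5/5 below are all applied, can be modelled as a pure function. This is only a sketch under that reading of the patches; `struct reap_state`, `enum scan_decision` and `scan_memdie_task()` are hypothetical names that do not appear in the series:

```c
#include <stdbool.h>

/* What the scan loop would observe when it meets a TIF_MEMDIE task. */
struct reap_state {
	bool reaper_available;   /* the oom_reaper kthread was started */
	bool reap_in_progress;   /* task_to_reap is currently non-NULL */
	bool victim_has_memdie;  /* TIF_MEMDIE is still set on the task */
	bool timer_pending;      /* the victim wait timeout has not expired */
};

enum scan_decision { SCAN_ABORT_AND_WAIT, SCAN_SELECT_NEXT };

static enum scan_decision scan_memdie_task(const struct reap_state *s)
{
	/* No reaper, or reaper busy: wait only until the timeout expires. */
	if (!s->reaper_available || s->reap_in_progress)
		return s->timer_pending ? SCAN_ABORT_AND_WAIT
					: SCAN_SELECT_NEXT;

	/* The reaper already finished with this task and cleared the flag. */
	if (!s->victim_has_memdie)
		return SCAN_SELECT_NEXT;

	/* Otherwise the task is handed to the reaper and the scan aborts. */
	return SCAN_ABORT_AND_WAIT;
}
```

SCAN_ABORT_AND_WAIT corresponds to try_oom_reap() returning true (the caller returns OOM_SCAN_ABORT); SCAN_SELECT_NEXT corresponds to it returning false, so the victim selection loop moves on.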

PATCH 1/5 is (I think) a bug fix.
PATCH 2/5 is for waking up the OOM reaper from the victim selection loop.
PATCH 3/5 is for helping the OOM killer to choose the next OOM victim.
PATCH 4/5 is for handling corner cases.
PATCH 5/5 is for changing the OOM reaper to use default priority.
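
The score adjustment that PATCH 3/5 applies after a reap attempt reduces to a three-way rule. A minimal standalone sketch, assuming the usual OOM_SCORE_ADJ_MIN/MAX values of -1000 and 1000; `next_oom_score_adj()` is a hypothetical helper that returns the new value instead of writing tsk->signal->oom_score_adj:

```c
#include <assert.h>
#include <stdbool.h>

#define OOM_SCORE_ADJ_MIN (-1000)
#define OOM_SCORE_ADJ_MAX 1000

/*
 * Returns the oom_score_adj a victim would get after a reap attempt,
 * given its current value and whether the reap succeeded.
 */
static int next_oom_score_adj(int adj, bool reap_success)
{
	assert(adj >= OOM_SCORE_ADJ_MIN && adj <= OOM_SCORE_ADJ_MAX);

	/* Reaped: the task no longer holds reclaimable memory. */
	if (reap_success)
		return OOM_SCORE_ADJ_MIN;

	/* First failure: almost unkillable, so other tasks are tried first. */
	if (adj > OOM_SCORE_ADJ_MIN + 1)
		return OOM_SCORE_ADJ_MIN + 1;

	/* Failed again while already almost unkillable: exclude it entirely. */
	return OOM_SCORE_ADJ_MIN;
}
```

For example, a victim at 0 whose reap fails drops to -999, and a second failure takes it to -1000, after which select_bad_process() no longer considers it.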

 include/linux/oom.h |    3
 mm/oom_kill.c       |  173 ++++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 136 insertions(+), 40 deletions(-)

----------------------------------------
From e1c0a78fbfd0a76f367efac269cbcf22c7df9292 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 3 Feb 2016 14:18:19 +0900
Subject: [PATCH 1/5] mm,oom: Fix incorrect oom_task_origin check.

Currently, the OOM killer unconditionally selects p if oom_task_origin(p)
is true, but p should not be OOM-killed if p is marked as OOM-unkillable.

This patch does not fix the race condition where p is selected because
an OOM event, triggered for some reason other than operations by p,
happens to occur between p calling set_current_oom_origin() and p
actually starting the operations that might trigger an OOM event.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/oom.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 45993b8..59481e6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -67,7 +67,8 @@ static inline void clear_current_oom_origin(void)
 
 static inline bool oom_task_origin(const struct task_struct *p)
 {
-	return !!(p->signal->oom_flags & OOM_FLAG_ORIGIN);
+	return (p->signal->oom_flags & OOM_FLAG_ORIGIN) &&
+		p->signal->oom_score_adj != OOM_SCORE_ADJ_MIN;
 }
 
 extern void mark_oom_victim(struct task_struct *tsk);
-- 
1.8.3.1

From 76cf60d33e4e1daa475e4c1e39087415a309c6e9 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 3 Feb 2016 14:20:07 +0900
Subject: [PATCH 2/5] mm,oom: Change timing of waking up the OOM reaper

Currently, the OOM reaper kernel thread is woken up when we set TIF_MEMDIE
on a task. But it is not easy to build a reliable OOM-reap queuing chain.

Since the OOM livelock problem occurs when we find TIF_MEMDIE on a task
which cannot terminate, waking up the OOM reaper when we find TIF_MEMDIE
on a task can simplify handling of the chain. Also, we don't need to wake
up the OOM reaper if the victim can smoothly terminate. Therefore, this
patch replaces wake_oom_reaper() called from oom_kill_process() with
try_oom_reap() called from oom_scan_process_thread().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 79 insertions(+), 20 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b51bcce..07c6389 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -268,6 +268,8 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc,
 }
 #endif
 
+static bool try_oom_reap(struct task_struct *tsk);
+
 enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 			struct task_struct *task, unsigned long totalpages)
 {
@@ -279,7 +281,7 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	 * Don't allow any other task to have access to the reserves.
 	 */
 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-		if (!is_sysrq_oom(oc))
+		if (!is_sysrq_oom(oc) && try_oom_reap(task))
 			return OOM_SCAN_ABORT;
 	}
 	if (!task->mm)
@@ -420,6 +422,40 @@ static struct task_struct *oom_reaper_th;
 static struct task_struct *task_to_reap;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 
+static bool mm_is_reapable(struct mm_struct *mm)
+{
+	struct task_struct *g;
+	struct task_struct *p;
+
+	/*
+	 * Since it is possible that p voluntarily called do_exit() or
+	 * somebody other than the OOM killer sent SIGKILL on p, this mm used
+	 * by p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN is reapable if p
+	 * has pending SIGKILL or already reached do_exit().
+	 *
+	 * On the other hand, it is possible that mark_oom_victim(p) is called
+	 * without sending SIGKILL to all tasks using this mm. In this case,
+	 * the OOM reaper cannot reap this mm unless p is the only task using
+	 * this mm.
+	 *
+	 * Therefore, determine whether this mm is reapable by testing whether
+	 * all tasks using this mm are dying or already exiting rather than
+	 * depending on p->signal->oom_score_adj value which is updated by the
+	 * OOM reaper.
+	 */
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		if (mm != READ_ONCE(p->mm) ||
+		    fatal_signal_pending(p) || (p->flags & PF_EXITING))
+			continue;
+		mm = NULL;
+		goto out;
+	}
+ out:
+	rcu_read_unlock();
+	return mm != NULL;
+}
+
 static bool __oom_reap_task(struct task_struct *tsk)
 {
 	struct mmu_gather tlb;
@@ -448,7 +484,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 
 	task_unlock(p);
 
-	if (!down_read_trylock(&mm->mmap_sem)) {
+	if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
 		goto out;
 	}
@@ -500,7 +536,7 @@ static void oom_reap_task(struct task_struct *tsk)
 	while (attempts++ < 10 && !__oom_reap_task(tsk))
 		schedule_timeout_idle(HZ/10);
 
-	/* Drop a reference taken by wake_oom_reaper */
+	/* Drop a reference taken by try_oom_reap */
 	put_task_struct(tsk);
 }
 
@@ -512,18 +548,44 @@ static int oom_reaper(void *unused)
 		wait_event_freezable(oom_reaper_wait,
 				     (tsk = READ_ONCE(task_to_reap)));
 		oom_reap_task(tsk);
+		/*
+		 * The OOM killer might be about to call try_oom_reap() after
+		 * seeing TIF_MEMDIE.
+		 */
+		smp_wmb();
 		WRITE_ONCE(task_to_reap, NULL);
 	}
 
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+static bool try_oom_reap(struct task_struct *tsk)
 {
 	struct task_struct *old_tsk;
 
+	/*
+	 * We will livelock if we unconditionally return true.
+	 * We will kill all tasks if we unconditionally return false.
+	 */
 	if (!oom_reaper_th)
-		return;
+		return true;
+
+	/*
+	 * Wait for the OOM reaper to reap this task and mark this task
+	 * as OOM-unkillable and clear TIF_MEMDIE. Since the OOM reaper
+	 * has high scheduling priority, we can unconditionally wait for
+	 * completion.
+	 */
+	if (task_to_reap)
+		return true;
+
+	/*
+	 * The OOM reaper might be about to clear task_to_reap after
+	 * clearing TIF_MEMDIE.
+	 */
+	smp_rmb();
+	if (!test_tsk_thread_flag(tsk, TIF_MEMDIE))
+		return false;
 
 	get_task_struct(tsk);
 
@@ -537,6 +599,7 @@ static void wake_oom_reaper(struct task_struct *tsk)
 		wake_up(&oom_reaper_wait);
 	else
 		put_task_struct(tsk);
+	return true;
 }
 
 static int __init oom_init(void)
@@ -559,8 +622,13 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static void wake_oom_reaper(struct task_struct *mm)
+static bool try_oom_reap(struct task_struct *tsk)
 {
+	/*
+	 * We will livelock if we unconditionally return true.
+	 * We will kill all tasks if we unconditionally return false.
+	 */
+	return true;
 }
 #endif
 
@@ -592,7 +660,8 @@ void mark_oom_victim(struct task_struct *tsk)
  */
 void exit_oom_victim(struct task_struct *tsk)
 {
-	clear_tsk_thread_flag(tsk, TIF_MEMDIE);
+	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
+		return;
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
@@ -669,7 +738,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -762,23 +830,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (is_global_init(p))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-			/*
-			 * We cannot use oom_reaper for the mm shared by this
-			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
-			 */
-			can_oom_reap = false;
+		if (unlikely(p->flags & PF_KTHREAD))
+			continue;
+		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 			continue;
-		}
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
-
 	mmdrop(mm);
 	put_task_struct(victim);
 }
-- 
1.8.3.1

From 8c6024b963d5b4e8d38a3416e14b458e1e073607 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 3 Feb 2016 14:42:28 +0900
Subject: [PATCH 3/5] mm,oom: Always update OOM score and clear TIF_MEMDIE
 after OOM reap.

This patch updates the victim's oom_score_adj and clears TIF_MEMDIE
even if the OOM reaper failed to reap the victim's memory. This is
needed for handling corner cases where TIF_MEMDIE is set on a victim
without sending SIGKILL to all tasks sharing the same memory.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 50 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 40 insertions(+), 10 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 07c6389..a0ae8dc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -413,6 +413,44 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
 bool oom_killer_disabled __read_mostly;
 
+static void update_victim_score(struct task_struct *tsk, bool reap_success)
+{
+	/*
+	 * If we succeeded to reap a mm, mark that task using it as
+	 * OOM-unkillable and clear TIF_MEMDIE, for the task shouldn't be
+	 * sitting on a reasonably reclaimable memory anymore.
+	 * OOM killer can continue by selecting other victim if unmapping
+	 * hasn't led to any improvements. This also means that selecting
+	 * this task doesn't make any sense.
+	 */
+	if (reap_success)
+		tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+	/*
+	 * If we failed to reap a mm, mark that task using it as almost
+	 * OOM-unkillable and clear TIF_MEMDIE. This will help future
+	 * select_bad_process() try to select other OOM-killable tasks
+	 * before selecting that task again.
+	 */
+	else if (tsk->signal->oom_score_adj > OOM_SCORE_ADJ_MIN + 1)
+		tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN + 1;
+	/*
+	 * But if that task got TIF_MEMDIE when that task is already marked as
+	 * almost OOM-unkillable, mark that task completely OOM-unkillable.
+	 * Otherwise, we cannot make progress when all OOM-killable tasks are
+	 * marked as almost OOM-unkillable.
+	 *
+	 * Note that the reason we fail to reap a mm might be that there are
+	 * tasks using this mm with neither pending SIGKILL nor PF_EXITING
+	 * which means that we set TIF_MEMDIE on a task without sending SIGKILL
+	 * to tasks sharing this mm. In this case, we will call panic() without
+	 * sending SIGKILL to tasks sharing this mm when all OOM-killable tasks
+	 * are marked as completely OOM-unkillable.
+	 */
+	else
+		tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+	exit_oom_victim(tsk);
+}
+
 #ifdef CONFIG_MMU
 /*
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
@@ -513,16 +551,6 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
 	up_read(&mm->mmap_sem);
-
-	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore. OOM killer can continue
-	 * by selecting other victim if unmapping hasn't led to any
-	 * improvements. This also means that selecting this task doesn't
-	 * make any sense.
-	 */
-	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
-	exit_oom_victim(tsk);
 out:
 	mmput(mm);
 	return ret;
@@ -536,6 +564,8 @@ static void oom_reap_task(struct task_struct *tsk)
 	while (attempts++ < 10 && !__oom_reap_task(tsk))
 		schedule_timeout_idle(HZ/10);
 
+	update_victim_score(tsk, attempts <= 10);
+
 	/* Drop a reference taken by try_oom_reap */
 	put_task_struct(tsk);
 }
-- 
1.8.3.1

From d6254acc565af7456fb21c0bb7568452fb227f3c Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 3 Feb 2016 14:44:47 +0900
Subject: [PATCH 4/5] mm,oom: Add timeout counter for handling corner cases.

Currently, we can hit OOM livelock if the OOM reaper kernel thread is
not available. This patch adds a simple timeout based next victim
selection logic in case the OOM reaper kernel thread is unavailable.

A future patch will add hooks for allowing global access to memory
reserves before this timeout counter expires.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index a0ae8dc..e4e955b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -451,6 +451,11 @@ static void update_victim_score(struct task_struct *tsk, bool reap_success)
 	exit_oom_victim(tsk);
 }
 
+static void oomkiller_reset(unsigned long arg)
+{
+}
+static DEFINE_TIMER(oomkiller_victim_wait_timer, oomkiller_reset, 0, 0);
+
 #ifdef CONFIG_MMU
 /*
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
@@ -596,9 +601,14 @@ static bool try_oom_reap(struct task_struct *tsk)
 	/*
 	 * We will livelock if we unconditionally return true.
 	 * We will kill all tasks if we unconditionally return false.
+	 * Thus, use a simple timeout counter if the OOM reaper is unavailable.
 	 */
-	if (!oom_reaper_th)
-		return true;
+	if (!oom_reaper_th) {
+		if (timer_pending(&oomkiller_victim_wait_timer))
+			return true;
+		update_victim_score(tsk, false);
+		return false;
+	}
 
 	/*
 	 * Wait for the OOM reaper to reap this task and mark this task
@@ -654,11 +664,11 @@ subsys_initcall(oom_init)
 #else
 static bool try_oom_reap(struct task_struct *tsk)
 {
-	/*
-	 * We will livelock if we unconditionally return true.
-	 * We will kill all tasks if we unconditionally return false.
-	 */
-	return true;
+	/* Use a simple timeout counter, for the OOM reaper is unavailable. */
+	if (timer_pending(&oomkiller_victim_wait_timer))
+		return true;
+	update_victim_score(tsk, false);
+	return false;
 }
 #endif

@@ -683,6 +693,8 @@ void mark_oom_victim(struct task_struct *tsk)
 	 */
 	__thaw_task(tsk);
 	atomic_inc(&oom_victims);
+	/* Make sure that we won't wait for this task forever. */
+	mod_timer(&oomkiller_victim_wait_timer, jiffies + 5 * HZ);
 }
 
 /**
-- 
1.8.3.1

From 6156462d2db03bfc9fe76ca5a3f0ebcc5a88a12e Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 3 Feb 2016 14:45:29 +0900
Subject: [PATCH 5/5] mm,oom: Use normal scheduling priority for the OOM reaper

Currently, the OOM reaper kernel thread has high scheduling priority
in order to make sure that OOM-reap operation occurs immediately.

This patch changes the scheduling priority to normal, and falls back to
a simple timeout-based next-victim selection logic if the OOM reaper
fails to get enough CPU resource.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 16 +++++-----------
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e4e955b..b55159f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -604,6 +604,7 @@ static bool try_oom_reap(struct task_struct *tsk)
 	 * Thus, use a simple timeout counter if the OOM reaper is unavailable.
 	 */
 	if (!oom_reaper_th) {
+check_timeout:
 		if (timer_pending(&oomkiller_victim_wait_timer))
 			return true;
 		update_victim_score(tsk, false);
@@ -613,11 +614,12 @@ static bool try_oom_reap(struct task_struct *tsk)
 	/*
 	 * Wait for the OOM reaper to reap this task and mark this task
 	 * as OOM-unkillable and clear TIF_MEMDIE. Since the OOM reaper
-	 * has high scheduling priority, we can unconditionally wait for
-	 * completion.
+	 * has normal scheduling priority, we can't wait for completion
+	 * forever. Thus, use a simple timeout counter in case the OOM
+	 * reaper fails to get enough CPU resource.
 	 */
 	if (task_to_reap)
-		return true;
+		goto check_timeout;
 
 	/*
 	 * The OOM reaper might be about to clear task_to_reap after
@@ -649,14 +651,6 @@ static int __init oom_init(void)
 		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
 				PTR_ERR(oom_reaper_th));
 		oom_reaper_th = NULL;
-	} else {
-		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
-
-		/*
-		 * Make sure our oom reaper thread will get scheduled when
-		 * ASAP and that it won't get preempted by malicious userspace.
-		 */
-		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
 	}
 	return 0;
 }
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
@ 2016-02-03 10:31               ` Tetsuo Handa
  0 siblings, 0 replies; 56+ messages in thread
From: Tetsuo Handa @ 2016-02-03 10:31 UTC (permalink / raw)
  To: rientjes, mhocko, hannes
  Cc: akpm, mgorman, torvalds, oleg, hughd, andrea, riel, linux-mm,
	linux-kernel

David Rientjes wrote:
> On Tue, 2 Feb 2016, Michal Hocko wrote:
> > > I'm baffled by any reference to "memcg oom heavy loads", I don't 
> > > understand this paragraph, sorry.  If a memcg is oom, we shouldn't be
> > > disrupting the global runqueue by running oom_reaper at a high priority.  
> > > The disruption itself is not only in first wakeup but also in how long the 
> > > reaper can run and when it is rescheduled: for a lot of memory this is 
> > > potentially long.  The reaper is best-effort, as the changelog indicates, 
> > > and we shouldn't have a reliance on this high priority: oom kill exiting 
> > > can't possibly be expected to be immediate.  This high priority should be 
> > > removed so memcg oom conditions are isolated and don't affect other loads.
> > 
> > If this is a concern then I would be tempted to simply disable oom
> > reaper for memcg oom altogether. For me it is much more important that
> > the reaper, even though a best effort, is guaranteed to schedule if
> > something goes terribly wrong on the machine.
> > 
> 
> I don't believe the higher priority guarantees it is able to schedule any 
> more than it was guaranteed to schedule before.  It will run, but it won't 
> preempt other innocent processes in disjoint memcgs or cpusets.  It's not 
> only a memcg issue, but it also impacts disjoint cpuset mems and mempolicy 
> nodemasks.  I think it would be disappointing to leave those out.  I think 
> the higher priority should simply be removed in terms of fairness.
> 
> Other than these issues, I don't see any reason why a refreshed series 
> wouldn't be immediately acked.  Thanks very much for continuing to work on 
> this!
> 

Excuse me, but I have come to think that we should try to wake up the OOM reaper at

    if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
        if (!is_sysrq_oom(oc))
            return OOM_SCAN_ABORT;
    }

in oom_scan_process_thread() rather than at oom_kill_process() or at
mark_oom_victim(). Waking up the OOM reaper there makes it try to reap
task->mm and eventually give up, which in turn naturally allows the
OOM killer to choose the next OOM victim. The key point is PATCH 2/5 shown
below. What do you think?

PATCH 1/5 is (I think) a bug fix.
PATCH 2/5 is for waking up the OOM reaper from the victim selection loop.
PATCH 3/5 is for helping the OOM killer to choose the next OOM victim.
PATCH 4/5 is for handling corner cases.
PATCH 5/5 is for changing the OOM reaper to use default priority.

 include/linux/oom.h |    3
 mm/oom_kill.c       |  173 ++++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 136 insertions(+), 40 deletions(-)

----------------------------------------
From e1c0a78fbfd0a76f367efac269cbcf22c7df9292 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 3 Feb 2016 14:18:19 +0900
Subject: [PATCH 1/5] mm,oom: Fix incorrect oom_task_origin check.

Currently, the OOM killer unconditionally selects p if oom_task_origin(p)
is true, but p should not be OOM-killed if p is marked as OOM-unkillable.

This patch does not fix the race condition where p is selected because
an OOM event, triggered for some reason other than operations by p,
happens to occur between p calling set_current_oom_origin() and p
actually starting the operations that might trigger an OOM event.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/oom.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 45993b8..59481e6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -67,7 +67,8 @@ static inline void clear_current_oom_origin(void)
 
 static inline bool oom_task_origin(const struct task_struct *p)
 {
-	return !!(p->signal->oom_flags & OOM_FLAG_ORIGIN);
+	return (p->signal->oom_flags & OOM_FLAG_ORIGIN) &&
+		p->signal->oom_score_adj != OOM_SCORE_ADJ_MIN;
 }
 
 extern void mark_oom_victim(struct task_struct *tsk);
-- 
1.8.3.1

From 76cf60d33e4e1daa475e4c1e39087415a309c6e9 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 3 Feb 2016 14:20:07 +0900
Subject: [PATCH 2/5] mm,oom: Change timing of waking up the OOM reaper

Currently, the OOM reaper kernel thread is woken up when we set TIF_MEMDIE
on a task. But it is not easy to build a reliable OOM-reap queuing chain.

Since the OOM livelock problem occurs when we find TIF_MEMDIE on a task
which cannot terminate, waking up the OOM reaper when we find TIF_MEMDIE
on a task can simplify handling of the chain. Also, we don't need to wake
up the OOM reaper if the victim can smoothly terminate. Therefore, this
patch replaces wake_oom_reaper() called from oom_kill_process() with
try_oom_reap() called from oom_scan_process_thread().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 79 insertions(+), 20 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b51bcce..07c6389 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -268,6 +268,8 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc,
 }
 #endif
 
+static bool try_oom_reap(struct task_struct *tsk);
+
 enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 			struct task_struct *task, unsigned long totalpages)
 {
@@ -279,7 +281,7 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	 * Don't allow any other task to have access to the reserves.
 	 */
 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-		if (!is_sysrq_oom(oc))
+		if (!is_sysrq_oom(oc) && try_oom_reap(task))
 			return OOM_SCAN_ABORT;
 	}
 	if (!task->mm)
@@ -420,6 +422,40 @@ static struct task_struct *oom_reaper_th;
 static struct task_struct *task_to_reap;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 
+static bool mm_is_reapable(struct mm_struct *mm)
+{
+	struct task_struct *g;
+	struct task_struct *p;
+
+	/*
+	 * Since it is possible that p voluntarily called do_exit() or
+	 * somebody other than the OOM killer sent SIGKILL on p, this mm used
+	 * by p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN is reapable if p
+	 * has pending SIGKILL or already reached do_exit().
+	 *
+	 * On the other hand, it is possible that mark_oom_victim(p) is called
+	 * without sending SIGKILL to all tasks using this mm. In this case,
+	 * the OOM reaper cannot reap this mm unless p is the only task using
+	 * this mm.
+	 *
+	 * Therefore, determine whether this mm is reapable by testing whether
+	 * all tasks using this mm are dying or already exiting rather than
+	 * depending on p->signal->oom_score_adj value which is updated by the
+	 * OOM reaper.
+	 */
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		if (mm != READ_ONCE(p->mm) ||
+		    fatal_signal_pending(p) || (p->flags & PF_EXITING))
+			continue;
+		mm = NULL;
+		goto out;
+	}
+ out:
+	rcu_read_unlock();
+	return mm != NULL;
+}
+
 static bool __oom_reap_task(struct task_struct *tsk)
 {
 	struct mmu_gather tlb;
@@ -448,7 +484,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 
 	task_unlock(p);
 
-	if (!down_read_trylock(&mm->mmap_sem)) {
+	if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
 		goto out;
 	}
@@ -500,7 +536,7 @@ static void oom_reap_task(struct task_struct *tsk)
 	while (attempts++ < 10 && !__oom_reap_task(tsk))
 		schedule_timeout_idle(HZ/10);
 
-	/* Drop a reference taken by wake_oom_reaper */
+	/* Drop a reference taken by try_oom_reap */
 	put_task_struct(tsk);
 }
 
@@ -512,18 +548,44 @@ static int oom_reaper(void *unused)
 		wait_event_freezable(oom_reaper_wait,
 				     (tsk = READ_ONCE(task_to_reap)));
 		oom_reap_task(tsk);
+		/*
+		 * The OOM killer might be about to call try_oom_reap() after
+		 * seeing TIF_MEMDIE.
+		 */
+		smp_wmb();
 		WRITE_ONCE(task_to_reap, NULL);
 	}
 
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+static bool try_oom_reap(struct task_struct *tsk)
 {
 	struct task_struct *old_tsk;
 
+	/*
+	 * We will livelock if we unconditionally return true.
+	 * We will kill all tasks if we unconditionally return false.
+	 */
 	if (!oom_reaper_th)
-		return;
+		return true;
+
+	/*
+	 * Wait for the OOM reaper to reap this task and mark this task
+	 * as OOM-unkillable and clear TIF_MEMDIE. Since the OOM reaper
+	 * has high scheduling priority, we can unconditionally wait for
+	 * completion.
+	 */
+	if (task_to_reap)
+		return true;
+
+	/*
+	 * The OOM reaper might be about to clear task_to_reap after
+	 * clearing TIF_MEMDIE.
+	 */
+	smp_rmb();
+	if (!test_tsk_thread_flag(tsk, TIF_MEMDIE))
+		return false;
 
 	get_task_struct(tsk);
 
@@ -537,6 +599,7 @@ static void wake_oom_reaper(struct task_struct *tsk)
 		wake_up(&oom_reaper_wait);
 	else
 		put_task_struct(tsk);
+	return true;
 }
 
 static int __init oom_init(void)
@@ -559,8 +622,13 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static void wake_oom_reaper(struct task_struct *mm)
+static bool try_oom_reap(struct task_struct *tsk)
 {
+	/*
+	 * We will livelock if we unconditionally return true.
+	 * We will kill all tasks if we unconditionally return false.
+	 */
+	return true;
 }
 #endif
 
@@ -592,7 +660,8 @@ void mark_oom_victim(struct task_struct *tsk)
  */
 void exit_oom_victim(struct task_struct *tsk)
 {
-	clear_tsk_thread_flag(tsk, TIF_MEMDIE);
+	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
+		return;
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
@@ -669,7 +738,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -762,23 +830,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (is_global_init(p))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-			/*
-			 * We cannot use oom_reaper for the mm shared by this
-			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
-			 */
-			can_oom_reap = false;
+		if (unlikely(p->flags & PF_KTHREAD))
+			continue;
+		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 			continue;
-		}
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
-
 	mmdrop(mm);
 	put_task_struct(victim);
 }
-- 
1.8.3.1

From 8c6024b963d5b4e8d38a3416e14b458e1e073607 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 3 Feb 2016 14:42:28 +0900
Subject: [PATCH 3/5] mm,oom: Always update OOM score and clear TIF_MEMDIE
 after OOM reap.

This patch updates the victim's oom_score_adj and clears TIF_MEMDIE
even if the OOM reaper failed to reap the victim's memory. This is
needed for handling corner cases where TIF_MEMDIE is set on a victim
without sending SIGKILL to all tasks sharing the same memory.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 50 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 40 insertions(+), 10 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 07c6389..a0ae8dc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -413,6 +413,44 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
 bool oom_killer_disabled __read_mostly;
 
+static void update_victim_score(struct task_struct *tsk, bool reap_success)
+{
+	/*
+	 * If we succeeded to reap a mm, mark that task using it as
+	 * OOM-unkillable and clear TIF_MEMDIE, for the task shouldn't be
+	 * sitting on a reasonably reclaimable memory anymore.
+	 * OOM killer can continue by selecting other victim if unmapping
+	 * hasn't led to any improvements. This also means that selecting
+	 * this task doesn't make any sense.
+	 */
+	if (reap_success)
+		tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+	/*
+	 * If we failed to reap a mm, mark that task using it as almost
+	 * OOM-unkillable and clear TIF_MEMDIE. This will help future
+	 * select_bad_process() try to select other OOM-killable tasks
+	 * before selecting that task again.
+	 */
+	else if (tsk->signal->oom_score_adj > OOM_SCORE_ADJ_MIN + 1)
+		tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN + 1;
+	/*
+	 * But if that task got TIF_MEMDIE when that task is already marked as
+	 * almost OOM-unkillable, mark that task completely OOM-unkillable.
+	 * Otherwise, we cannot make progress when all OOM-killable tasks are
+	 * marked as almost OOM-unkillable.
+	 *
+	 * Note that the reason we fail to reap a mm might be that there are
+	 * tasks using this mm with neither pending SIGKILL nor PF_EXITING
+	 * which means that we set TIF_MEMDIE on a task without sending SIGKILL
+	 * to tasks sharing this mm. In this case, we will call panic() without
+	 * sending SIGKILL to tasks sharing this mm when all OOM-killable tasks
+	 * are marked as completely OOM-unkillable.
+	 */
+	else
+		tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+	exit_oom_victim(tsk);
+}
+
 #ifdef CONFIG_MMU
 /*
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
@@ -513,16 +551,6 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
 	up_read(&mm->mmap_sem);
-
-	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore. OOM killer can continue
-	 * by selecting other victim if unmapping hasn't led to any
-	 * improvements. This also means that selecting this task doesn't
-	 * make any sense.
-	 */
-	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
-	exit_oom_victim(tsk);
 out:
 	mmput(mm);
 	return ret;
@@ -536,6 +564,8 @@ static void oom_reap_task(struct task_struct *tsk)
 	while (attempts++ < 10 && !__oom_reap_task(tsk))
 		schedule_timeout_idle(HZ/10);
 
+	update_victim_score(tsk, attempts < 10);
+
 	/* Drop a reference taken by try_oom_reap */
 	put_task_struct(tsk);
 }
-- 
1.8.3.1

From d6254acc565af7456fb21c0bb7568452fb227f3c Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 3 Feb 2016 14:44:47 +0900
Subject: [PATCH 4/5] mm,oom: Add timeout counter for handling corner cases.

Currently, we can hit an OOM livelock if the OOM reaper kernel thread is
not available. This patch adds simple timeout-based next-victim
selection logic in case the OOM reaper kernel thread is unavailable.

A future patch will add hooks for allowing global access to memory
reserves before this timeout counter expires.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index a0ae8dc..e4e955b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -451,6 +451,11 @@ static void update_victim_score(struct task_struct *tsk, bool reap_success)
 	exit_oom_victim(tsk);
 }
 
+static void oomkiller_reset(unsigned long arg)
+{
+}
+static DEFINE_TIMER(oomkiller_victim_wait_timer, oomkiller_reset, 0, 0);
+
 #ifdef CONFIG_MMU
 /*
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
@@ -596,9 +601,14 @@ static bool try_oom_reap(struct task_struct *tsk)
 	/*
 	 * We will livelock if we unconditionally return true.
 	 * We will kill all tasks if we unconditionally return false.
+	 * Thus, use a simple timeout counter if the OOM reaper is unavailable.
 	 */
-	if (!oom_reaper_th)
-		return true;
+	if (!oom_reaper_th) {
+		if (timer_pending(&oomkiller_victim_wait_timer))
+			return true;
+		update_victim_score(tsk, false);
+		return false;
+	}
 
 	/*
 	 * Wait for the OOM reaper to reap this task and mark this task
@@ -654,11 +664,11 @@ subsys_initcall(oom_init)
 #else
 static bool try_oom_reap(struct task_struct *tsk)
 {
-	/*
-	 * We will livelock if we unconditionally return true.
-	 * We will kill all tasks if we unconditionally return false.
-	 */
-	return true;
+	/* Use a simple timeout counter, for the OOM reaper is unavailable. */
+	if (timer_pending(&oomkiller_victim_wait_timer))
+		return true;
+	update_victim_score(tsk, false);
+	return false;
 }
 #endif

@@ -683,6 +693,8 @@ void mark_oom_victim(struct task_struct *tsk)
 	 */
 	__thaw_task(tsk);
 	atomic_inc(&oom_victims);
+	/* Make sure that we won't wait for this task forever. */
+	mod_timer(&oomkiller_victim_wait_timer, jiffies + 5 * HZ);
 }
 
 /**
-- 
1.8.3.1

From 6156462d2db03bfc9fe76ca5a3f0ebcc5a88a12e Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 3 Feb 2016 14:45:29 +0900
Subject: [PATCH 5/5] mm,oom: Use normal scheduling priority for the OOM reaper

Currently, the OOM reaper kernel thread has high scheduling priority
in order to make sure that OOM-reap operation occurs immediately.

This patch changes the scheduling priority to normal, and falls back to
simple timeout-based next-victim selection logic if the OOM reaper
fails to get enough CPU time.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 16 +++++-----------
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e4e955b..b55159f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -604,6 +604,7 @@ static bool try_oom_reap(struct task_struct *tsk)
 	 * Thus, use a simple timeout counter if the OOM reaper is unavailable.
 	 */
 	if (!oom_reaper_th) {
+check_timeout:
 		if (timer_pending(&oomkiller_victim_wait_timer))
 			return true;
 		update_victim_score(tsk, false);
@@ -613,11 +614,12 @@ static bool try_oom_reap(struct task_struct *tsk)
 	/*
 	 * Wait for the OOM reaper to reap this task and mark this task
 	 * as OOM-unkillable and clear TIF_MEMDIE. Since the OOM reaper
-	 * has high scheduling priority, we can unconditionally wait for
-	 * completion.
+	 * has normal scheduling priority, we can't wait for completion
+	 * forever. Thus, use a simple timeout counter in case the OOM
+	 * reaper fails to get enough CPU resource.
 	 */
 	if (task_to_reap)
-		return true;
+		goto check_timeout;
 
 	/*
 	 * The OOM reaper might be about to clear task_to_reap after
@@ -649,14 +651,6 @@ static int __init oom_init(void)
 		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
 				PTR_ERR(oom_reaper_th));
 		oom_reaper_th = NULL;
-	} else {
-		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
-
-		/*
-		 * Make sure our oom reaper thread will get scheduled when
-		 * ASAP and that it won't get preempted by malicious userspace.
-		 */
-		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
 	}
 	return 0;
 }
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 3/2] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-01-11 16:52     ` Johannes Weiner
@ 2016-02-15 10:58       ` Tetsuo Handa
  -1 siblings, 0 replies; 56+ messages in thread
From: Tetsuo Handa @ 2016-02-15 10:58 UTC (permalink / raw)
  To: hannes, mhocko
  Cc: akpm, mgorman, rientjes, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel, mhocko

Andrew Morton wrote:
> 
> The patch titled
>      Subject: mm/oom_kill.c: don't ignore oom score on exiting tasks
> has been removed from the -mm tree.  Its filename was
>      mm-oom_killc-dont-skip-pf_exiting-tasks-when-searching-for-a-victim.patch
> 
> This patch was dropped because an updated version will be merged
> 
> ------------------------------------------------------
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: mm/oom_kill.c: don't ignore oom score on exiting tasks
> 
> When the OOM killer scans tasks and encounters a PF_EXITING one, it
> force-selects that one regardless of the score.  Is there a possibility
> that the task might hang after it has set PF_EXITING?  In that case the
> OOM killer should be able to move on to the next task.
> 
> Frankly, I don't even know why we check for exiting tasks in the OOM
> killer.  We've tried direct reclaim at least 15 times by the time we
> decide the system is OOM, there was plenty of time to exit and free
> memory; and a task might exit voluntarily right after we issue a kill. 
> This is testing pure noise.
> 

I can't find an updated version of this patch in linux-next. Why don't you
submit it? I think the patch description should be updated because this patch
solves yet another silent OOM livelock bug.

Say there is a process with two threads named Thread1 and Thread2.
Since the OOM killer sets TIF_MEMDIE only on the first task with a non-NULL
mm, it is possible that Thread2 invokes the OOM killer and Thread1 gets
TIF_MEMDIE (without sending SIGKILL to processes using Thread1's mm).

----------
Thread1                       Thread2
                              Calls mmap()
Calls _exit(0)
                              Arrives at vm_mmap_pgoff()
Arrives at do_exit()
Gets PF_EXITING via exit_signals()
                              Calls down_write(&mm->mmap_sem)
                              Calls do_mmap_pgoff()
Calls down_read(&mm->mmap_sem) from exit_mm()
                              Does a GFP_KERNEL allocation
                              Calls out_of_memory()
                              oom_scan_process_thread(Thread1) returns OOM_SCAN_ABORT

down_read(&mm->mmap_sem) is waiting for Thread2 to call up_write(&mm->mmap_sem)
                              but Thread2 is waiting for Thread1 to set Thread1->mm = NULL ... silent OOM livelock!
----------

The OOM reaper tries to avoid this livelock by using down_read_trylock()
instead of down_read(), but the down_read() for the core_state check in
exit_mm() cannot avoid it unless we use non-blocking allocations (i.e.
GFP_ATOMIC or GFP_NOWAIT) for allocations between
down_write(&mm->mmap_sem) and up_write(&mm->mmap_sem).

I think that the same problem exists for any task_will_free_mem()-based
optimizations such as

        if (current->mm &&
            (fatal_signal_pending(current) || task_will_free_mem(current))) {
                mark_oom_victim(current);
                return true;
        }

in out_of_memory() and

        task_lock(p);
        if (p->mm && task_will_free_mem(p)) {
                mark_oom_victim(p);
                task_unlock(p);
                put_task_struct(p);
                return;
        }
        task_unlock(p);

in oom_kill_process() and

        if (fatal_signal_pending(current) || task_will_free_mem(current)) {
                mark_oom_victim(current);
                goto unlock;
        }

in mem_cgroup_out_of_memory().

Well, what are the possible callers of task_will_free_mem(current) between
getting PF_EXITING and doing current->mm = NULL? tty_audit_exit() seems to be
one example: it does a GFP_KERNEL allocation from tty_audit_log(), so
TIF_MEMDIE could be set there and the task could later be blocked at
down_read() in exit_mm().

Is task_will_free_mem(current) possible in the mem_cgroup_out_of_memory() case?

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2016-02-15 10:59 UTC | newest]

Thread overview: 56+ messages
2016-01-06 15:42 [PATCH 0/2 -mm] oom reaper v4 Michal Hocko
2016-01-06 15:42 ` Michal Hocko
2016-01-06 15:42 ` [PATCH 1/2] mm, oom: introduce oom reaper Michal Hocko
2016-01-06 15:42   ` Michal Hocko
2016-01-07 11:23   ` Tetsuo Handa
2016-01-07 11:23     ` Tetsuo Handa
2016-01-07 12:30     ` Michal Hocko
2016-01-07 12:30       ` Michal Hocko
2016-01-11 22:54   ` Andrew Morton
2016-01-11 22:54     ` Andrew Morton
2016-01-12  8:16     ` Michal Hocko
2016-01-12  8:16       ` Michal Hocko
2016-01-28  1:28   ` David Rientjes
2016-01-28  1:28     ` David Rientjes
2016-01-28 21:42     ` Michal Hocko
2016-01-28 21:42       ` Michal Hocko
2016-02-02  3:02       ` David Rientjes
2016-02-02  3:02         ` David Rientjes
2016-02-02  8:57         ` Michal Hocko
2016-02-02  8:57           ` Michal Hocko
2016-02-02 11:48           ` Tetsuo Handa
2016-02-02 11:48             ` Tetsuo Handa
2016-02-02 22:55             ` David Rientjes
2016-02-02 22:55               ` David Rientjes
2016-02-02 22:51           ` David Rientjes
2016-02-02 22:51             ` David Rientjes
2016-02-03 10:31             ` Tetsuo Handa
2016-02-03 10:31               ` Tetsuo Handa
2016-01-06 15:42 ` [PATCH 2/2] oom reaper: handle anonymous mlocked pages Michal Hocko
2016-01-06 15:42   ` Michal Hocko
2016-01-07  8:14   ` Michal Hocko
2016-01-07  8:14     ` Michal Hocko
2016-01-11 12:42 ` [PATCH 3/2] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space Michal Hocko
2016-01-11 12:42   ` Michal Hocko
2016-01-11 16:52   ` Johannes Weiner
2016-01-11 16:52     ` Johannes Weiner
2016-01-11 17:46     ` Michal Hocko
2016-01-11 17:46       ` Michal Hocko
2016-02-15 10:58     ` Tetsuo Handa
2016-02-15 10:58       ` Tetsuo Handa
2016-01-18  4:35   ` Tetsuo Handa
2016-01-18  4:35     ` Tetsuo Handa
2016-01-18 10:22     ` Tetsuo Handa
2016-01-18 10:22       ` Tetsuo Handa
2016-01-26 16:38     ` Michal Hocko
2016-01-26 16:38       ` Michal Hocko
2016-01-28 11:24       ` Tetsuo Handa
2016-01-28 11:24         ` Tetsuo Handa
2016-01-28 21:51         ` Michal Hocko
2016-01-28 21:51           ` Michal Hocko
2016-01-28 22:26           ` Tetsuo Handa
2016-01-28 22:26             ` Tetsuo Handa
2016-01-28 22:36             ` Michal Hocko
2016-01-28 22:36               ` Michal Hocko
2016-01-28 22:33   ` Michal Hocko
2016-01-28 22:33     ` Michal Hocko
