* [PATCH 1/2] mm, oom: introduce oom reaper
@ 2015-12-15 18:36 ` Michal Hocko
  0 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2015-12-15 18:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Tetsuo Handa, David Rientjes, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
independently brought up by Oleg Nesterov.

The OOM killer currently kills only a single task in the hope that the
task will terminate in a reasonable time and free up its memory. Such a
task (oom victim) gets access to memory reserves via mark_oom_victim to
allow forward progress should it need additional memory on the exit
path.

It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above: the
OOM victim might take an unbounded amount of time to exit because it
might be blocked in the uninterruptible state waiting on an event
(e.g. a lock) held by another task looping in the page allocator.

This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped out memory
owned by the oom victim, under the assumption that such memory won't
be needed once its owner is killed and kicked out of userspace anyway.
There is one notable exception, though: if the OOM victim was in the
process of coredumping, the result would be incomplete. This is
considered a reasonable constraint because the overall system health is
more important than the debuggability of a particular application.

A kernel thread has been chosen because we need a reliable way of
invocation. A workqueue context is not appropriate because all the
workers might be busy (e.g. allocating memory). Kswapd, which sounds
like another good fit, is not appropriate either because it might get
blocked on locks during reclaim.

oom_reaper has to take mmap_sem on the target task for reading, so the
solution is not 100% reliable because the semaphore might be held, or
blocked, for write; but the probability is reduced considerably compared
to basically any lock blocking forward progress as described above. To
avoid blocking on the lock without any forward progress we use only a
trylock and retry up to 10 times with a short sleep in between.
Users of mmap_sem which need it for write should be carefully reviewed
to use _killable waiting as much as possible, and to reduce allocation
requests done with the lock held to the absolute minimum, to reduce the
risk even further.

The API between the oom killer and the oom reaper is quite trivial.
wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only a
NULL->mm transition, and oom_reaper clears it atomically once it is done
with the work. This means that only a single mm_struct can be reaped at
a time. As the operation is potentially disruptive we try to limit it to
the necessary minimum, and the reaper blocks any updates while it
operates on an mm. The mm_struct is pinned by mm_count to allow a
parallel exit_mmap, and the race is detected by
atomic_inc_not_zero(mm_users).

Changes since v2
- fix mm_count reference leak reported by Tetsuo
- make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
- use wait_event_freezable rather than open coded wait loop - suggested
  by Tetsuo
Changes since v1
- fix the screwed up detail->check_swap_entries - Johannes
- do not use kthread_should_stop because that would need a cleanup
  and we do not have anybody to stop us - Tetsuo
- move wake_oom_reaper to oom_kill_process because we have to wait
  for all tasks sharing the same mm to get killed - Tetsuo
- do not reap mm structs which are shared with unkillable tasks - Tetsuo

Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
Hi,
this is version 3 of the patch. Thanks to Tetsuo for his exhaustive
review.  There haven't been any fundamental objections to the approach
so I have dropped the RFC status. Any further feedback is welcome of
course.

 include/linux/mm.h |   2 +
 mm/internal.h      |   5 ++
 mm/memory.c        |  12 ++---
 mm/oom_kill.c      | 139 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 149 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 25cdec395f2c..d1ce03569942 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1061,6 +1061,8 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
+	bool ignore_dirty;			/* Ignore dirty pages */
+	bool check_swap_entries;		/* Check also swap entries */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/internal.h b/mm/internal.h
index 4ae7b7c7462b..9006ce1960ff 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -41,6 +41,11 @@ extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
+void unmap_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end,
+			     struct zap_details *details);
+
 static inline void set_page_count(struct page *page, int v)
 {
 	atomic_set(&page->_count, v);
diff --git a/mm/memory.c b/mm/memory.c
index f5b8e8c9f4c3..ccfd48c7286e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1104,6 +1104,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			if (!PageAnon(page)) {
 				if (pte_dirty(ptent)) {
+					/* oom_reaper cannot tear down dirty pages */
+					if (unlikely(details && details->ignore_dirty))
+						continue;
 					force_flush = 1;
 					set_page_dirty(page);
 				}
@@ -1122,8 +1125,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			}
 			continue;
 		}
-		/* If details->check_mapping, we leave swap entries. */
-		if (unlikely(details))
+		/* only check swap_entries if explicitly asked for in details */
+		if (unlikely(details && !details->check_swap_entries))
 			continue;
 
 		entry = pte_to_swp_entry(ptent);
@@ -1228,7 +1231,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 	return addr;
 }
 
-static void unmap_page_range(struct mmu_gather *tlb,
+void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
 			     struct zap_details *details)
@@ -1236,9 +1239,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
 	pgd_t *pgd;
 	unsigned long next;
 
-	if (details && !details->check_mapping)
-		details = NULL;
-
 	BUG_ON(addr >= end);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5314b206caa5..48025a21f8c4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,6 +35,11 @@
 #include <linux/freezer.h>
 #include <linux/ftrace.h>
 #include <linux/ratelimit.h>
+#include <linux/kthread.h>
+#include <linux/module.h>
+
+#include <asm/tlb.h>
+#include "internal.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/oom.h>
@@ -408,6 +413,105 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
 bool oom_killer_disabled __read_mostly;
 
+/*
+ * OOM Reaper kernel thread which tries to reap the memory used by the OOM
+ * victim (if that is possible) to help the OOM killer to move on.
+ */
+static struct task_struct *oom_reaper_th;
+static struct mm_struct *mm_to_reap;
+static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
+
+static bool __oom_reap_vmas(struct mm_struct *mm)
+{
+	struct mmu_gather tlb;
+	struct vm_area_struct *vma;
+	struct zap_details details = {.check_swap_entries = true,
+				      .ignore_dirty = true};
+	bool ret = true;
+
+	/* We might have raced with exit path */
+	if (!atomic_inc_not_zero(&mm->mm_users))
+		return true;
+
+	if (!down_read_trylock(&mm->mmap_sem)) {
+		ret = false;
+		goto out;
+	}
+
+	tlb_gather_mmu(&tlb, mm, 0, -1);
+	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+		if (is_vm_hugetlb_page(vma))
+			continue;
+
+		/*
+		 * Only anonymous pages have a good chance to be dropped
+		 * without additional steps which we cannot afford as we
+		 * are OOM already.
+		 *
+		 * We do not even care about fs backed pages because all
+		 * which are reclaimable have already been reclaimed and
+		 * we do not want to block exit_mmap by keeping mm ref
+		 * count elevated without a good reason.
+		 */
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
+					 &details);
+	}
+	tlb_finish_mmu(&tlb, 0, -1);
+	up_read(&mm->mmap_sem);
+out:
+	mmput(mm);
+	return ret;
+}
+
+static void oom_reap_vmas(struct mm_struct *mm)
+{
+	int attempts = 0;
+
+	while (attempts++ < 10 && !__oom_reap_vmas(mm))
+		schedule_timeout(HZ/10);
+
+	/* Drop a reference taken by wake_oom_reaper */
+	mmdrop(mm);
+}
+
+static int oom_reaper(void *unused)
+{
+	while (true) {
+		struct mm_struct *mm;
+		wait_event_freezable(oom_reaper_wait, (mm = READ_ONCE(mm_to_reap)));
+		oom_reap_vmas(mm);
+		WRITE_ONCE(mm_to_reap, NULL);
+	}
+
+	return 0;
+}
+
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+	struct mm_struct *old_mm;
+
+	if (!oom_reaper_th)
+		return;
+
+	/*
+	 * Pin the given mm. Use mm_count instead of mm_users because
+	 * we do not want to delay the address space tear down.
+	 */
+	atomic_inc(&mm->mm_count);
+
+	/*
+	 * Make sure that only a single mm is ever queued for the reaper
+	 * because multiple are not necessary and the operation might be
+	 * disruptive so better reduce it to the bare minimum.
+	 */
+	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
+	if (!old_mm)
+		wake_up(&oom_reaper_wait);
+	else
+		mmdrop(mm);
+}
+
 /**
  * mark_oom_victim - mark the given task as OOM victim
  * @tsk: task to mark
@@ -517,6 +621,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
+	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -607,15 +712,23 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD))
-			continue;
-		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		if (unlikely(p->flags & PF_KTHREAD) ||
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+			/*
+			 * We cannot use oom_reaper for the mm shared by this process
+			 * because it wouldn't get killed and so the memory might be
+			 * still used.
+			 */
+			can_oom_reap = false;
 			continue;
+		}
 
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
+	if (can_oom_reap)
+		wake_oom_reaper(mm);
 	mmdrop(mm);
 	put_task_struct(victim);
 }
@@ -767,3 +880,23 @@ void pagefault_out_of_memory(void)
 
 	mutex_unlock(&oom_lock);
 }
+
+static int __init oom_init(void)
+{
+	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+	if (IS_ERR(oom_reaper_th)) {
+		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
+				PTR_ERR(oom_reaper_th));
+		oom_reaper_th = NULL;
+	} else {
+		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+
+		/*
 +		 * Make sure our oom reaper thread will get scheduled
 +		 * ASAP and that it won't get preempted by malicious userspace.
+		 */
+		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
+	}
+	return 0;
+}
+module_init(oom_init)
-- 
2.6.2



* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-15 18:36 ` Michal Hocko
@ 2015-12-17  0:50   ` Andrew Morton
  -1 siblings, 0 replies; 78+ messages in thread
From: Andrew Morton @ 2015-12-17  0:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Tetsuo Handa, David Rientjes, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

On Tue, 15 Dec 2015 19:36:15 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> independently brought up by Oleg Nesterov.
> 
> The OOM killer currently allows to kill only a single task in a good
> hope that the task will terminate in a reasonable time and frees up its
> memory.  Such a task (oom victim) will get an access to memory reserves
> via mark_oom_victim to allow a forward progress should there be a need
> for additional memory during exit path.
>
> ...
>
> +static void oom_reap_vmas(struct mm_struct *mm)
> +{
> +	int attempts = 0;
> +
> +	while (attempts++ < 10 && !__oom_reap_vmas(mm))
> +		schedule_timeout(HZ/10);

schedule_timeout() in state TASK_RUNNING doesn't do anything.  Use
msleep() or msleep_interruptible().  I can't decide which is more
appropriate - it only affects the load average display.

Which prompts the obvious question: as the no-operativeness of this
call wasn't noticed in testing, why do we have it there...

I guess it means that the __oom_reap_vmas() success rate is nice and
high ;)


* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-17  0:50   ` Andrew Morton
@ 2015-12-17 13:02     ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2015-12-17 13:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Tetsuo Handa, David Rientjes, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Wed 16-12-15 16:50:35, Andrew Morton wrote:
> On Tue, 15 Dec 2015 19:36:15 +0100 Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > +static void oom_reap_vmas(struct mm_struct *mm)
> > +{
> > +	int attempts = 0;
> > +
> > +	while (attempts++ < 10 && !__oom_reap_vmas(mm))
> > +		schedule_timeout(HZ/10);
> 
> schedule_timeout() in state TASK_RUNNING doesn't do anything.  Use
> msleep() or msleep_interruptible().  I can't decide which is more
> appropriate - it only affects the load average display.

Oops. You are right. I will go with msleep_interruptible(100).
 
> Which prompts the obvious question: as the no-operativeness of this
> call wasn't noticed in testing, why do we have it there...

Well, the idea was that an interfering mmap_sem operation which holds
it for write might block us only for a short time period - e.g. when it
does not depend on an allocation, or when access to the memory reserves
helps the allocation to make progress. If the holder of the semaphore is
stuck then the retry is pointless. On the other hand the retry shouldn't
be harmful. All in all this is just a heuristic and we do not depend on
it. I guess we can drop it and nobody would actually notice. Let me know
if you prefer that and I will respin the patch.


> I guess it means that the __oom_reap_vmas() success rate is nice and
> high ;)

I had debugging trace_printks around this and there were no retries
during my testing, so I was probably lucky not to trigger the mmap_sem
contention.
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 48025a21f8c4..f53f87cfd899 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -469,7 +469,7 @@ static void oom_reap_vmas(struct mm_struct *mm)
 	int attempts = 0;
 
 	while (attempts++ < 10 && !__oom_reap_vmas(mm))
-		schedule_timeout(HZ/10);
+		msleep_interruptible(100);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	mmdrop(mm);

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
@ 2015-12-17 13:02     ` Michal Hocko
  0 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2015-12-17 13:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Tetsuo Handa, David Rientjes, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Wed 16-12-15 16:50:35, Andrew Morton wrote:
> On Tue, 15 Dec 2015 19:36:15 +0100 Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > +static void oom_reap_vmas(struct mm_struct *mm)
> > +{
> > +	int attempts = 0;
> > +
> > +	while (attempts++ < 10 && !__oom_reap_vmas(mm))
> > +		schedule_timeout(HZ/10);
> 
> schedule_timeout() in state TASK_RUNNING doesn't do anything.  Use
> msleep() or msleep_interruptible().  I can't decide which is more
> appropriate - it only affects the load average display.

Ups. You are right. I will go with msleep_interruptible(100).
 
> Which prompts the obvious question: as the no-operativeness of this
> call wasn't noticed in testing, why do we have it there...

Well, the idea was that an interfering mmap_sem operation which holds
it for write might block us for a short time period - e.g. when not
depending on an allocation or accessing the memory reserves helps
to progress the allocation. If the holder of the semaphore is stuck
then the retry is pointless. On the other hand the retry shouldn't be
harmful. All in all this is just a heuristic and we do not depend on
it. I guess we can drop it and nobody would actually notice. Let me know
if you prefer that and I will respin the patch.


> I guess it means that the __oom_reap_vmas() success rate is nice anud
> high ;)

I had a debugging trace_printks around this and there were no reties
during my testing so I was probably lucky to not trigger the mmap_sem
contention.
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 48025a21f8c4..f53f87cfd899 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -469,7 +469,7 @@ static void oom_reap_vmas(struct mm_struct *mm)
 	int attempts = 0;
 
 	while (attempts++ < 10 && !__oom_reap_vmas(mm))
-		schedule_timeout(HZ/10);
+		msleep_interruptible(100);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	mmdrop(mm);

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-17 13:02     ` Michal Hocko
@ 2015-12-17 19:55       ` Linus Torvalds
  -1 siblings, 0 replies; 78+ messages in thread
From: Linus Torvalds @ 2015-12-17 19:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, David Rientjes,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Thu, Dec 17, 2015 at 5:02 AM, Michal Hocko <mhocko@kernel.org> wrote:
> Oops. You are right. I will go with msleep_interruptible(100).

I don't think that's right.

If a signal happens, that loop is now (again) just busy-looping. That
doesn't sound right, although with the maximum limit of 10 attempts,
maybe it's fine - the thing is technically "busylooping", but it will
definitely not busy-loop for very long.

So maybe that code is fine, but I think the signal case might at least
merit a comment?

Also, if you actually do want UNINTERRUPTIBLE (no reaction to signals
at all), but don't want to be seen as being "load" on the system, you
can use TASK_IDLE, which is a combination of TASK_UNINTERRUPTIBLE |
TASK_NOLOAD.

Because if you sleep interruptibly, you do generally need to handle
signals (although that limit count may make it ok in this case).

There's basically three levels:

 - TASK_UNINTERRUPTIBLE: no signal handling at all

 - TASK_KILLABLE: no normal signal handling, but ok to be killed
(needs to check fatal_signal_pending() and exit)

 - TASK_INTERRUPTIBLE: will react to signals

(and then that TASK_IDLE thing that is semantically the same as
uninterruptible, but doesn't count against the load average).

The main use for TASK_KILLABLE is in places where the expected semantics
do not allow an EINTR return, but we know that because the process is
about to be killed, we can ignore that, for the simple reason that
nobody will ever *see* the EINTR.
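Those levels map onto the usual kernel sleep patterns; the following is a
kernel-style pseudocode sketch (the identifiers exist in
include/linux/sched.h, but this fragment is illustrative, not buildable
on its own):

```c
set_current_state(TASK_UNINTERRUPTIBLE);	/* ignores all signals */
schedule_timeout(HZ / 10);

set_current_state(TASK_KILLABLE);		/* fatal signals only */
schedule_timeout(HZ / 10);
if (fatal_signal_pending(current))
	return -EINTR;		/* never observed: the process is dying */

set_current_state(TASK_INTERRUPTIBLE);		/* any signal wakes us */
schedule_timeout(HZ / 10);
if (signal_pending(current))
	return -ERESTARTSYS;

/* TASK_IDLE == TASK_UNINTERRUPTIBLE | TASK_NOLOAD: sleeps like
 * uninterruptible but is not counted in the load average. */
set_current_state(TASK_IDLE);
schedule_timeout(HZ / 10);
```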

Btw, I think you might want to re-run your test-case after this
change, since the whole "busy loop vs actually sleeping" might just
have changed the result..

              Linus

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-17 19:55       ` Linus Torvalds
@ 2015-12-17 20:00         ` Andrew Morton
  -1 siblings, 0 replies; 78+ messages in thread
From: Andrew Morton @ 2015-12-17 20:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michal Hocko, Mel Gorman, Tetsuo Handa, David Rientjes,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Thu, 17 Dec 2015 11:55:11 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Dec 17, 2015 at 5:02 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > Oops. You are right. I will go with msleep_interruptible(100).
> 
> I don't think that's right.
> 
> If a signal happens, that loop is now (again) just busy-looping.

It's called only by a kernel thread so no signal_pending().  This
relationship is a bit unobvious and fragile, but we do it in quite a
few places.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-17 13:02     ` Michal Hocko
@ 2015-12-17 21:13       ` Andrew Morton
  -1 siblings, 0 replies; 78+ messages in thread
From: Andrew Morton @ 2015-12-17 21:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Tetsuo Handa, David Rientjes, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Thu, 17 Dec 2015 14:02:24 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> > I guess it means that the __oom_reap_vmas() success rate is nice and
> > high ;)
> 
> I had debugging trace_printks around this and there were no retries
> during my testing so I was probably lucky to not trigger the mmap_sem
> contention.
> ---
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 48025a21f8c4..f53f87cfd899 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -469,7 +469,7 @@ static void oom_reap_vmas(struct mm_struct *mm)
>  	int attempts = 0;
>  
>  	while (attempts++ < 10 && !__oom_reap_vmas(mm))
> -		schedule_timeout(HZ/10);
> +		msleep_interruptible(100);
>  
>  	/* Drop a reference taken by wake_oom_reaper */
>  	mmdrop(mm);

Timeliness matters here.  Over on the other CPU, direct reclaim is
pounding along, on its way to declaring oom.  Sometimes the oom_reaper
thread will end up scavenging memory on behalf of a caller who gave up
a long time ago.  But we shouldn't attempt to "fix" that unless we can
demonstrate that it's a problem.


Also, re-reading your description:

: It has been shown (e.g.  by Tetsuo Handa) that it is not that hard to
: construct workloads which break the core assumption mentioned above and
: the OOM victim might take an unbounded amount of time to exit because it
: might be blocked in the uninterruptible state waiting for an event
: (e.g.  lock) which is blocked by another task looping in the page
: allocator.

So the allocating task has done an oom-kill and is waiting for memory
to become available.  The killed task is stuck on some lock, unable to
free memory.

But the problematic lock will sometimes be the killed task's mmap_sem,
so the reaper won't reap anything.  This scenario requires that the
mmap_sem is held for writing, which sounds like it will be uncommon.
hm.  sigh.  I hate the oom-killer.  Just buy some more memory already!



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-15 18:36 ` Michal Hocko
@ 2015-12-18  0:15   ` Andrew Morton
  -1 siblings, 0 replies; 78+ messages in thread
From: Andrew Morton @ 2015-12-18  0:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Tetsuo Handa, David Rientjes, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

On Tue, 15 Dec 2015 19:36:15 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> This patch reduces the probability of such a lockup by introducing a
> specialized kernel thread (oom_reaper) 

CONFIG_MMU=n:

slub.c:(.text+0x4184): undefined reference to `tlb_gather_mmu'
slub.c:(.text+0x41bc): undefined reference to `unmap_page_range'
slub.c:(.text+0x41d8): undefined reference to `tlb_finish_mmu'

I did the below so I can get an mmotm out the door, but hopefully
there's a cleaner way.

--- a/mm/oom_kill.c~mm-oom-introduce-oom-reaper-fix-3
+++ a/mm/oom_kill.c
@@ -415,6 +415,7 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victi
 
 bool oom_killer_disabled __read_mostly;
 
+#ifdef CONFIG_MMU
 /*
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
  * victim (if that is possible) to help the OOM killer to move on.
@@ -517,6 +518,27 @@ static void wake_oom_reaper(struct mm_st
 		mmdrop(mm);
 }
 
+static int __init oom_init(void)
+{
+	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+	if (IS_ERR(oom_reaper_th)) {
+		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
+				PTR_ERR(oom_reaper_th));
+		oom_reaper_th = NULL;
+	} else {
+		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+
+		/*
+		 * Make sure our oom reaper thread will get scheduled when
+		 * ASAP and that it won't get preempted by malicious userspace.
+		 */
+		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
+	}
+	return 0;
+}
+module_init(oom_init)
+#endif
+
 /**
  * mark_oom_victim - mark the given task as OOM victim
  * @tsk: task to mark
@@ -626,7 +648,9 @@ void oom_kill_process(struct oom_control
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
+#ifdef CONFIG_MMU
 	bool can_oom_reap = true;
+#endif
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -719,6 +743,7 @@ void oom_kill_process(struct oom_control
 			continue;
 		if (is_global_init(p))
 			continue;
+#ifdef CONFIG_MMU
 		if (unlikely(p->flags & PF_KTHREAD) ||
 		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
 			/*
@@ -729,13 +754,16 @@ void oom_kill_process(struct oom_control
 			can_oom_reap = false;
 			continue;
 		}
-
+#endif
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
+#ifdef CONFIG_MMU
 	if (can_oom_reap)
 		wake_oom_reaper(mm);
+#endif
+
 	mmdrop(mm);
 	put_task_struct(victim);
 }
@@ -887,23 +915,3 @@ void pagefault_out_of_memory(void)
 
 	mutex_unlock(&oom_lock);
 }
-
-static int __init oom_init(void)
-{
-	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
-	if (IS_ERR(oom_reaper_th)) {
-		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
-				PTR_ERR(oom_reaper_th));
-		oom_reaper_th = NULL;
-	} else {
-		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
-
-		/*
-		 * Make sure our oom reaper thread will get scheduled when
-		 * ASAP and that it won't get preempted by malicious userspace.
-		 */
-		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
-	}
-	return 0;
-}
-module_init(oom_init)
_


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-18  0:15   ` Andrew Morton
@ 2015-12-18 11:48     ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2015-12-18 11:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Tetsuo Handa, David Rientjes, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Thu 17-12-15 16:15:21, Andrew Morton wrote:
> On Tue, 15 Dec 2015 19:36:15 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > This patch reduces the probability of such a lockup by introducing a
> > specialized kernel thread (oom_reaper) 
> 
> CONFIG_MMU=n:
> 
> slub.c:(.text+0x4184): undefined reference to `tlb_gather_mmu'
> slub.c:(.text+0x41bc): undefined reference to `unmap_page_range'
> slub.c:(.text+0x41d8): undefined reference to `tlb_finish_mmu'
> 
> I did the below so I can get an mmotm out the door, but hopefully
> there's a cleaner way.

Sorry about that and thanks for your fixup! I am not very familiar with
the !MMU world and haven't heard about OOM deadlock issues there yet. So
I guess making this MMU-only makes some sense. I would just get rid of
the ifdefs in oom_kill_process and provide an empty wake_oom_reaper for
!CONFIG_MMU.  The following on top of yours:
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4b0a5d8b92e1..56ff1ff18c0e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -537,6 +537,10 @@ static int __init oom_init(void)
 	return 0;
 }
 module_init(oom_init)
+#else
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+}
 #endif
 
 /**
@@ -648,9 +652,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-#ifdef CONFIG_MMU
 	bool can_oom_reap = true;
-#endif
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -743,7 +745,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (is_global_init(p))
 			continue;
-#ifdef CONFIG_MMU
 		if (unlikely(p->flags & PF_KTHREAD) ||
 		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
 			/*
@@ -754,15 +755,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			can_oom_reap = false;
 			continue;
 		}
-#endif
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
-#ifdef CONFIG_MMU
 	if (can_oom_reap)
 		wake_oom_reaper(mm);
-#endif
 
 	mmdrop(mm);
 	put_task_struct(victim);
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-17 20:00         ` Andrew Morton
@ 2015-12-18 11:54           ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2015-12-18 11:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Mel Gorman, Tetsuo Handa, David Rientjes,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Thu 17-12-15 12:00:04, Andrew Morton wrote:
> On Thu, 17 Dec 2015 11:55:11 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > On Thu, Dec 17, 2015 at 5:02 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > Oops. You are right. I will go with msleep_interruptible(100).
> > 
> > I don't think that's right.
> > 
> > If a signal happens, that loop is now (again) just busy-looping.
> 
> It's called only by a kernel thread so no signal_pending().

Yes that was the thinking.

> This relationship is a bit unobvious and fragile, but we do it in
> quite a few places.

I guess Linus is right and __set_task_state(current, TASK_IDLE) would be
better and easier to read.
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4b0a5d8b92e1..eed99506b891 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -472,8 +472,10 @@ static void oom_reap_vmas(struct mm_struct *mm)
 	int attempts = 0;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < 10 && !__oom_reap_vmas(mm))
-		msleep_interruptible(100);
+	while (attempts++ < 10 && !__oom_reap_vmas(mm)) {
+		__set_task_state(current, TASK_IDLE);
+		schedule_timeout(HZ/10);
+	}
 
 	/* Drop a reference taken by wake_oom_reaper */
 	mmdrop(mm);
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-17 13:02     ` Michal Hocko
@ 2015-12-18 12:10       ` Tetsuo Handa
  -1 siblings, 0 replies; 78+ messages in thread
From: Tetsuo Handa @ 2015-12-18 12:10 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: mgorman, rientjes, torvalds, oleg, hughd, andrea, riel, linux-mm,
	linux-kernel

Michal Hocko wrote:
> On Wed 16-12-15 16:50:35, Andrew Morton wrote:
> > On Tue, 15 Dec 2015 19:36:15 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> > > +static void oom_reap_vmas(struct mm_struct *mm)
> > > +{
> > > +	int attempts = 0;
> > > +
> > > +	while (attempts++ < 10 && !__oom_reap_vmas(mm))
> > > +		schedule_timeout(HZ/10);
> > 
> > schedule_timeout() in state TASK_RUNNING doesn't do anything.  Use
> > msleep() or msleep_interruptible().  I can't decide which is more
> > appropriate - it only affects the load average display.
> 
> Oops. You are right. I will go with msleep_interruptible(100).
>  

I didn't know that. So my testing effectively ran almost without
oom_reap_vmas() ever sleeping.

> > I guess it means that the __oom_reap_vmas() success rate is nice and
> > high ;)
> 
> I had debugging trace_printks around this and there were no retries
> during my testing so I was probably lucky to not trigger the mmap_sem
> contention.

Yes, you are lucky that you did not hit the mmap_sem contention.
I retested with

 static void oom_reap_vmas(struct mm_struct *mm)
 {
 	int attempts = 0;
 
 	while (attempts++ < 10 && !__oom_reap_vmas(mm))
-		schedule_timeout(HZ/10);
+		msleep_interruptible(100);
+	printk(KERN_WARNING "oom_reaper: attempts=%u\n", attempts);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	mmdrop(mm);
 }

and I can see attempts reach 11 (i.e. oom_reap_vmas() gives up
waiting) when I run a memory-stressing program with many contending
mmap_sem readers and writers, shown below.

----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mman.h>

static cpu_set_t set = { { 1 } }; /* Allow only CPU 0. */
static char filename[32] = { };

/* down_read(&mm->mmap_sem) requester. */
static int reader(void *unused)
{
	const int fd = open(filename, O_RDONLY);
	char buffer[128];
	sched_setaffinity(0, sizeof(set), &set);
	sleep(2);
	while (pread(fd, buffer, sizeof(buffer), 0) > 0);
	while (1)
		pause();
	return 0;
}

/* down_write(&mm->mmap_sem) requester. */
static int writer(void *unused)
{
	const int fd = open("/proc/self/exe", O_RDONLY);
	sched_setaffinity(0, sizeof(set), &set);
	sleep(2);
	while (1) {
		void *ptr = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
		munmap(ptr, 4096);
	}
	return 0;
}

static void my_clone(int (*func) (void *))
{
	char *stack = malloc(4096);
	if (stack)
		clone(func, stack + 4096,
		      CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
}

/* Memory consumer for invoking the OOM killer. */
static void memory_eater(void) {
	char *buf = NULL;
	unsigned long i;
	unsigned long size = 0;
	sleep(4);
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	fprintf(stderr, "Start eating memory\n");
	for (i = 0; i < size; i += 4096)
		buf[i] = '\0'; /* Will cause OOM due to overcommit */
}

int main(int argc, char *argv[])
{
	int i;
	const pid_t pid = fork();
	if (pid == 0) {
		for (i = 0; i < 9; i++)
			my_clone(writer);
		writer(NULL);
		_exit(0);
	} else if (pid > 0) {
		snprintf(filename, sizeof(filename), "/proc/%u/stat", pid);
		for (i = 0; i < 1000; i++)
			my_clone(reader);
	}
	memory_eater();
	return *(char *) NULL; /* Not reached. */
}
----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151218.txt.xz .
----------
[   90.790847] Killed process 9560 (oom_reaper-test) total-vm:4312kB, anon-rss:124kB, file-rss:0kB, shmem-rss:0kB
[   91.803154] oom_reaper: attempts=11
[  100.701494] MemAlloc-Info: 509 stalling task, 0 dying task, 1 victim task.
[  102.439082] Killed process 9559 (oom_reaper-test) total-vm:2170960kB, anon-rss:1564600kB, file-rss:0kB, shmem-rss:0kB
[  102.441937] Killed process 9561 (oom_reaper-test) total-vm:2170960kB, anon-rss:1564776kB, file-rss:0kB, shmem-rss:0kB
[  102.731326] oom_reaper: attempts=1
[  125.420727] Killed process 10573 (oom_reaper-test) total-vm:4340kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[  126.440392] oom_reaper: attempts=11
[  135.354193] MemAlloc-Info: 450 stalling task, 0 dying task, 0 victim task.
[  240.023256] MemAlloc-Info: 1016 stalling task, 0 dying task, 0 victim task.
[  302.246975] Killed process 10572 (oom_reaper-test) total-vm:2170960kB, anon-rss:1562128kB, file-rss:0kB, shmem-rss:0kB
[  302.263515] oom_reaper: attempts=1
[  382.961343] Killed process 11667 (oom_reaper-test) total-vm:4312kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  383.980541] oom_reaper: attempts=11
[  392.592658] MemAlloc-Info: 758 stalling task, 10 dying task, 1 victim task.
[  399.497478] Killed process 11666 (oom_reaper-test) total-vm:2170960kB, anon-rss:1556072kB, file-rss:0kB, shmem-rss:0kB
[  399.499101] Killed process 11668 (oom_reaper-test) total-vm:2170960kB, anon-rss:1556260kB, file-rss:0kB, shmem-rss:0kB
[  399.778283] oom_reaper: attempts=1
[  438.304082] Killed process 12680 (oom_reaper-test) total-vm:4324kB, anon-rss:120kB, file-rss:0kB, shmem-rss:0kB
[  439.318951] oom_reaper: attempts=11
[  445.581171] MemAlloc-Info: 796 stalling task, 0 dying task, 0 victim task.
[  618.955215] MemAlloc-Info: 979 stalling task, 0 dying task, 0 victim task.
----------

Yes, this is an insane program. But what is important is that we prepare for
cases where oom_reap_vmas() gave up waiting. A silent hang up is annoying.
Like Andrew said
( http://lkml.kernel.org/r/20151216153513.e432dc70e035e5d07984710c@linux-foundation.org ),
I want to add a watchdog for printk()ing.
( http://lkml.kernel.org/r/201512170011.IAC73451.FLtFMSJHOQFVOO@I-love.SAKURA.ne.jp ).

^ permalink raw reply	[flat|nested] 78+ messages in thread


* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-17 21:13       ` Andrew Morton
@ 2015-12-18 12:11         ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2015-12-18 12:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Tetsuo Handa, David Rientjes, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Thu 17-12-15 13:13:56, Andrew Morton wrote:
[...]
> Also, re-reading your description:
> 
> : It has been shown (e.g.  by Tetsuo Handa) that it is not that hard to
> : construct workloads which break the core assumption mentioned above and
> : the OOM victim might take unbounded amount of time to exit because it
> : might be blocked in the uninterruptible state waiting for on an event
> : (e.g.  lock) which is blocked by another task looping in the page
> : allocator.
> 
> So the allocating task has done an oom-kill and is waiting for memory
> to become available.  The killed task is stuck on some lock, unable to
> free memory.
> 
> But the problematic lock will sometimes be the killed tasks's mmap_sem,
> so the reaper won't reap anything.  This scenario requires that the
> mmap_sem is held for writing, which sounds like it will be uncommon. 

Yes, I have mentioned that in the changelog:
"
oom_reaper has to take mmap_sem on the target task for reading so the
solution is not 100% because the semaphore might be held or blocked for
write but the probability is reduced considerably wrt. basically any
lock blocking forward progress as described above.
"

Another thing to do is to change down_write(mmap_sem) to
down_write_killable in most cases where we have a clear EINTR semantic.
This is on my todo list.

> hm.  sigh.  I hate the oom-killer.  Just buy some more memory already!

Tell me something about that...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread


* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-18 11:54           ` Michal Hocko
@ 2015-12-18 21:14             ` Andrew Morton
  -1 siblings, 0 replies; 78+ messages in thread
From: Andrew Morton @ 2015-12-18 21:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linus Torvalds, Mel Gorman, Tetsuo Handa, David Rientjes,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Fri, 18 Dec 2015 12:54:55 +0100 Michal Hocko <mhocko@kernel.org> wrote:

>  	/* Retry the down_read_trylock(mmap_sem) a few times */
> -	while (attempts++ < 10 && !__oom_reap_vmas(mm))
> -		msleep_interruptible(100);
> +	while (attempts++ < 10 && !__oom_reap_vmas(mm)) {
> +		__set_task_state(current, TASK_IDLE);
> +		schedule_timeout(HZ/10);
> +	}

If you won't, I shall ;)


From: Andrew Morton <akpm@linux-foundation.org>
Subject: sched: add schedule_timeout_idle()

This will be needed in the patch "mm, oom: introduce oom reaper".

Cc: Michal Hocko <mhocko@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/sched.h |    1 +
 kernel/time/timer.c   |   11 +++++++++++
 2 files changed, 12 insertions(+)

diff -puN kernel/time/timer.c~sched-add-schedule_timeout_idle kernel/time/timer.c
--- a/kernel/time/timer.c~sched-add-schedule_timeout_idle
+++ a/kernel/time/timer.c
@@ -1566,6 +1566,17 @@ signed long __sched schedule_timeout_uni
 }
 EXPORT_SYMBOL(schedule_timeout_uninterruptible);
 
+/*
+ * Like schedule_timeout_uninterruptible(), except this task will not contribute
+ * to load average.
+ */
+signed long __sched schedule_timeout_idle(signed long timeout)
+{
+	__set_current_state(TASK_IDLE);
+	return schedule_timeout(timeout);
+}
+EXPORT_SYMBOL(schedule_timeout_idle);
+
 #ifdef CONFIG_HOTPLUG_CPU
 static void migrate_timer_list(struct tvec_base *new_base, struct hlist_head *head)
 {
diff -puN include/linux/sched.h~sched-add-schedule_timeout_idle include/linux/sched.h
--- a/include/linux/sched.h~sched-add-schedule_timeout_idle
+++ a/include/linux/sched.h
@@ -423,6 +423,7 @@ extern signed long schedule_timeout(sign
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
+extern signed long schedule_timeout_idle(signed long timeout);
 asmlinkage void schedule(void);
 extern void schedule_preempt_disabled(void);
 
_


^ permalink raw reply	[flat|nested] 78+ messages in thread


* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-18 12:10       ` Tetsuo Handa
@ 2015-12-20  7:14         ` Tetsuo Handa
  -1 siblings, 0 replies; 78+ messages in thread
From: Tetsuo Handa @ 2015-12-20  7:14 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: mgorman, rientjes, torvalds, oleg, hughd, andrea, riel, linux-mm,
	linux-kernel

Tetsuo Handa wrote:
> Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151218.txt.xz .
> ----------
> [  438.304082] Killed process 12680 (oom_reaper-test) total-vm:4324kB, anon-rss:120kB, file-rss:0kB, shmem-rss:0kB
> [  439.318951] oom_reaper: attempts=11
> [  445.581171] MemAlloc-Info: 796 stalling task, 0 dying task, 0 victim task.
> [  618.955215] MemAlloc-Info: 979 stalling task, 0 dying task, 0 victim task.
> ----------
> 
> Yes, this is an insane program. But what is important will be we prepare for
> cases when oom_reap_vmas() gave up waiting. Silent hang up is annoying.

s/gave up waiting/did not help/



I noticed yet another problem with this program.

The OOM victim (a child of memory hog process) received SIGKILL at
uptime = 438 and it terminated before uptime = 445 even though
oom_reap_vmas() gave up waiting at uptime = 439. However, the OOM killer
was not invoked again in order to kill the memory hog process before I
gave up waiting at uptime = 679. The OOM killer was needlessly kept
disabled for more than 234 seconds after the OOM victim terminated.

----------
[  438.180596] oom_reaper-test invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x26040c0(GFP_KERNEL|GFP_COMP|GFP_NOTRACK)
[  438.183524] oom_reaper-test cpuset=/ mems_allowed=0
[  438.185440] CPU: 0 PID: 13451 Comm: oom_reaper-test Not tainted 4.4.0-rc5-next-20151217+ #248
(...snipped...)
[  438.301687] Out of memory: Kill process 12679 (oom_reaper-test) score 876 or sacrifice child
[  438.304082] Killed process 12680 (oom_reaper-test) total-vm:4324kB, anon-rss:120kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[  568.185582] MemAlloc: oom_reaper-test(13451) seq=2 gfp=0x26040c0 order=0 delay=7403
[  568.187593] oom_reaper-test R  running task        0 13451   8130 0x00000080
[  568.189546]  ffff88007a4cb918 ffff88007a5c4140 ffff88007a4c0000 ffff88007a4cc000
[  568.191637]  ffff88007a4cb950 ffff88007fc10240 0000000100021bb8 ffffffff81c11730
[  568.193713]  ffff88007a4cb930 ffffffff816f5b77 ffff88007fc10240 ffff88007a4cb9d0
[  568.195798] Call Trace:
[  568.196878]  [<ffffffff816f5b77>] schedule+0x37/0x90
[  568.198402]  [<ffffffff816f9f37>] schedule_timeout+0x117/0x1c0
[  568.200078]  [<ffffffff810dfd00>] ? init_timer_key+0x40/0x40
[  568.201725]  [<ffffffff816fa034>] schedule_timeout_killable+0x24/0x30
[  568.203512]  [<ffffffff81142d49>] out_of_memory+0x1f9/0x5a0
[  568.205144]  [<ffffffff81142dfd>] ? out_of_memory+0x2ad/0x5a0
[  568.206800]  [<ffffffff811486f3>] __alloc_pages_nodemask+0xc43/0xc80
[  568.208564]  [<ffffffff8118f786>] alloc_pages_current+0x96/0x1b0
[  568.210259]  [<ffffffff81198177>] ? new_slab+0x357/0x470
[  568.211811]  [<ffffffff811981ee>] new_slab+0x3ce/0x470
[  568.213329]  [<ffffffff8119a41a>] ___slab_alloc+0x42a/0x5c0
[  568.214917]  [<ffffffff811e5ff5>] ? seq_buf_alloc+0x35/0x40
[  568.216486]  [<ffffffff811aa75d>] ? mem_cgroup_end_page_stat+0x2d/0xb0
[  568.218240]  [<ffffffff810ba2a9>] ? __lock_is_held+0x49/0x70
[  568.219815]  [<ffffffff811e5ff5>] ? seq_buf_alloc+0x35/0x40
[  568.221388]  [<ffffffff811bc45b>] __slab_alloc+0x4a/0x81
[  568.222980]  [<ffffffff811e5ff5>] ? seq_buf_alloc+0x35/0x40
[  568.224565]  [<ffffffff8119aab3>] __kmalloc+0x163/0x1b0
[  568.226075]  [<ffffffff811e5ff5>] seq_buf_alloc+0x35/0x40
[  568.227710]  [<ffffffff811e660b>] seq_read+0x31b/0x3c0
[  568.229184]  [<ffffffff811beaf2>] __vfs_read+0x32/0xf0
[  568.230673]  [<ffffffff81302339>] ? security_file_permission+0xa9/0xc0
[  568.232408]  [<ffffffff811bf49d>] ? rw_verify_area+0x4d/0xd0
[  568.234068]  [<ffffffff811bf59a>] vfs_read+0x7a/0x120
[  568.235561]  [<ffffffff811c0700>] SyS_pread64+0x90/0xb0
[  568.237062]  [<ffffffff816fb0f2>] entry_SYSCALL_64_fastpath+0x12/0x76
[  568.238766] 2 locks held by oom_reaper-test/13451:
[  568.240188]  #0:  (&p->lock){+.+.+.}, at: [<ffffffff811e6337>] seq_read+0x47/0x3c0
[  568.242303]  #1:  (oom_lock){+.+...}, at: [<ffffffff81148358>] __alloc_pages_nodemask+0x8a8/0xc80
(...snipped...)
[  658.711079] MemAlloc: oom_reaper-test(13451) seq=2 gfp=0x26040c0 order=0 delay=180777
[  658.713110] oom_reaper-test R  running task        0 13451   8130 0x00000080
[  658.715073]  ffff88007a4cb918 ffff88007a5c4140 ffff88007a4c0000 ffff88007a4cc000
[  658.717166]  ffff88007a4cb950 ffff88007fc10240 0000000100021bb8 ffffffff81c11730
[  658.719248]  ffff88007a4cb930 ffffffff816f5b77 ffff88007fc10240 ffff88007a4cb9d0
[  658.721345] Call Trace:
[  658.722426]  [<ffffffff816f5b77>] schedule+0x37/0x90
[  658.723950]  [<ffffffff816f9f37>] schedule_timeout+0x117/0x1c0
[  658.725636]  [<ffffffff810dfd00>] ? init_timer_key+0x40/0x40
[  658.727304]  [<ffffffff816fa034>] schedule_timeout_killable+0x24/0x30
[  658.729113]  [<ffffffff81142d49>] out_of_memory+0x1f9/0x5a0
[  658.730757]  [<ffffffff81142dfd>] ? out_of_memory+0x2ad/0x5a0
[  658.732444]  [<ffffffff811486f3>] __alloc_pages_nodemask+0xc43/0xc80
[  658.734314]  [<ffffffff8118f786>] alloc_pages_current+0x96/0x1b0
[  658.736041]  [<ffffffff81198177>] ? new_slab+0x357/0x470
[  658.737643]  [<ffffffff811981ee>] new_slab+0x3ce/0x470
[  658.739239]  [<ffffffff8119a41a>] ___slab_alloc+0x42a/0x5c0
[  658.740899]  [<ffffffff811e5ff5>] ? seq_buf_alloc+0x35/0x40
[  658.742617]  [<ffffffff811aa75d>] ? mem_cgroup_end_page_stat+0x2d/0xb0
[  658.744489]  [<ffffffff810ba2a9>] ? __lock_is_held+0x49/0x70
[  658.746157]  [<ffffffff811e5ff5>] ? seq_buf_alloc+0x35/0x40
[  658.747801]  [<ffffffff811bc45b>] __slab_alloc+0x4a/0x81
[  658.749392]  [<ffffffff811e5ff5>] ? seq_buf_alloc+0x35/0x40
[  658.751021]  [<ffffffff8119aab3>] __kmalloc+0x163/0x1b0
[  658.752562]  [<ffffffff811e5ff5>] seq_buf_alloc+0x35/0x40
[  658.754145]  [<ffffffff811e660b>] seq_read+0x31b/0x3c0
[  658.755678]  [<ffffffff811beaf2>] __vfs_read+0x32/0xf0
[  658.757194]  [<ffffffff81302339>] ? security_file_permission+0xa9/0xc0
[  658.758959]  [<ffffffff811bf49d>] ? rw_verify_area+0x4d/0xd0
[  658.760571]  [<ffffffff811bf59a>] vfs_read+0x7a/0x120
[  658.762746]  [<ffffffff811c0700>] SyS_pread64+0x90/0xb0
[  658.770489]  [<ffffffff816fb0f2>] entry_SYSCALL_64_fastpath+0x12/0x76
[  658.772641] 2 locks held by oom_reaper-test/13451:
[  658.774321]  #0:  (&p->lock){+.+.+.}, at: [<ffffffff811e6337>] seq_read+0x47/0x3c0
[  658.776456]  #1:  (oom_lock){+.+...}, at: [<ffffffff81148358>] __alloc_pages_nodemask+0x8a8/0xc80
(...snipped...)
[  679.648918] sysrq: SysRq : Kill All Tasks
----------

Looking at the traces, the process which invoked the OOM killer kept
the oom_lock mutex held because it had been sleeping in
schedule_timeout_killable(1) in out_of_memory(), which was meant to wait
for only one jiffy but actually waited for more than 234 seconds.

        if (p && p != (void *)-1UL) {
                oom_kill_process(oc, p, points, totalpages, NULL,
                                 "Out of memory");
                /*
                 * Give the killed process a good chance to exit before trying
                 * to allocate memory again.
                 */
                schedule_timeout_killable(1);
        }
        return true;

During that period, nobody was able to call out_of_memory() because
everybody assumed that the process which invoked the OOM killer is
making progress for us.

        /*
         * Acquire the oom lock.  If that fails, somebody else is
         * making progress for us.
         */
        if (!mutex_trylock(&oom_lock)) {
                *did_some_progress = 1;
                schedule_timeout_uninterruptible(1);
                return NULL;
        }

The side effect is not limited to not choosing the next OOM victim.
SIGKILL but !TIF_MEMDIE tasks (possibly tasks sharing OOM victim's mm)
cannot use ALLOC_NO_WATERMARKS until they can arrive at out_of_memory().
Assumptions like

        /*
         * If current has a pending SIGKILL or is exiting, then automatically
         * select it.  The goal is to allow it to allocate so that it may
         * quickly exit and free its memory.
         *
         * But don't select if current has already released its mm and cleared
         * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
         */
        if (current->mm &&
            (fatal_signal_pending(current) || task_will_free_mem(current))) {
                mark_oom_victim(current);
                return true;
        }

in out_of_memory() and

        /*
         * Kill all user processes sharing victim->mm in other thread groups, if
         * any.  They don't get access to memory reserves, though, to avoid
         * depletion of all memory.  This prevents mm->mmap_sem livelock when an
         * oom killed thread cannot exit because it requires the semaphore and
         * its contended by another thread trying to allocate memory itself.
         * That thread will now get access to memory reserves since it has a
         * pending fatal signal.
         */

in oom_kill_process() cannot work. (Yes, we know the
fatal_signal_pending(current) check in out_of_memory() is wrong.
http://lkml.kernel.org/r/20151002135201.GA28533@redhat.com )

I think we might want to make sure that the oom_lock mutex is released within
a reasonable period after the OOM killer kills a victim. Maybe change the code
not to depend on TIF_MEMDIE for using memory reserves. Maybe replace the whole
operation between mutex_trylock(&oom_lock) and mutex_unlock(&oom_lock) with
request_oom_killer() (like request_module() does) and let a kernel thread do
the OOM kill operation (oom_reaper() could do it?), for that will make it easy
to wait for a short period after killing the victim, without worrying about a
huge unexpected delay caused by low scheduling priority / limited available CPUs.

^ permalink raw reply	[flat|nested] 78+ messages in thread

[  658.752562]  [<ffffffff811e5ff5>] seq_buf_alloc+0x35/0x40
[  658.754145]  [<ffffffff811e660b>] seq_read+0x31b/0x3c0
[  658.755678]  [<ffffffff811beaf2>] __vfs_read+0x32/0xf0
[  658.757194]  [<ffffffff81302339>] ? security_file_permission+0xa9/0xc0
[  658.758959]  [<ffffffff811bf49d>] ? rw_verify_area+0x4d/0xd0
[  658.760571]  [<ffffffff811bf59a>] vfs_read+0x7a/0x120
[  658.762746]  [<ffffffff811c0700>] SyS_pread64+0x90/0xb0
[  658.770489]  [<ffffffff816fb0f2>] entry_SYSCALL_64_fastpath+0x12/0x76
[  658.772641] 2 locks held by oom_reaper-test/13451:
[  658.774321]  #0:  (&p->lock){+.+.+.}, at: [<ffffffff811e6337>] seq_read+0x47/0x3c0
[  658.776456]  #1:  (oom_lock){+.+...}, at: [<ffffffff81148358>] __alloc_pages_nodemask+0x8a8/0xc80
(...snipped...)
[  679.648918] sysrq: SysRq : Kill All Tasks
----------

Looking at the traces, the process which invoked the OOM killer kept
the oom_lock mutex held because it was sleeping in
schedule_timeout_killable(1) in out_of_memory(); that call is meant to
wait for only one jiffy, but it actually waited for more than 234 seconds.

        if (p && p != (void *)-1UL) {
                oom_kill_process(oc, p, points, totalpages, NULL,
                                 "Out of memory");
                /*
                 * Give the killed process a good chance to exit before trying
                 * to allocate memory again.
                 */
                schedule_timeout_killable(1);
        }
        return true;

During that period, nobody else was able to call out_of_memory() because
everybody assumed that the process which invoked the OOM killer was
making progress on their behalf.

        /*
         * Acquire the oom lock.  If that fails, somebody else is
         * making progress for us.
         */
        if (!mutex_trylock(&oom_lock)) {
                *did_some_progress = 1;
                schedule_timeout_uninterruptible(1);
                return NULL;
        }

The side effect is not limited to failing to choose the next OOM victim.
Tasks which have SIGKILL pending but are not TIF_MEMDIE (possibly tasks
sharing the OOM victim's mm) cannot use ALLOC_NO_WATERMARKS until they
manage to reach out_of_memory().
Assumptions like

        /*
         * If current has a pending SIGKILL or is exiting, then automatically
         * select it.  The goal is to allow it to allocate so that it may
         * quickly exit and free its memory.
         *
         * But don't select if current has already released its mm and cleared
         * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
         */
        if (current->mm &&
            (fatal_signal_pending(current) || task_will_free_mem(current))) {
                mark_oom_victim(current);
                return true;
        }

in out_of_memory() and

        /*
         * Kill all user processes sharing victim->mm in other thread groups, if
         * any.  They don't get access to memory reserves, though, to avoid
         * depletion of all memory.  This prevents mm->mmap_sem livelock when an
         * oom killed thread cannot exit because it requires the semaphore and
         * its contended by another thread trying to allocate memory itself.
         * That thread will now get access to memory reserves since it has a
         * pending fatal signal.
         */

in oom_kill_process() cannot hold. (Yes, we know the
fatal_signal_pending(current) check in out_of_memory() is wrong:
http://lkml.kernel.org/r/20151002135201.GA28533@redhat.com )

I think we might want to make sure that the oom_lock mutex is released
within a reasonable period after the OOM killer kills a victim. Maybe we
should stop depending on TIF_MEMDIE for access to memory reserves. Maybe
we should replace the whole operation between mutex_trylock(&oom_lock) and
mutex_unlock(&oom_lock) with a request_oom_killer() call (analogous to
request_module()) and let a kernel thread do the OOM kill operation
(could oom_reaper() do it?). That would make it easy to wait for a short
period after killing the victim, without worrying about a huge unexpected
delay caused by low scheduling priority or a limited number of available CPUs.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-18 21:14             ` Andrew Morton
@ 2015-12-21  8:38               ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2015-12-21  8:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Mel Gorman, Tetsuo Handa, David Rientjes,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Fri 18-12-15 13:14:00, Andrew Morton wrote:
> On Fri, 18 Dec 2015 12:54:55 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> 
> >  	/* Retry the down_read_trylock(mmap_sem) a few times */
> > -	while (attempts++ < 10 && !__oom_reap_vmas(mm))
> > -		msleep_interruptible(100);
> > +	while (attempts++ < 10 && !__oom_reap_vmas(mm)) {
> > +		__set_task_state(current, TASK_IDLE);
> > +		schedule_timeout(HZ/10);
> > +	}
> 
> If you won't, I shall ;)
> 
> 
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: sched: add schedule_timeout_idle()
> 
> This will be needed in the patch "mm, oom: introduce oom reaper".
> 
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Thanks! This makes more sense.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
> 
>  include/linux/sched.h |    1 +
>  kernel/time/timer.c   |   11 +++++++++++
>  2 files changed, 12 insertions(+)
> 
> diff -puN kernel/time/timer.c~sched-add-schedule_timeout_idle kernel/time/timer.c
> --- a/kernel/time/timer.c~sched-add-schedule_timeout_idle
> +++ a/kernel/time/timer.c
> @@ -1566,6 +1566,17 @@ signed long __sched schedule_timeout_uni
>  }
>  EXPORT_SYMBOL(schedule_timeout_uninterruptible);
>  
> +/*
> + * Like schedule_timeout_uninterruptible(), except this task will not contribute
> + * to load average.
> + */
> +signed long __sched schedule_timeout_idle(signed long timeout)
> +{
> +	__set_current_state(TASK_IDLE);
> +	return schedule_timeout(timeout);
> +}
> +EXPORT_SYMBOL(schedule_timeout_idle);
> +
>  #ifdef CONFIG_HOTPLUG_CPU
>  static void migrate_timer_list(struct tvec_base *new_base, struct hlist_head *head)
>  {
> diff -puN include/linux/sched.h~sched-add-schedule_timeout_idle include/linux/sched.h
> --- a/include/linux/sched.h~sched-add-schedule_timeout_idle
> +++ a/include/linux/sched.h
> @@ -423,6 +423,7 @@ extern signed long schedule_timeout(sign
>  extern signed long schedule_timeout_interruptible(signed long timeout);
>  extern signed long schedule_timeout_killable(signed long timeout);
>  extern signed long schedule_timeout_uninterruptible(signed long timeout);
> +extern signed long schedule_timeout_idle(signed long timeout);
>  asmlinkage void schedule(void);
>  extern void schedule_preempt_disabled(void);
>  
> _

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread


* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-15 18:36 ` Michal Hocko
@ 2015-12-21 20:38   ` Paul Gortmaker
  -1 siblings, 0 replies; 78+ messages in thread
From: Paul Gortmaker @ 2015-12-21 20:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, David Rientjes,
	Linus Torvalds, Oleg Nesterov, Hugh Dickins, Andrea Argangeli,
	Rik van Riel, linux-mm, LKML, Michal Hocko

On Tue, Dec 15, 2015 at 1:36 PM, Michal Hocko <mhocko@kernel.org> wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> independently brought up by Oleg Nesterov.
>

[...]

Since this is built-in always, can we....

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5314b206caa5..48025a21f8c4 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -35,6 +35,11 @@
>  #include <linux/freezer.h>
>  #include <linux/ftrace.h>
>  #include <linux/ratelimit.h>
> +#include <linux/kthread.h>
> +#include <linux/module.h>

...use <linux/init.h> instead above, and then...

> +
> +#include <asm/tlb.h>
> +#include "internal.h"
>

[...]

> +                * Make sure our oom reaper thread will get scheduled when
> +                * ASAP and that it won't get preempted by malicious userspace.
> +                */
> +               sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
> +       }
> +       return 0;
> +}
> +module_init(oom_init)

...use one of the non-modular initcalls here?   I'm trying to clean up most of
the non-modular uses of modular macros etc. since:

 (1) it is easy to accidentally code up an unused module_exit function
 (2) it can be misleading when reading the source, thinking it can be
      modular when the Makefile and/or Kconfig prohibit it
 (3) it requires the include of the module.h header file which in turn
     includes nearly everything else, thus increasing CPP overhead.

I figured there was no point in sending a follow-on patch since this came
in via the akpm tree into next, and that gets rebased/updated regularly.

Thanks,
Paul.
--

^ permalink raw reply	[flat|nested] 78+ messages in thread


* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-15 18:36 ` Michal Hocko
@ 2015-12-23 23:00   ` Ross Zwisler
  -1 siblings, 0 replies; 78+ messages in thread
From: Ross Zwisler @ 2015-12-23 23:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, David Rientjes,
	Linus Torvalds, Oleg Nesterov, Hugh Dickins, Andrea Argangeli,
	Rik van Riel, linux-mm, LKML, Michal Hocko, Ross Zwisler

On Tue, Dec 15, 2015 at 11:36 AM, Michal Hocko <mhocko@kernel.org> wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> independently brought up by Oleg Nesterov.
>
> The OOM killer currently allows killing only a single task, in the hope
> that the task will terminate in a reasonable time and free up its
> memory.  Such a task (the OOM victim) will get access to memory reserves
> via mark_oom_victim to allow forward progress should there be a need
> for additional memory during the exit path.
>
> It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
> construct workloads which break the core assumption mentioned above and
> the OOM victim might take an unbounded amount of time to exit because it
> might be blocked in the uninterruptible state waiting on an event
> (e.g. a lock) held by another task looping in the page
> allocator.
>
> This patch reduces the probability of such a lockup by introducing a
> specialized kernel thread (oom_reaper) which tries to reclaim additional
> memory by preemptively reaping the anonymous or swapped out memory
> owned by the oom victim, under the assumption that such memory won't
> be needed once its owner is killed and kicked out of userspace anyway.
> There is one notable exception to this, though: if the OOM victim was
> in the process of coredumping, the result would be incomplete. This is
> considered a reasonable constraint because the overall system health is
> more important than debuggability of a particular application.
>
> A kernel thread has been chosen because we need a reliable way of
> invocation so workqueue context is not appropriate because all the
> workers might be busy (e.g. allocating memory). Kswapd which sounds
> like another good fit is not appropriate as well because it might get
> blocked on locks during reclaim as well.
>
> oom_reaper has to take mmap_sem on the target task for reading so the
> solution is not 100% because the semaphore might be held or blocked for
> write but the probability is reduced considerably wrt. basically any
> lock blocking forward progress as described above. In order to prevent
> blocking on the lock without any forward progress, we use only
> a trylock and retry 10 times with a short sleep in between.
> Users of mmap_sem which need it for write should be carefully reviewed
> to use _killable waiting as much as possible and reduce allocation
> requests done with the lock held to the absolute minimum to reduce the risk
> even further.
>
> The API between oom killer and oom reaper is quite trivial. wake_oom_reaper
> updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition
> and oom_reaper clears this atomically once it is done with the work. This
> means that only a single mm_struct can be reaped at the time. As the
> operation is potentially disruptive we are trying to limit it to the
> necessary minimum and the reaper blocks any updates while it operates on
> an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a
> race is detected by atomic_inc_not_zero(mm_users).
>
> Changes since v2
> - fix mm_count reference leak reported by Tetsuo
> - make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
> - use wait_event_freezable rather than open coded wait loop - suggested
>   by Tetsuo
> Changes since v1
> - fix the screwed up detail->check_swap_entries - Johannes
> - do not use kthread_should_stop because that would need a cleanup
>   and we do not have anybody to stop us - Tetsuo
> - move wake_oom_reaper to oom_kill_process because we have to wait
>   for all tasks sharing the same mm to get killed - Tetsuo
> - do not reap mm structs which are shared with unkillable tasks - Tetsuo
>
> Acked-by: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

While running xfstests on next-20151223 I hit a pair of kernel BUGs
that bisected to this commit:

1eb3a80d8239 ("mm, oom: introduce oom reaper")

Here is a BUG produced by generic/029 when run against XFS:

[  235.751723] ------------[ cut here ]------------
[  235.752194] kernel BUG at mm/filemap.c:208!
[  235.752595] invalid opcode: 0000 [#1] SMP
[  235.753036] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
[  235.753681] CPU: 3 PID: 17586 Comm: xfs_io Not tainted 4.4.0-rc6-next-20151223_new_fsync_v6+ #8
[  235.754535] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
[  235.755451] task: ffff88040bde19c0 ti: ffff8800bab80000 task.ti: ffff8800bab80000
[  235.756202] RIP: 0010:[<ffffffff811c81f6>]  [<ffffffff811c81f6>] __delete_from_page_cache+0x206/0x440
[  235.757151] RSP: 0018:ffff8800bab83b60  EFLAGS: 00010082
[  235.757679] RAX: 0000000000000021 RBX: ffffea0007d37e00 RCX: 0000000000000006
[  235.758360] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8804117ce380
[  235.759043] RBP: ffff8800bab83bb8 R08: 0000000000000001 R09: 0000000000000001
[  235.759749] R10: 00000000ffffffff R11: 0000000000028dc0 R12: ffff8800b1e7db00
[  235.760444] R13: ffff8800b1e7daf8 R14: 0000000000000000 R15: 0000000000000003
[  235.761122] FS:  00007f65dd009700(0000) GS:ffff880411600000(0000) knlGS:0000000000000000
[  235.761888] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  235.762432] CR2: 00007f65dd01f000 CR3: 00000000ba9d7000 CR4: 00000000000406e0
[  235.763150] Stack:
[  235.763347]  ffff88040cf1a800 0000000000000001 0000000000000001 ffff8800ae2a3b50
[  235.764123]  ffff8800ae2a3b80 00000000b4ca5e1a ffffea0007d37e00 ffff8800b1e7db10
[  235.764900]  ffff88040cf1a800 0000000000000000 0000000000000292 ffff8800bab83bf0
[  235.765638] Call Trace:
[  235.765903]  [<ffffffff811c8493>] delete_from_page_cache+0x63/0xd0
[  235.766513]  [<ffffffff811dc3e5>] truncate_inode_page+0xa5/0x120
[  235.767088]  [<ffffffff811dc648>] truncate_inode_pages_range+0x1a8/0x7f0
[  235.767725]  [<ffffffff81021459>] ? sched_clock+0x9/0x10
[  235.768239]  [<ffffffff810db37c>] ? local_clock+0x1c/0x20
[  235.768779]  [<ffffffff811feba4>] ? unmap_mapping_range+0x64/0x130
[  235.769385]  [<ffffffff811febb4>] ? unmap_mapping_range+0x74/0x130
[  235.770010]  [<ffffffff810f5c3f>] ? up_write+0x1f/0x40
[  235.770501]  [<ffffffff811febb4>] ? unmap_mapping_range+0x74/0x130
[  235.771092]  [<ffffffff811dcd58>] truncate_pagecache+0x48/0x70
[  235.771646]  [<ffffffff811dcdb2>] truncate_setsize+0x32/0x40
[  235.772276]  [<ffffffff8148e972>] xfs_setattr_size+0x232/0x470
[  235.772839]  [<ffffffff8148ec64>] xfs_vn_setattr+0xb4/0xc0
[  235.773369]  [<ffffffff8127af87>] notify_change+0x237/0x350
[  235.773945]  [<ffffffff81257c87>] do_truncate+0x77/0xc0
[  235.774446]  [<ffffffff8125800f>] do_sys_ftruncate.constprop.15+0xef/0x150
[  235.775156]  [<ffffffff812580ae>] SyS_ftruncate+0xe/0x10
[  235.775650]  [<ffffffff81a527b2>] entry_SYSCALL_64_fastpath+0x12/0x76
[  235.776257] Code: 5f 5d c3 48 8b 43 20 48 8d 78 ff a8 01 48 0f 44 fb 8b 47 48 85 c0 0f 88 2b 01 00 00 48 c7 c6 a8 57 f0 81 48 89 df e8 fa 1a 03 00 <0f> 0b 4c 89 ce 44 89 fa 4c 89 e7 4c 89 45 b0 4c 89 4d b8 e8 32
[  235.778695] RIP  [<ffffffff811c81f6>] __delete_from_page_cache+0x206/0x440
[  235.779350]  RSP <ffff8800bab83b60>
[  235.779694] ---[ end trace fac9dd65c4cdd828 ]---

And a different BUG produced by generic/095, also with XFS:

[  609.398897] ------------[ cut here ]------------
[  609.399843] kernel BUG at mm/truncate.c:629!
[  609.400666] invalid opcode: 0000 [#1] SMP
[  609.401512] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
[  609.402719] CPU: 4 PID: 26782 Comm: fio Tainted: G        W 4.4.0-rc6-next-20151223+ #1
[  609.404267] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
[  609.405851] task: ffff8801e52119c0 ti: ffff8801f6540000 task.ti: ffff8801f6540000
[  609.407272] RIP: 0010:[<ffffffff811dc0ab>]  [<ffffffff811dc0ab>] invalidate_inode_pages2_range+0x30b/0x550
[  609.409111] RSP: 0018:ffff8801f6543c88  EFLAGS: 00010202
[  609.410105] RAX: 0000000000000001 RBX: 0000000000000061 RCX: ffff88041180e440
[  609.411417] RDX: ffffffff00000001 RSI: 0000000000000000 RDI: 0000000000000286
[  609.412737] RBP: ffff8801f6543dd0 R08: 0000000000000008 R09: 0000000000000001
[  609.414069] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8801f6dfb438
[  609.415388] R13: ffffffffffffffff R14: 000000000000000b R15: ffffea0002877c80
[  609.416681] FS:  00007f48e13ed740(0000) GS:ffff880411800000(0000) knlGS:0000000000000000
[  609.418190] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  609.419261] CR2: 00000000012e0000 CR3: 000000040f24c000 CR4: 00000000000406e0
[  609.420577] Stack:
[  609.420968]  0000000000000292 ffff8800ba6f7800 ffff8801f6dfb450 0000000000000000
[  609.422423]  0000000000000001 0000000000000056 0000000000000057 0000000000000058
[  609.423878]  0000000000000059 000000000000005a 000000000000005b 000000000000005c
[  609.425325] Call Trace:
[  609.425797]  [<ffffffff811dc307>] invalidate_inode_pages2+0x17/0x20
[  609.426971]  [<ffffffff81482167>] xfs_file_read_iter+0x297/0x300
[  609.428097]  [<ffffffff81259ac9>] __vfs_read+0xc9/0x100
[  609.429073]  [<ffffffff8125a319>] vfs_read+0x89/0x130
[  609.430010]  [<ffffffff8125b418>] SyS_read+0x58/0xd0
[  609.430943]  [<ffffffff81a527b2>] entry_SYSCALL_64_fastpath+0x12/0x76
[  609.432139] Code: 85 d8 fe ff ff 01 00 00 00 f6 c4 40 0f 84 59 ff ff ff 49 8b 47 20 48 8d 78 ff a8 01 49 0f 44 ff 8b 47 48 85 c0 0f 88 bd 01 00 00 <0f> 0b 4d 3b 67 08 0f 85 70 ff ff ff 49 f7 07 00 18 00 00 74 15
[  609.436956] RIP  [<ffffffff811dc0ab>] invalidate_inode_pages2_range+0x30b/0x550
[  609.438373]  RSP <ffff8801f6543c88>
[  609.439080] ---[ end trace 10616a16523ccb2c ]---

Both tests fail 100% of the time with the above signatures when the
"oom reaper" commit is applied, and succeed 100% of the time with its
parent commit.

My test setup is a qemu guest machine with a pair of 4 GiB PMEM
ramdisk test devices, one for the xfstest test disk and one for the
scratch disk.

Please let me know if you have trouble reproducing this.  I'm also
happy to test fixes.

Thanks,
- Ross

^ permalink raw reply	[flat|nested] 78+ messages in thread

[  235.771092]  [<ffffffff811dcd58>] truncate_pagecache+0x48/0x70
[  235.771646]  [<ffffffff811dcdb2>] truncate_setsize+0x32/0x40
[  235.772276]  [<ffffffff8148e972>] xfs_setattr_size+0x232/0x470
[  235.772839]  [<ffffffff8148ec64>] xfs_vn_setattr+0xb4/0xc0
[  235.773369]  [<ffffffff8127af87>] notify_change+0x237/0x350
[  235.773945]  [<ffffffff81257c87>] do_truncate+0x77/0xc0
[  235.774446]  [<ffffffff8125800f>] do_sys_ftruncate.constprop.15+0xef/0x150
[  235.775156]  [<ffffffff812580ae>] SyS_ftruncate+0xe/0x10
[  235.775650]  [<ffffffff81a527b2>] entry_SYSCALL_64_fastpath+0x12/0x76
[  235.776257] Code: 5f 5d c3 48 8b 43 20 48 8d 78 ff a8 01 48 0f 44
fb 8b 47 48 85 c0 0f 88 2b 01 00 00 48 c7 c6 a8 57 f0 81 48 89 df e8
fa 1a 03 00 <0f> 0b 4c 89 ce 44 89 fa 4c 89 e7 4c 89 45 b0 4c 89 4d b8
e8 32
[  235.778695] RIP  [<ffffffff811c81f6>] __delete_from_page_cache+0x206/0x440
[  235.779350]  RSP <ffff8800bab83b60>
[  235.779694] ---[ end trace fac9dd65c4cdd828 ]---

And a different BUG produced by generic/095, also with XFS:

[  609.398897] ------------[ cut here ]------------
[  609.399843] kernel BUG at mm/truncate.c:629!
[  609.400666] invalid opcode: 0000 [#1] SMP
[  609.401512] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
[  609.402719] CPU: 4 PID: 26782 Comm: fio Tainted: G        W
4.4.0-rc6-next-20151223+ #1
[  609.404267] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.7.5-20140709_153950- 04/01/2014
[  609.405851] task: ffff8801e52119c0 ti: ffff8801f6540000 task.ti:
ffff8801f6540000
[  609.407272] RIP: 0010:[<ffffffff811dc0ab>]  [<ffffffff811dc0ab>]
invalidate_inode_pages2_range+0x30b/0x550
[  609.409111] RSP: 0018:ffff8801f6543c88  EFLAGS: 00010202
[  609.410105] RAX: 0000000000000001 RBX: 0000000000000061 RCX: ffff88041180e440
[  609.411417] RDX: ffffffff00000001 RSI: 0000000000000000 RDI: 0000000000000286
[  609.412737] RBP: ffff8801f6543dd0 R08: 0000000000000008 R09: 0000000000000001
[  609.414069] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8801f6dfb438
[  609.415388] R13: ffffffffffffffff R14: 000000000000000b R15: ffffea0002877c80
[  609.416681] FS:  00007f48e13ed740(0000) GS:ffff880411800000(0000)
knlGS:0000000000000000
[  609.418190] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  609.419261] CR2: 00000000012e0000 CR3: 000000040f24c000 CR4: 00000000000406e0
[  609.420577] Stack:
[  609.420968]  0000000000000292 ffff8800ba6f7800 ffff8801f6dfb450
0000000000000000
[  609.422423]  0000000000000001 0000000000000056 0000000000000057
0000000000000058
[  609.423878]  0000000000000059 000000000000005a 000000000000005b
000000000000005c
[  609.425325] Call Trace:
[  609.425797]  [<ffffffff811dc307>] invalidate_inode_pages2+0x17/0x20
[  609.426971]  [<ffffffff81482167>] xfs_file_read_iter+0x297/0x300
[  609.428097]  [<ffffffff81259ac9>] __vfs_read+0xc9/0x100
[  609.429073]  [<ffffffff8125a319>] vfs_read+0x89/0x130
[  609.430010]  [<ffffffff8125b418>] SyS_read+0x58/0xd0
[  609.430943]  [<ffffffff81a527b2>] entry_SYSCALL_64_fastpath+0x12/0x76
[  609.432139] Code: 85 d8 fe ff ff 01 00 00 00 f6 c4 40 0f 84 59 ff
ff ff 49 8b 47 20 48 8d 78 ff a8 01 49 0f 44 ff 8b 47 48 85 c0 0f 88
bd 01 00 00 <0f> 0b 4d 3b 67 08 0f 85 70 ff ff ff 49 f7 07 00 18 00 00
74 15
[  609.436956] RIP  [<ffffffff811dc0ab>]
invalidate_inode_pages2_range+0x30b/0x550
[  609.438373]  RSP <ffff8801f6543c88>
[  609.439080] ---[ end trace 10616a16523ccb2c ]---

They both fail 100% of the time with the above signatures with the
"oom reaper" commit, and succeed 100% of the time with the parent
commit.

My test setup is a qemu guest machine with a pair of 4 GiB PMEM
ramdisk test devices, one for the xfstest test disk and one for the
scratch disk.

Please let me know if you have trouble reproducing this.  I'm also
happy to test fixes.

Thanks,
- Ross

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-23 23:00   ` Ross Zwisler
@ 2015-12-24  9:47     ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2015-12-24  9:47 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, David Rientjes,
	Linus Torvalds, Oleg Nesterov, Hugh Dickins, Andrea Arcangeli,
	Rik van Riel, linux-mm, LKML, Ross Zwisler

On Wed 23-12-15 16:00:09, Ross Zwisler wrote:
[...]
> While running xfstests on next-20151223 I hit a pair of kernel BUGs
> that bisected to this commit:
> 
> 1eb3a80d8239 ("mm, oom: introduce oom reaper")

Thank you for the report and the bisection.

> Here is a BUG produced by generic/029 when run against XFS:
> 
> [  235.751723] ------------[ cut here ]------------
> [  235.752194] kernel BUG at mm/filemap.c:208!

This is VM_BUG_ON_PAGE(page_mapped(page), page), right? Could you attach
the full kernel log? It all smells like a race when OOM reaper tears
down the mapping and there is a truncate still in progress. But hitting
the BUG_ON just because of that doesn't make much sense to me. OOM
reaper is essentially MADV_DONTNEED. I have to think about this some
more, though, but I am in a holiday mode until early next year so please
bear with me.

[...]
> [  235.765638] Call Trace:
> [  235.765903]  [<ffffffff811c8493>] delete_from_page_cache+0x63/0xd0
> [  235.766513]  [<ffffffff811dc3e5>] truncate_inode_page+0xa5/0x120
> [  235.767088]  [<ffffffff811dc648>] truncate_inode_pages_range+0x1a8/0x7f0
> [  235.767725]  [<ffffffff81021459>] ? sched_clock+0x9/0x10
> [  235.768239]  [<ffffffff810db37c>] ? local_clock+0x1c/0x20
> [  235.768779]  [<ffffffff811feba4>] ? unmap_mapping_range+0x64/0x130
> [  235.769385]  [<ffffffff811febb4>] ? unmap_mapping_range+0x74/0x130
> [  235.770010]  [<ffffffff810f5c3f>] ? up_write+0x1f/0x40
> [  235.770501]  [<ffffffff811febb4>] ? unmap_mapping_range+0x74/0x130
> [  235.771092]  [<ffffffff811dcd58>] truncate_pagecache+0x48/0x70
> [  235.771646]  [<ffffffff811dcdb2>] truncate_setsize+0x32/0x40
> [  235.772276]  [<ffffffff8148e972>] xfs_setattr_size+0x232/0x470
> [  235.772839]  [<ffffffff8148ec64>] xfs_vn_setattr+0xb4/0xc0
> [  235.773369]  [<ffffffff8127af87>] notify_change+0x237/0x350
> [  235.773945]  [<ffffffff81257c87>] do_truncate+0x77/0xc0
> [  235.774446]  [<ffffffff8125800f>] do_sys_ftruncate.constprop.15+0xef/0x150
> [  235.775156]  [<ffffffff812580ae>] SyS_ftruncate+0xe/0x10
> [  235.775650]  [<ffffffff81a527b2>] entry_SYSCALL_64_fastpath+0x12/0x76
> [  235.776257] Code: 5f 5d c3 48 8b 43 20 48 8d 78 ff a8 01 48 0f 44
> fb 8b 47 48 85 c0 0f 88 2b 01 00 00 48 c7 c6 a8 57 f0 81 48 89 df e8
> fa 1a 03 00 <0f> 0b 4c 89 ce 44 89 fa 4c 89 e7 4c 89 45 b0 4c 89 4d b8
> e8 32
> [  235.778695] RIP  [<ffffffff811c81f6>] __delete_from_page_cache+0x206/0x440
> [  235.779350]  RSP <ffff8800bab83b60>
> [  235.779694] ---[ end trace fac9dd65c4cdd828 ]---
> 
> And a different BUG produced by generic/095, also with XFS:
> 
> [  609.398897] ------------[ cut here ]------------
> [  609.399843] kernel BUG at mm/truncate.c:629!

Hmm, I do not see any BUG_ON at this line. But there is
BUG_ON(page_mapped(page)) at line 620.

> [  609.400666] invalid opcode: 0000 [#1] SMP
> [  609.401512] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
> [  609.402719] CPU: 4 PID: 26782 Comm: fio Tainted: G        W

There was a warning before this triggered. The full kernel log would be
helpful as well.

[...]
> [  609.425325] Call Trace:
> [  609.425797]  [<ffffffff811dc307>] invalidate_inode_pages2+0x17/0x20
> [  609.426971]  [<ffffffff81482167>] xfs_file_read_iter+0x297/0x300
> [  609.428097]  [<ffffffff81259ac9>] __vfs_read+0xc9/0x100
> [  609.429073]  [<ffffffff8125a319>] vfs_read+0x89/0x130
> [  609.430010]  [<ffffffff8125b418>] SyS_read+0x58/0xd0
> [  609.430943]  [<ffffffff81a527b2>] entry_SYSCALL_64_fastpath+0x12/0x76
> [  609.432139] Code: 85 d8 fe ff ff 01 00 00 00 f6 c4 40 0f 84 59 ff
> ff ff 49 8b 47 20 48 8d 78 ff a8 01 49 0f 44 ff 8b 47 48 85 c0 0f 88
> bd 01 00 00 <0f> 0b 4d 3b 67 08 0f 85 70 ff ff ff 49 f7 07 00 18 00 00
> 74 15
[...]
> My test setup is a qemu guest machine with a pair of 4 GiB PMEM
> ramdisk test devices, one for the xfstest test disk and one for the
> scratch disk.

Is this just a plain ramdisk device, or does it need a special
configuration? Is this somehow DAX related?

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-24  9:47     ` Michal Hocko
@ 2015-12-24 11:06       ` Tetsuo Handa
  -1 siblings, 0 replies; 78+ messages in thread
From: Tetsuo Handa @ 2015-12-24 11:06 UTC (permalink / raw)
  To: mhocko, zwisler
  Cc: akpm, mgorman, rientjes, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel, ross.zwisler

Michal Hocko wrote:
> This is VM_BUG_ON_PAGE(page_mapped(page), page), right? Could you attach
> the full kernel log? It all smells like a race when OOM reaper tears
> down the mapping and there is a truncate still in progress. But hitting
> the BUG_ON just because of that doesn't make much sense to me. OOM
> reaper is essentially MADV_DONTNEED. I have to think about this some
> more, though, but I am in a holiday mode until early next year so please
> bear with me.

I don't know whether the OOM killer was invoked just before this
VM_BUG_ON_PAGE().

> Is this somehow DAX related?

4.4.0-rc6-next-20151223_new_fsync_v6+ suggests that this kernel
has "[PATCH v6 0/7] DAX fsync/msync support" applied. But I think
http://marc.info/?l=linux-mm&m=145068666428057 should be applied
when retesting. (20151223 does not have this fix.)

[  235.768779]  [<ffffffff811feba4>] ? unmap_mapping_range+0x64/0x130
[  235.769385]  [<ffffffff811febb4>] ? unmap_mapping_range+0x74/0x130
[  235.770010]  [<ffffffff810f5c3f>] ? up_write+0x1f/0x40
[  235.770501]  [<ffffffff811febb4>] ? unmap_mapping_range+0x74/0x130

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-24 11:06       ` Tetsuo Handa
@ 2015-12-24 20:39         ` Ross Zwisler
  -1 siblings, 0 replies; 78+ messages in thread
From: Ross Zwisler @ 2015-12-24 20:39 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Michal Hocko, Andrew Morton, Mel Gorman, David Rientjes,
	Linus Torvalds, Oleg Nesterov, Hugh Dickins, Andrea Arcangeli,
	Rik van Riel, linux-mm, LKML, Ross Zwisler

On Thu, Dec 24, 2015 at 4:06 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
> Michal Hocko wrote:
>> This is VM_BUG_ON_PAGE(page_mapped(page), page), right? Could you attach
>> the full kernel log? It all smells like a race when OOM reaper tears
>> down the mapping and there is a truncate still in progress. But hitting
>> the BUG_ON just because of that doesn't make much sense to me. OOM
>> reaper is essentially MADV_DONTNEED. I have to think about this some
>> more, though, but I am in a holiday mode until early next year so please
>> bear with me.
>
> I don't know whether the OOM killer was invoked just before this
> VM_BUG_ON_PAGE().
>
>> Is this somehow DAX related?
>
> 4.4.0-rc6-next-20151223_new_fsync_v6+ suggests that this kernel
> has "[PATCH v6 0/7] DAX fsync/msync support" applied. But I think
> http://marc.info/?l=linux-mm&m=145068666428057 should be applied
> when retesting. (20151223 does not have this fix.)

No, DAX was not turned on, and while my fsync/msync patches were the initial
reason I was testing (hence the new_fsync_v6 kernel name) they were not
applied during the bisect, so I'm sure they are not related to this issue.

I will retest with the patch referenced above, but it probably won't
happen until the new year.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-24  9:47     ` Michal Hocko
@ 2015-12-24 20:44       ` Ross Zwisler
  -1 siblings, 0 replies; 78+ messages in thread
From: Ross Zwisler @ 2015-12-24 20:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, David Rientjes,
	Linus Torvalds, Oleg Nesterov, Hugh Dickins, Andrea Arcangeli,
	Rik van Riel, linux-mm, LKML, Ross Zwisler

On Thu, Dec 24, 2015 at 2:47 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Wed 23-12-15 16:00:09, Ross Zwisler wrote:
> [...]
>> While running xfstests on next-20151223 I hit a pair of kernel BUGs
>> that bisected to this commit:
>>
>> 1eb3a80d8239 ("mm, oom: introduce oom reaper")
>
> Thank you for the report and the bisection.
>
>> Here is a BUG produced by generic/029 when run against XFS:
>>
>> [  235.751723] ------------[ cut here ]------------
>> [  235.752194] kernel BUG at mm/filemap.c:208!
>
> This is VM_BUG_ON_PAGE(page_mapped(page), page), right? Could you attach
> the full kernel log? It all smells like a race when OOM reaper tears
> down the mapping and there is a truncate still in progress. But hitting
> the BUG_ON just because of that doesn't make much sense to me. OOM
> reaper is essentially MADV_DONTNEED. I have to think about this some
> more, though, but I am in a holiday mode until early next year so please
> bear with me.

The two stack traces were gathered with next-20151223, so the line numbers
may have moved around a bit when compared to the actual "mm, oom: introduce
oom reaper" commit.

> [...]
>> [  235.765638] Call Trace:
>> [  235.765903]  [<ffffffff811c8493>] delete_from_page_cache+0x63/0xd0
>> [  235.766513]  [<ffffffff811dc3e5>] truncate_inode_page+0xa5/0x120
>> [  235.767088]  [<ffffffff811dc648>] truncate_inode_pages_range+0x1a8/0x7f0
>> [  235.767725]  [<ffffffff81021459>] ? sched_clock+0x9/0x10
>> [  235.768239]  [<ffffffff810db37c>] ? local_clock+0x1c/0x20
>> [  235.768779]  [<ffffffff811feba4>] ? unmap_mapping_range+0x64/0x130
>> [  235.769385]  [<ffffffff811febb4>] ? unmap_mapping_range+0x74/0x130
>> [  235.770010]  [<ffffffff810f5c3f>] ? up_write+0x1f/0x40
>> [  235.770501]  [<ffffffff811febb4>] ? unmap_mapping_range+0x74/0x130
>> [  235.771092]  [<ffffffff811dcd58>] truncate_pagecache+0x48/0x70
>> [  235.771646]  [<ffffffff811dcdb2>] truncate_setsize+0x32/0x40
>> [  235.772276]  [<ffffffff8148e972>] xfs_setattr_size+0x232/0x470
>> [  235.772839]  [<ffffffff8148ec64>] xfs_vn_setattr+0xb4/0xc0
>> [  235.773369]  [<ffffffff8127af87>] notify_change+0x237/0x350
>> [  235.773945]  [<ffffffff81257c87>] do_truncate+0x77/0xc0
>> [  235.774446]  [<ffffffff8125800f>] do_sys_ftruncate.constprop.15+0xef/0x150
>> [  235.775156]  [<ffffffff812580ae>] SyS_ftruncate+0xe/0x10
>> [  235.775650]  [<ffffffff81a527b2>] entry_SYSCALL_64_fastpath+0x12/0x76
>> [  235.776257] Code: 5f 5d c3 48 8b 43 20 48 8d 78 ff a8 01 48 0f 44
>> fb 8b 47 48 85 c0 0f 88 2b 01 00 00 48 c7 c6 a8 57 f0 81 48 89 df e8
>> fa 1a 03 00 <0f> 0b 4c 89 ce 44 89 fa 4c 89 e7 4c 89 45 b0 4c 89 4d b8
>> e8 32
>> [  235.778695] RIP  [<ffffffff811c81f6>] __delete_from_page_cache+0x206/0x440
>> [  235.779350]  RSP <ffff8800bab83b60>
>> [  235.779694] ---[ end trace fac9dd65c4cdd828 ]---
>>
>> And a different BUG produced by generic/095, also with XFS:
>>
>> [  609.398897] ------------[ cut here ]------------
>> [  609.399843] kernel BUG at mm/truncate.c:629!
>
> Hmm, I do not see any BUG_ON at this line. But there is
> BUG_ON(page_mapped(page)) at line 620.

Ditto - check out next-20151223 for real line numbers.

>> [  609.400666] invalid opcode: 0000 [#1] SMP
>> [  609.401512] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
>> [  609.402719] CPU: 4 PID: 26782 Comm: fio Tainted: G        W
>
> There was a warning before this triggered. The full kernel log would be
> helpful as well.

Sure, I can gather full kernel logs, but it'll probably be after the new year.

> [...]
>> [  609.425325] Call Trace:
>> [  609.425797]  [<ffffffff811dc307>] invalidate_inode_pages2+0x17/0x20
>> [  609.426971]  [<ffffffff81482167>] xfs_file_read_iter+0x297/0x300
>> [  609.428097]  [<ffffffff81259ac9>] __vfs_read+0xc9/0x100
>> [  609.429073]  [<ffffffff8125a319>] vfs_read+0x89/0x130
>> [  609.430010]  [<ffffffff8125b418>] SyS_read+0x58/0xd0
>> [  609.430943]  [<ffffffff81a527b2>] entry_SYSCALL_64_fastpath+0x12/0x76
>> [  609.432139] Code: 85 d8 fe ff ff 01 00 00 00 f6 c4 40 0f 84 59 ff
>> ff ff 49 8b 47 20 48 8d 78 ff a8 01 49 0f 44 ff 8b 47 48 85 c0 0f 88
>> bd 01 00 00 <0f> 0b 4d 3b 67 08 0f 85 70 ff ff ff 49 f7 07 00 18 00 00
>> 74 15
> [...]
>> My test setup is a qemu guest machine with a pair of 4 GiB PMEM
>> ramdisk test devices, one for the xfstest test disk and one for the
>> scratch disk.
>
> Is this just a plain ramdisk device, or does it need a special configuration?

Just a plain PMEM ram disk with DAX turned off.  Configuration instructions
for PMEM can be found here:

https://nvdimm.wiki.kernel.org/
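
[Editorial note: such an emulated PMEM ramdisk is typically created by
reserving regions of RAM with the memmap kernel parameter, as the wiki
above describes. The sizes and offsets below are illustrative only and
must be adjusted to the guest's physical memory layout:]

```
# Emulate two 4 GiB PMEM devices by reserving RAM at the given physical
# offsets (memmap=<size>!<offset>); offsets here are hypothetical.
memmap=4G!8G memmap=4G!16G
```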

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
@ 2015-12-24 20:44       ` Ross Zwisler
  0 siblings, 0 replies; 78+ messages in thread
From: Ross Zwisler @ 2015-12-24 20:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, David Rientjes,
	Linus Torvalds, Oleg Nesterov, Hugh Dickins, Andrea Argangeli,
	Rik van Riel, linux-mm, LKML, Ross Zwisler

On Thu, Dec 24, 2015 at 2:47 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Wed 23-12-15 16:00:09, Ross Zwisler wrote:
> [...]
>> While running xfstests on next-20151223 I hit a pair of kernel BUGs
>> that bisected to this commit:
>>
>> 1eb3a80d8239 ("mm, oom: introduce oom reaper")
>
> Thank you for the report and the bisection.
>
>> Here is a BUG produced by generic/029 when run against XFS:
>>
>> [  235.751723] ------------[ cut here ]------------
>> [  235.752194] kernel BUG at mm/filemap.c:208!
>
> This is VM_BUG_ON_PAGE(page_mapped(page), page), right? Could you attach
> the full kernel log? It all smells like a race when OOM reaper tears
> down the mapping and there is a truncate still in progress. But hitting
> the BUG_ON just because of that doesn't make much sense to me. OOM
> reaper is essentially MADV_DONTNEED. I have to think about this some
> more, though, but I am in a holiday mode until early next year so please
> bear with me.

The two stack traces were gathered with next-20151223, so the line numbers
may have moved around a bit when compared to the actual "mm, oom: introduce
oom reaper" commit.

> [...]
>> [  235.765638] Call Trace:
>> [  235.765903]  [<ffffffff811c8493>] delete_from_page_cache+0x63/0xd0
>> [  235.766513]  [<ffffffff811dc3e5>] truncate_inode_page+0xa5/0x120
>> [  235.767088]  [<ffffffff811dc648>] truncate_inode_pages_range+0x1a8/0x7f0
>> [  235.767725]  [<ffffffff81021459>] ? sched_clock+0x9/0x10
>> [  235.768239]  [<ffffffff810db37c>] ? local_clock+0x1c/0x20
>> [  235.768779]  [<ffffffff811feba4>] ? unmap_mapping_range+0x64/0x130
>> [  235.769385]  [<ffffffff811febb4>] ? unmap_mapping_range+0x74/0x130
>> [  235.770010]  [<ffffffff810f5c3f>] ? up_write+0x1f/0x40
>> [  235.770501]  [<ffffffff811febb4>] ? unmap_mapping_range+0x74/0x130
>> [  235.771092]  [<ffffffff811dcd58>] truncate_pagecache+0x48/0x70
>> [  235.771646]  [<ffffffff811dcdb2>] truncate_setsize+0x32/0x40
>> [  235.772276]  [<ffffffff8148e972>] xfs_setattr_size+0x232/0x470
>> [  235.772839]  [<ffffffff8148ec64>] xfs_vn_setattr+0xb4/0xc0
>> [  235.773369]  [<ffffffff8127af87>] notify_change+0x237/0x350
>> [  235.773945]  [<ffffffff81257c87>] do_truncate+0x77/0xc0
>> [  235.774446]  [<ffffffff8125800f>] do_sys_ftruncate.constprop.15+0xef/0x150
>> [  235.775156]  [<ffffffff812580ae>] SyS_ftruncate+0xe/0x10
>> [  235.775650]  [<ffffffff81a527b2>] entry_SYSCALL_64_fastpath+0x12/0x76
>> [  235.776257] Code: 5f 5d c3 48 8b 43 20 48 8d 78 ff a8 01 48 0f 44
>> fb 8b 47 48 85 c0 0f 88 2b 01 00 00 48 c7 c6 a8 57 f0 81 48 89 df e8
>> fa 1a 03 00 <0f> 0b 4c 89 ce 44 89 fa 4c 89 e7 4c 89 45 b0 4c 89 4d b8
>> e8 32
>> [  235.778695] RIP  [<ffffffff811c81f6>] __delete_from_page_cache+0x206/0x440
>> [  235.779350]  RSP <ffff8800bab83b60>
>> [  235.779694] ---[ end trace fac9dd65c4cdd828 ]---
>>
>> And a different BUG produced by generic/095, also with XFS:
>>
>> [  609.398897] ------------[ cut here ]------------
>> [  609.399843] kernel BUG at mm/truncate.c:629!
>
> Hmm, I do not see any BUG_ON at this line. But there is
> BUG_ON(page_mapped(page)) at line 620.

Ditto - check out next-20151223 for real line numbers.

>> [  609.400666] invalid opcode: 0000 [#1] SMP
>> [  609.401512] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
>> [  609.402719] CPU: 4 PID: 26782 Comm: fio Tainted: G        W
>
> There was a warning before this triggered. The full kernel log would be
> helpful as well.

Sure, I can gather full kernel logs, but it'll probably after the new year.

> [...]
>> [  609.425325] Call Trace:
>> [  609.425797]  [<ffffffff811dc307>] invalidate_inode_pages2+0x17/0x20
>> [  609.426971]  [<ffffffff81482167>] xfs_file_read_iter+0x297/0x300
>> [  609.428097]  [<ffffffff81259ac9>] __vfs_read+0xc9/0x100
>> [  609.429073]  [<ffffffff8125a319>] vfs_read+0x89/0x130
>> [  609.430010]  [<ffffffff8125b418>] SyS_read+0x58/0xd0
>> [  609.430943]  [<ffffffff81a527b2>] entry_SYSCALL_64_fastpath+0x12/0x76
>> [  609.432139] Code: 85 d8 fe ff ff 01 00 00 00 f6 c4 40 0f 84 59 ff
>> ff ff 49 8b 47 20 48 8d 78 ff a8 01 49 0f 44 ff 8b 47 48 85 c0 0f 88
>> bd 01 00 00 <0f> 0b 4d 3b 67 08 0f 85 70 ff ff ff 49 f7 07 00 18 00 00
>> 74 15
> [...]
>> My test setup is a qemu guest machine with a pair of 4 GiB PMEM
>> ramdisk test devices, one for the xfstest test disk and one for the
>> scratch disk.
>
> Is this just a plain ramdisk device or does it need a special configuration?

Just a plain PMEM ram disk with DAX turned off.  Configuration instructions
for PMEM can be found here:

https://nvdimm.wiki.kernel.org/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-24 20:44       ` Ross Zwisler
@ 2015-12-25 11:35         ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2015-12-25 11:35 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, David Rientjes,
	Linus Torvalds, Oleg Nesterov, Hugh Dickins, Andrea Argangeli,
	Rik van Riel, linux-mm, LKML, Ross Zwisler

On Thu 24-12-15 13:44:03, Ross Zwisler wrote:
> On Thu, Dec 24, 2015 at 2:47 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Wed 23-12-15 16:00:09, Ross Zwisler wrote:
> > [...]
> >> While running xfstests on next-20151223 I hit a pair of kernel BUGs
> >> that bisected to this commit:
> >>
> >> 1eb3a80d8239 ("mm, oom: introduce oom reaper")
> >
> > Thank you for the report and the bisection.
> >
> >> Here is a BUG produced by generic/029 when run against XFS:
> >>
> >> [  235.751723] ------------[ cut here ]------------
> >> [  235.752194] kernel BUG at mm/filemap.c:208!
> >
> > This is VM_BUG_ON_PAGE(page_mapped(page), page), right? Could you attach
> > the full kernel log? It all smells like a race when OOM reaper tears
> > down the mapping and there is a truncate still in progress. But hitting
> > the BUG_ON just because of that doesn't make much sense to me. OOM
> > reaper is essentially MADV_DONTNEED. I have to think about this some
> > more, though, but I am in a holiday mode until early next year so please
> > bear with me.
> 
> The two stack traces were gathered with next-20151223, so the line numbers
> may have moved around a bit when compared to the actual "mm, oom: introduce
> oom reaper" commit.

I was looking at the same next tree, I believe
$ git describe
next-20151223

[...]
> > There was a warning before this triggered. The full kernel log would be
> > helpful as well.
> 
> Sure, I can gather full kernel logs, but it'll probably be after the new year.

OK, I will wait for the logs. It would be really interesting to see the
timing between the OOM killer invocation and this trace.

> > [...]
> >> [  609.425325] Call Trace:
> >> [  609.425797]  [<ffffffff811dc307>] invalidate_inode_pages2+0x17/0x20
> >> [  609.426971]  [<ffffffff81482167>] xfs_file_read_iter+0x297/0x300
> >> [  609.428097]  [<ffffffff81259ac9>] __vfs_read+0xc9/0x100
> >> [  609.429073]  [<ffffffff8125a319>] vfs_read+0x89/0x130
> >> [  609.430010]  [<ffffffff8125b418>] SyS_read+0x58/0xd0
> >> [  609.430943]  [<ffffffff81a527b2>] entry_SYSCALL_64_fastpath+0x12/0x76
> >> [  609.432139] Code: 85 d8 fe ff ff 01 00 00 00 f6 c4 40 0f 84 59 ff
> >> ff ff 49 8b 47 20 48 8d 78 ff a8 01 49 0f 44 ff 8b 47 48 85 c0 0f 88
> >> bd 01 00 00 <0f> 0b 4d 3b 67 08 0f 85 70 ff ff ff 49 f7 07 00 18 00 00
> >> 74 15
> > [...]
> >> My test setup is a qemu guest machine with a pair of 4 GiB PMEM
> >> ramdisk test devices, one for the xfstest test disk and one for the
> >> scratch disk.
> >
> > Is this just a plain ramdisk device or does it need a special configuration?
> 
> Just a plain PMEM ram disk with DAX turned off.  Configuration instructions
> for PMEM can be found here:
> 
> https://nvdimm.wiki.kernel.org/

Thanks, I will try to reproduce this early next year. But so far I think
this is just a general issue of MADV_DONTNEED vs. truncate, and oom_reaper
is just lucky enough to trigger it. There shouldn't be anything oom_reaper
specific here. Maybe some additional locking is missing?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-24 11:06       ` Tetsuo Handa
@ 2015-12-25 11:41         ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2015-12-25 11:41 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: zwisler, akpm, mgorman, rientjes, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel, ross.zwisler

On Thu 24-12-15 20:06:50, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > This is VM_BUG_ON_PAGE(page_mapped(page), page), right? Could you attach
> > the full kernel log? It all smells like a race when OOM reaper tears
> > down the mapping and there is a truncate still in progress. But hitting
> > the BUG_ON just because of that doesn't make much sense to me. OOM
> > reaper is essentially MADV_DONTNEED. I have to think about this some
> > more, though, but I am in a holiday mode until early next year so please
> > bear with me.
> 
> I don't know whether the OOM killer was invoked just before this
> VM_BUG_ON_PAGE().
> 
> > Is this somehow DAX related?
> 
> 4.4.0-rc6-next-20151223_new_fsync_v6+ suggests that this kernel
> has "[PATCH v6 0/7] DAX fsync/msync support" applied. But I think
> http://marc.info/?l=linux-mm&m=145068666428057 should be applied
> when retesting. (20151223 does not have this fix.)

Hmm, I think you are right! Very well spotted! If ignore_dirty ends up
being true then we simply skip over the dirty page and never end up
calling page_remove_rmap. I can see how the truncation code could later
trip over this page.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-25 11:35         ` Michal Hocko
@ 2015-12-25 11:44           ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2015-12-25 11:44 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, David Rientjes,
	Linus Torvalds, Oleg Nesterov, Hugh Dickins, Andrea Argangeli,
	Rik van Riel, linux-mm, LKML, Ross Zwisler

On Fri 25-12-15 12:35:37, Michal Hocko wrote:
[...]
> Thanks, I will try to reproduce this early next year. But so far I think
> this is just a general issue of MADV_DONTNEED vs. truncate, and oom_reaper
> is just lucky enough to trigger it. There shouldn't be anything oom_reaper
> specific here. Maybe some additional locking is missing?

Hmm, scratch that. I think Tetsuo has nailed it. It seems like
the missing initialization of the details structure during unmap
is the culprit. So there most probably was no OOM killing
invoked at all. It is just a side effect of the patch plus the missing
http://marc.info/?l=linux-mm&m=145068666428057 follow-up fix.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2015-12-21 20:38   ` Paul Gortmaker
@ 2016-01-06  9:10     ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2016-01-06  9:10 UTC (permalink / raw)
  To: Paul Gortmaker
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, David Rientjes,
	Linus Torvalds, Oleg Nesterov, Hugh Dickins, Andrea Argangeli,
	Rik van Riel, linux-mm, LKML

On Mon 21-12-15 15:38:21, Paul Gortmaker wrote:
[...]
> ...use one of the non-modular initcalls here?   I'm trying to clean up most of
> the non-modular uses of modular macros etc. since:
> 
>  (1) it is easy to accidentally code up an unused module_exit function
>  (2) it can be misleading when reading the source, thinking it can be
>       modular when the Makefile and/or Kconfig prohibit it
>  (3) it requires the include of the module.h header file which in turn
>      includes nearly everything else, thus increasing CPP overhead.
> 
> I figured no point in sending a follow on patch since this came in via
> the akpm tree into next and that gets rebased/updated regularly.

Sorry for the late reply. I was mostly offline throughout the last 2
weeks last year. Is the following what you would like to see? If yes I
will fold it into the original patch.

Thanks!
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7a9678c50edd..1ece40b94725 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -36,7 +36,7 @@
 #include <linux/ftrace.h>
 #include <linux/ratelimit.h>
 #include <linux/kthread.h>
-#include <linux/module.h>
+#include <linux/init.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -541,7 +541,7 @@ static int __init oom_init(void)
 	}
 	return 0;
 }
-module_init(oom_init)
+subsys_initcall(oom_init)
 #else
 static void wake_oom_reaper(struct mm_struct *mm)
 {
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-01-06  9:10     ` Michal Hocko
@ 2016-01-06 14:26       ` Paul Gortmaker
  -1 siblings, 0 replies; 78+ messages in thread
From: Paul Gortmaker @ 2016-01-06 14:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, David Rientjes,
	Linus Torvalds, Oleg Nesterov, Hugh Dickins, Andrea Argangeli,
	Rik van Riel, linux-mm, LKML

[Re: [PATCH 1/2] mm, oom: introduce oom reaper] On 06/01/2016 (Wed 10:10) Michal Hocko wrote:

> On Mon 21-12-15 15:38:21, Paul Gortmaker wrote:
> [...]
> > ...use one of the non-modular initcalls here?   I'm trying to clean up most of
> > the non-modular uses of modular macros etc. since:
> > 
> >  (1) it is easy to accidentally code up an unused module_exit function
> >  (2) it can be misleading when reading the source, thinking it can be
> >       modular when the Makefile and/or Kconfig prohibit it
> >  (3) it requires the include of the module.h header file which in turn
> >      includes nearly everything else, thus increasing CPP overhead.
> > 
> > I figured no point in sending a follow on patch since this came in via
> > the akpm tree into next and that gets rebased/updated regularly.
> 
> Sorry for the late reply. I was mostly offline throughout the last 2
> weeks last year. Is the following what you would like to see? If yes I
> will fold it into the original patch.

Yes, that looks fine.  Do note that subsys_initcall is earlier than the
module_init that you were using previously, though.

Thanks,
Paul.
--

> 
> Thanks!
> ---
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7a9678c50edd..1ece40b94725 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -36,7 +36,7 @@
>  #include <linux/ftrace.h>
>  #include <linux/ratelimit.h>
>  #include <linux/kthread.h>
> -#include <linux/module.h>
> +#include <linux/init.h>
>  
>  #include <asm/tlb.h>
>  #include "internal.h"
> @@ -541,7 +541,7 @@ static int __init oom_init(void)
>  	}
>  	return 0;
>  }
> -module_init(oom_init)
> +subsys_initcall(oom_init)
>  #else
>  static void wake_oom_reaper(struct mm_struct *mm)
>  {
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-01-06 14:26       ` Paul Gortmaker
@ 2016-01-06 15:00         ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2016-01-06 15:00 UTC (permalink / raw)
  To: Paul Gortmaker
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, David Rientjes,
	Linus Torvalds, Oleg Nesterov, Hugh Dickins, Andrea Argangeli,
	Rik van Riel, linux-mm, LKML

On Wed 06-01-16 09:26:12, Paul Gortmaker wrote:
> [Re: [PATCH 1/2] mm, oom: introduce oom reaper] On 06/01/2016 (Wed 10:10) Michal Hocko wrote:
> 
> > On Mon 21-12-15 15:38:21, Paul Gortmaker wrote:
> > [...]
> > > ...use one of the non-modular initcalls here?   I'm trying to clean up most of
> > > the non-modular uses of modular macros etc. since:
> > > 
> > >  (1) it is easy to accidentally code up an unused module_exit function
> > >  (2) it can be misleading when reading the source, thinking it can be
> > >       modular when the Makefile and/or Kconfig prohibit it
> > >  (3) it requires the include of the module.h header file which in turn
> > >      includes nearly everything else, thus increasing CPP overhead.
> > > 
> > > I figured no point in sending a follow on patch since this came in via
> > > the akpm tree into next and that gets rebased/updated regularly.
> > 
> > Sorry for the late reply. I was mostly offline throughout the last 2
> > weeks last year. Is the following what you would like to see? If yes I
> > will fold it into the original patch.
> 
> Yes, that looks fine.  Do note that subsys_initcall is earlier than the
> module_init that you were using previously, though.

Yes, I noticed that, but quite honestly the module_init choice was just
"look around and use what others are doing", so there was no particular
reason to stick with that ordering.

Thanks for double checking after me!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-02-02 22:51             ` David Rientjes
@ 2016-02-03 10:31               ` Tetsuo Handa
  -1 siblings, 0 replies; 78+ messages in thread
From: Tetsuo Handa @ 2016-02-03 10:31 UTC (permalink / raw)
  To: rientjes, mhocko, hannes
  Cc: akpm, mgorman, torvalds, oleg, hughd, andrea, riel, linux-mm,
	linux-kernel

David Rientjes wrote:
> On Tue, 2 Feb 2016, Michal Hocko wrote:
> > > I'm baffled by any reference to "memcg oom heavy loads", I don't 
> > > understand this paragraph, sorry.  If a memcg is oom, we shouldn't be
> > > disrupting the global runqueue by running oom_reaper at a high priority.  
> > > The disruption itself is not only in first wakeup but also in how long the 
> > > reaper can run and when it is rescheduled: for a lot of memory this is 
> > > potentially long.  The reaper is best-effort, as the changelog indicates, 
> > > and we shouldn't have a reliance on this high priority: oom kill exiting 
> > > can't possibly be expected to be immediate.  This high priority should be 
> > > removed so memcg oom conditions are isolated and don't affect other loads.
> > 
> > If this is a concern then I would be tempted to simply disable oom
> > reaper for memcg oom altogether. For me it is much more important that
> > the reaper, even though a best effort, is guaranteed to schedule if
> > something goes terribly wrong on the machine.
> > 
> 
> I don't believe the higher priority guarantees it is able to schedule any 
> more than it was guaranteed to schedule before.  It will run, but it won't 
> preempt other innocent processes in disjoint memcgs or cpusets.  It's not 
> only a memcg issue, but it also impacts disjoint cpuset mems and mempolicy 
> nodemasks.  I think it would be disappointing to leave those out.  I think 
> the higher priority should simply be removed in terms of fairness.
> 
> Other than these issues, I don't see any reason why a refreshed series 
> wouldn't be immediately acked.  Thanks very much for continuing to work on 
> this!
> 

Excuse me, but I came to think that we should try to wake up the OOM reaper at

    if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
        if (!is_sysrq_oom(oc))
            return OOM_SCAN_ABORT;
    }

in oom_scan_process_thread() rather than at oom_kill_process() or at
mark_oom_victim(). Waking up the OOM reaper there will try to reap
task->mm and eventually give up, which in turn naturally allows the
OOM killer to choose the next OOM victim. The key point is PATCH 2/5
shown below. What do you think?

PATCH 1/5 is (I think) a bug fix.
PATCH 2/5 is for waking up the OOM reaper from victim selection loop.
PATCH 3/5 is for helping the OOM killer to choose next OOM victim.
PATCH 4/5 is for handling corner cases.
PATCH 5/5 is for changing the OOM reaper to use default priority.

 include/linux/oom.h |    3
 mm/oom_kill.c       |  173 ++++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 136 insertions(+), 40 deletions(-)

----------------------------------------
>From e1c0a78fbfd0a76f367efac269cbcf22c7df9292 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 3 Feb 2016 14:18:19 +0900
Subject: [PATCH 1/5] mm,oom: Fix incorrect oom_task_origin check.

Currently, the OOM killer unconditionally selects p if oom_task_origin(p)
is true, but p should not be OOM-killed if p is marked as OOM-unkillable.

This patch does not fix the race condition where p is selected because it
happened to be between set_current_oom_origin() and actually starting the
operations that might trigger an OOM event, while the OOM event was
triggered for some reason other than operations by p.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 include/linux/oom.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 45993b8..59481e6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -67,7 +67,8 @@ static inline void clear_current_oom_origin(void)
 
 static inline bool oom_task_origin(const struct task_struct *p)
 {
-	return !!(p->signal->oom_flags & OOM_FLAG_ORIGIN);
+	return (p->signal->oom_flags & OOM_FLAG_ORIGIN) &&
+		p->signal->oom_score_adj != OOM_SCORE_ADJ_MIN;
 }
 
 extern void mark_oom_victim(struct task_struct *tsk);
-- 
1.8.3.1

>From 76cf60d33e4e1daa475e4c1e39087415a309c6e9 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 3 Feb 2016 14:20:07 +0900
Subject: [PATCH 2/5] mm,oom: Change timing of waking up the OOM reaper

Currently, the OOM reaper kernel thread is woken up when we set TIF_MEMDIE
on a task. But it is not easy to build a reliable OOM-reap queuing chain.

Since the OOM livelock problem occurs when we find TIF_MEMDIE on a task
which cannot terminate, waking up the OOM reaper when we find TIF_MEMDIE
on a task simplifies handling of the chain. Also, we don't need to wake
up the OOM reaper if the victim can terminate smoothly. Therefore, this
patch replaces wake_oom_reaper(), called from oom_kill_process(), with
try_oom_reap(), called from oom_scan_process_thread().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 79 insertions(+), 20 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b51bcce..07c6389 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -268,6 +268,8 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc,
 }
 #endif
 
+static bool try_oom_reap(struct task_struct *tsk);
+
 enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 			struct task_struct *task, unsigned long totalpages)
 {
@@ -279,7 +281,7 @@ enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 	 * Don't allow any other task to have access to the reserves.
 	 */
 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-		if (!is_sysrq_oom(oc))
+		if (!is_sysrq_oom(oc) && try_oom_reap(task))
 			return OOM_SCAN_ABORT;
 	}
 	if (!task->mm)
@@ -420,6 +422,40 @@ static struct task_struct *oom_reaper_th;
 static struct task_struct *task_to_reap;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 
+static bool mm_is_reapable(struct mm_struct *mm)
+{
+	struct task_struct *g;
+	struct task_struct *p;
+
+	/*
+	 * Since it is possible that p voluntarily called do_exit() or
+	 * somebody other than the OOM killer sent SIGKILL on p, this mm used
+	 * by p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN is reapable if p
+	 * has pending SIGKILL or already reached do_exit().
+	 *
+	 * On the other hand, it is possible that mark_oom_victim(p) is called
+	 * without sending SIGKILL to all tasks using this mm. In this case,
+	 * the OOM reaper cannot reap this mm unless p is the only task using
+	 * this mm.
+	 *
+	 * Therefore, determine whether this mm is reapable by testing whether
+	 * all tasks using this mm are dying or already exiting rather than
+	 * depending on p->signal->oom_score_adj value which is updated by the
+	 * OOM reaper.
+	 */
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		if (mm != READ_ONCE(p->mm) ||
+		    fatal_signal_pending(p) || (p->flags & PF_EXITING))
+			continue;
+		mm = NULL;
+		goto out;
+	}
+ out:
+	rcu_read_unlock();
+	return mm != NULL;
+}
+
 static bool __oom_reap_task(struct task_struct *tsk)
 {
 	struct mmu_gather tlb;
@@ -448,7 +484,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 
 	task_unlock(p);
 
-	if (!down_read_trylock(&mm->mmap_sem)) {
+	if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
 		goto out;
 	}
@@ -500,7 +536,7 @@ static void oom_reap_task(struct task_struct *tsk)
 	while (attempts++ < 10 && !__oom_reap_task(tsk))
 		schedule_timeout_idle(HZ/10);
 
-	/* Drop a reference taken by wake_oom_reaper */
+	/* Drop a reference taken by try_oom_reap */
 	put_task_struct(tsk);
 }
 
@@ -512,18 +548,44 @@ static int oom_reaper(void *unused)
 		wait_event_freezable(oom_reaper_wait,
 				     (tsk = READ_ONCE(task_to_reap)));
 		oom_reap_task(tsk);
+		/*
+		 * The OOM killer might be about to call try_oom_reap() after
+		 * seeing TIF_MEMDIE.
+		 */
+		smp_wmb();
 		WRITE_ONCE(task_to_reap, NULL);
 	}
 
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+static bool try_oom_reap(struct task_struct *tsk)
 {
 	struct task_struct *old_tsk;
 
+	/*
+	 * We will livelock if we unconditionally return true.
+	 * We will kill all tasks if we unconditionally return false.
+	 */
 	if (!oom_reaper_th)
-		return;
+		return true;
+
+	/*
+	 * Wait for the OOM reaper to reap this task and mark this task
+	 * as OOM-unkillable and clear TIF_MEMDIE. Since the OOM reaper
+	 * has high scheduling priority, we can unconditionally wait for
+	 * completion.
+	 */
+	if (task_to_reap)
+		return true;
+
+	/*
+	 * The OOM reaper might be about to clear task_to_reap after
+	 * clearing TIF_MEMDIE.
+	 */
+	smp_rmb();
+	if (!test_tsk_thread_flag(tsk, TIF_MEMDIE))
+		return false;
 
 	get_task_struct(tsk);
 
@@ -537,6 +599,7 @@ static void wake_oom_reaper(struct task_struct *tsk)
 		wake_up(&oom_reaper_wait);
 	else
 		put_task_struct(tsk);
+	return true;
 }
 
 static int __init oom_init(void)
@@ -559,8 +622,13 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static void wake_oom_reaper(struct task_struct *mm)
+static bool try_oom_reap(struct task_struct *tsk)
 {
+	/*
+	 * We will livelock if we unconditionally return true.
+	 * We will kill all tasks if we unconditionally return false.
+	 */
+	return true;
 }
 #endif
 
@@ -592,7 +660,8 @@ void mark_oom_victim(struct task_struct *tsk)
  */
 void exit_oom_victim(struct task_struct *tsk)
 {
-	clear_tsk_thread_flag(tsk, TIF_MEMDIE);
+	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
+		return;
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
@@ -669,7 +738,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -762,23 +830,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (is_global_init(p))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-			/*
-			 * We cannot use oom_reaper for the mm shared by this
-			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
-			 */
-			can_oom_reap = false;
+		if (unlikely(p->flags & PF_KTHREAD))
+			continue;
+		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 			continue;
-		}
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
-
 	mmdrop(mm);
 	put_task_struct(victim);
 }
-- 
1.8.3.1

>From 8c6024b963d5b4e8d38a3416e14b458e1e073607 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 3 Feb 2016 14:42:28 +0900
Subject: [PATCH 3/5] mm,oom: Always update OOM score and clear TIF_MEMDIE
 after OOM reap.

This patch updates the victim's oom_score_adj and clears TIF_MEMDIE
even if the OOM reaper failed to reap the victim's memory. This is
needed to handle corner cases where TIF_MEMDIE is set on a victim
without sending SIGKILL to all tasks sharing the same memory.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 50 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 40 insertions(+), 10 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 07c6389..a0ae8dc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -413,6 +413,44 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
 bool oom_killer_disabled __read_mostly;
 
+static void update_victim_score(struct task_struct *tsk, bool reap_success)
+{
+	/*
+	 * If we succeeded to reap a mm, mark that task using it as
+	 * OOM-unkillable and clear TIF_MEMDIE, for the task shouldn't be
+	 * sitting on a reasonably reclaimable memory anymore.
+	 * OOM killer can continue by selecting other victim if unmapping
+	 * hasn't led to any improvements. This also means that selecting
+	 * this task doesn't make any sense.
+	 */
+	if (reap_success)
+		tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+	/*
+	 * If we failed to reap a mm, mark that task using it as almost
+	 * OOM-unkillable and clear TIF_MEMDIE. This will help future
+	 * select_bad_process() try to select other OOM-killable tasks
+	 * before selecting that task again.
+	 */
+	else if (tsk->signal->oom_score_adj > OOM_SCORE_ADJ_MIN + 1)
+		tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN + 1;
+	/*
+	 * But if that task got TIF_MEMDIE when that task is already marked as
+	 * almost OOM-unkillable, mark that task completely OOM-unkillable.
+	 * Otherwise, we cannot make progress when all OOM-killable tasks are
+	 * marked as almost OOM-unkillable.
+	 *
+	 * Note that the reason we fail to reap a mm might be that there are
+	 * tasks using this mm with neither a pending SIGKILL nor PF_EXITING,
+	 * which means that we set TIF_MEMDIE on a task without sending SIGKILL
+	 * to tasks sharing this mm. In this case, we will call panic() without
+	 * sending SIGKILL to tasks sharing this mm when all OOM-killable tasks
+	 * are marked as completely OOM-unkillable.
+	 */
+	else
+		tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+	exit_oom_victim(tsk);
+}
+
 #ifdef CONFIG_MMU
 /*
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
@@ -513,16 +551,6 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
 	up_read(&mm->mmap_sem);
-
-	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore. OOM killer can continue
-	 * by selecting other victim if unmapping hasn't led to any
-	 * improvements. This also means that selecting this task doesn't
-	 * make any sense.
-	 */
-	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
-	exit_oom_victim(tsk);
 out:
 	mmput(mm);
 	return ret;
@@ -536,6 +564,8 @@ static void oom_reap_task(struct task_struct *tsk)
 	while (attempts++ < 10 && !__oom_reap_task(tsk))
 		schedule_timeout_idle(HZ/10);
 
+	update_victim_score(tsk, attempts < 10);
+
 	/* Drop a reference taken by try_oom_reap */
 	put_task_struct(tsk);
 }
-- 
1.8.3.1

>From d6254acc565af7456fb21c0bb7568452fb227f3c Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 3 Feb 2016 14:44:47 +0900
Subject: [PATCH 4/5] mm,oom: Add timeout counter for handling corner cases.

Currently, we can hit an OOM livelock if the OOM reaper kernel thread
is not available. This patch adds simple timeout-based next-victim
selection logic for the case where the OOM reaper kernel thread is
unavailable.

A future patch will add hooks for allowing global access to memory
reserves before this timeout counter expires.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index a0ae8dc..e4e955b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -451,6 +451,11 @@ static void update_victim_score(struct task_struct *tsk, bool reap_success)
 	exit_oom_victim(tsk);
 }
 
+static void oomkiller_reset(unsigned long arg)
+{
+}
+static DEFINE_TIMER(oomkiller_victim_wait_timer, oomkiller_reset, 0, 0);
+
 #ifdef CONFIG_MMU
 /*
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
@@ -596,9 +601,14 @@ static bool try_oom_reap(struct task_struct *tsk)
 	/*
 	 * We will livelock if we unconditionally return true.
 	 * We will kill all tasks if we unconditionally return false.
+	 * Thus, use a simple timeout counter if the OOM reaper is unavailable.
 	 */
-	if (!oom_reaper_th)
-		return true;
+	if (!oom_reaper_th) {
+		if (timer_pending(&oomkiller_victim_wait_timer))
+			return true;
+		update_victim_score(tsk, false);
+		return false;
+	}
 
 	/*
 	 * Wait for the OOM reaper to reap this task and mark this task
@@ -654,11 +664,11 @@ subsys_initcall(oom_init)
 #else
 static bool try_oom_reap(struct task_struct *tsk)
 {
-	/*
-	 * We will livelock if we unconditionally return true.
-	 * We will kill all tasks if we unconditionally return false.
-	 */
-	return true;
+	/* Use a simple timeout counter, for the OOM reaper is unavailable. */
+	if (timer_pending(&oomkiller_victim_wait_timer))
+		return true;
+	update_victim_score(tsk, false);
+	return false;
 }
 #endif

@@ -683,6 +693,8 @@ void mark_oom_victim(struct task_struct *tsk)
 	 */
 	__thaw_task(tsk);
 	atomic_inc(&oom_victims);
+	/* Make sure that we won't wait for this task forever. */
+	mod_timer(&oomkiller_victim_wait_timer, jiffies + 5 * HZ);
 }
 
 /**
-- 
1.8.3.1

>From 6156462d2db03bfc9fe76ca5a3f0ebcc5a88a12e Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 3 Feb 2016 14:45:29 +0900
Subject: [PATCH 5/5] mm,oom: Use normal scheduling priority for the OOM reaper

Currently, the OOM reaper kernel thread runs at high scheduling
priority in order to make sure that the OOM-reap operation starts
immediately.

This patch changes the scheduling priority to normal and falls back to
the simple timeout-based next-victim selection logic if the OOM reaper
fails to get enough CPU time.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 16 +++++-----------
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e4e955b..b55159f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -604,6 +604,7 @@ static bool try_oom_reap(struct task_struct *tsk)
 	 * Thus, use a simple timeout counter if the OOM reaper is unavailable.
 	 */
 	if (!oom_reaper_th) {
+check_timeout:
 		if (timer_pending(&oomkiller_victim_wait_timer))
 			return true;
 		update_victim_score(tsk, false);
@@ -613,11 +614,12 @@ static bool try_oom_reap(struct task_struct *tsk)
 	/*
 	 * Wait for the OOM reaper to reap this task and mark this task
 	 * as OOM-unkillable and clear TIF_MEMDIE. Since the OOM reaper
-	 * has high scheduling priority, we can unconditionally wait for
-	 * completion.
+	 * has normal scheduling priority, we can't wait for completion
+	 * forever. Thus, use a simple timeout counter in case the OOM
+	 * reaper fails to get enough CPU resource.
 	 */
 	if (task_to_reap)
-		return true;
+		goto check_timeout;
 
 	/*
 	 * The OOM reaper might be about to clear task_to_reap after
@@ -649,14 +651,6 @@ static int __init oom_init(void)
 		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
 				PTR_ERR(oom_reaper_th));
 		oom_reaper_th = NULL;
-	} else {
-		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
-
-		/*
-		 * Make sure our oom reaper thread will get scheduled when
-		 * ASAP and that it won't get preempted by malicious userspace.
-		 */
-		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
 	}
 	return 0;
 }
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 78+ messages in thread


* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-02-02 11:48             ` Tetsuo Handa
@ 2016-02-02 22:55               ` David Rientjes
  -1 siblings, 0 replies; 78+ messages in thread
From: David Rientjes @ 2016-02-02 22:55 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, akpm, mgorman, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel

On Tue, 2 Feb 2016, Tetsuo Handa wrote:

> Maybe we all agree with introducing OOM reaper without queuing, but I do
> want to see a guarantee for scheduling for next OOM-kill operation before
> trying to build a reliable queuing chain.
> 

The race can be fixed in two ways which I've already enumerated, but the 
scheduling issue is tangential: the oom_reaper kthread is going to run; 
increasing its priority will only interfere with other innocent processes 
that are not attached to the oom memcg hierarchy, have disjoint cpuset 
mems, or are happily allocating from mempolicy nodes with free memory.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-02-02  8:57           ` Michal Hocko
@ 2016-02-02 22:51             ` David Rientjes
  -1 siblings, 0 replies; 78+ messages in thread
From: David Rientjes @ 2016-02-02 22:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Tue, 2 Feb 2016, Michal Hocko wrote:

> > Not exclude them, but I would have expected untrack_pfn().
> 
> My understanding is that vm_normal_page will do the right thing for
> those mappings - especially for CoW VM_PFNMAP which are normal pages
> AFAIU. Wrt. to untrack_pfn I was relying that the victim will eventually
> enter exit_mmap and do the remaining housekeeping. Maybe I am missing
> something but untrack_pfn shouldn't lead to releasing a considerable
> amount of memory. So is this really necessary or we can wait for
> exit_mmap?
> 

I think if you move the code to mm/memory.c, you may find a greater 
opportunity to share code with the implementations there, and this will 
take care of itself :)  I'm concerned about this also from a 
maintainability standpoint, where a future patch might modify one 
implementation while forgetting about the other.  I think there's a great 
opportunity here for a really clean and shiny interface that doesn't 
introduce any more complexity.

> > The problem is that this is racy and quite easy to trigger: imagine if 
> > __oom_reap_vmas() finds mm->mm_users == 0, because the memory of the 
> > victim has been freed, and then another system-wide oom condition occurs 
> > before the oom reaper's mm_to_reap has been set to NULL.
> 
> Yes I realize this is potentially racy. I just didn't consider the race
> important enough to justify task queuing in the first submission. Tetsuo
> was pushing for this already and I tried to push back for simplicity in
> the first submission. But ohh well... I will queue up a patch to do this
> on top. I plan to repost the full patchset shortly.
> 

Ok, thanks!  It should probably be dropped from -mm in the interim until 
it has some acked-by's, but I think those will come pretty quickly once 
it's refreshed if all of this is handled.

> > In this case, the oom reaper has ignored the next victim and doesn't do 
> > anything; the simple race has prevented it from zapping memory and does 
> > not reduce the livelock probability.
> > 
> > This can be solved either by queueing mm's to reap or involving the oom 
> > reaper into the oom killer synchronization itself.
> 
> as we have already discussed previously oom reaper is really tricky to
> be called from the direct OOM context. I will go with queuing. 
>  

Hmm, I wasn't referring to oom context: it would be possible without 
queueing with an mm_to_reap_lock (or cmpxchg) in the oom reaper and when 
the final mmput() is done.  Set it when the mm is ready for reaping, clear 
it when the mm is being destroyed, and test it before calling the oom 
killer.  I think we'd want to defer the oom killer until potential reaping 
could be done anyway and I don't anticipate an issue where oom_reaper 
fails to schedule.

> > I'm baffled by any reference to "memcg oom heavy loads", I don't 
> > understand this paragraph, sorry.  If a memcg is oom, we shouldn't be
> > disrupting the global runqueue by running oom_reaper at a high priority.  
> > The disruption itself is not only in first wakeup but also in how long the 
> > reaper can run and when it is rescheduled: for a lot of memory this is 
> > potentially long.  The reaper is best-effort, as the changelog indicates, 
> > and we shouldn't have a reliance on this high priority: oom kill exiting 
> > can't possibly be expected to be immediate.  This high priority should be 
> > removed so memcg oom conditions are isolated and don't affect other loads.
> 
> If this is a concern then I would be tempted to simply disable oom
> reaper for memcg oom altogether. For me it is much more important that
> the reaper, even though a best effort, is guaranteed to schedule if
> something goes terribly wrong on the machine.
> 

I don't believe the higher priority guarantees it is able to schedule any 
more than it was guaranteed to schedule before.  It will run, but it won't 
preempt other innocent processes in disjoint memcgs or cpusets.  It's not 
only a memcg issue, but it also impacts disjoint cpuset mems and mempolicy 
nodemasks.  I think it would be disappointing to leave those out.  I think 
the higher priority should simply be removed in terms of fairness.

Other than these issues, I don't see any reason why a refreshed series 
wouldn't be immediately acked.  Thanks very much for continuing to work on 
this!

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-02-02  8:57           ` Michal Hocko
@ 2016-02-02 11:48             ` Tetsuo Handa
  -1 siblings, 0 replies; 78+ messages in thread
From: Tetsuo Handa @ 2016-02-02 11:48 UTC (permalink / raw)
  To: mhocko, rientjes
  Cc: akpm, mgorman, penguin-kernel, torvalds, oleg, hughd, andrea,
	riel, linux-mm, linux-kernel

Michal Hocko wrote:
> > In this case, the oom reaper has ignored the next victim and doesn't do 
> > anything; the simple race has prevented it from zapping memory and does 
> > not reduce the livelock probability.
> > 
> > This can be solved either by queueing mm's to reap or involving the oom 
> > reaper into the oom killer synchronization itself.
> 
> as we have already discussed previously oom reaper is really tricky to
> be called from the direct OOM context. I will go with queuing. 
>  

OK. But it is not easy to build a reliable OOM-reap queuing chain. I think
that a dedicated kernel thread which performs both the OOM-kill operation
and the OOM-reap operation will be needed. That will also handle the
"sleeping for too long with oom_lock held after sending SIGKILL" problem.

> > I'm baffled by any reference to "memcg oom heavy loads", I don't 
> > understand this paragraph, sorry.  If a memcg is oom, we shouldn't be
> > disrupting the global runqueue by running oom_reaper at a high priority.  
> > The disruption itself is not only in first wakeup but also in how long the 
> > reaper can run and when it is rescheduled: for a lot of memory this is 
> > potentially long.  The reaper is best-effort, as the changelog indicates, 
> > and we shouldn't have a reliance on this high priority: oom kill exiting 
> > can't possibly be expected to be immediate.  This high priority should be 
> > removed so memcg oom conditions are isolated and don't affect other loads.
> 
> If this is a concern then I would be tempted to simply disable oom
> reaper for memcg oom altogether. For me it is much more important that
> the reaper, even though a best effort, is guaranteed to schedule if
> something goes terribly wrong on the machine.

I think that if something goes terribly wrong on the machine, a guarantee for
scheduling the reaper will not help unless we build a reliable queuing chain.
Building a reliable queuing chain will break some of assumptions provided by
current behavior. For me, a guarantee for scheduling for next OOM-kill
operation (with globally opening some or all of memory reserves) before
building a reliable queuing chain is much more important.

>                       But ohh well... I will queue up a patch to do this
> on top. I plan to repost the full patchset shortly.

Maybe we all agree with introducing OOM reaper without queuing, but I do
want to see a guarantee for scheduling for next OOM-kill operation before
trying to build a reliable queuing chain.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-02-02  3:02         ` David Rientjes
@ 2016-02-02  8:57           ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2016-02-02  8:57 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Mon 01-02-16 19:02:06, David Rientjes wrote:
> On Thu, 28 Jan 2016, Michal Hocko wrote:
> 
> > [...]
> > > > +static bool __oom_reap_vmas(struct mm_struct *mm)
> > > > +{
> > > > +	struct mmu_gather tlb;
> > > > +	struct vm_area_struct *vma;
> > > > +	struct zap_details details = {.check_swap_entries = true,
> > > > +				      .ignore_dirty = true};
> > > > +	bool ret = true;
> > > > +
> > > > +	/* We might have raced with exit path */
> > > > +	if (!atomic_inc_not_zero(&mm->mm_users))
> > > > +		return true;
> > > > +
> > > > +	if (!down_read_trylock(&mm->mmap_sem)) {
> > > > +		ret = false;
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	tlb_gather_mmu(&tlb, mm, 0, -1);
> > > > +	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > > > +		if (is_vm_hugetlb_page(vma))
> > > > +			continue;
> > > > +
> > > > +		/*
> > > > +		 * mlocked VMAs require explicit munlocking before unmap.
> > > > +		 * Let's keep it simple here and skip such VMAs.
> > > > +		 */
> > > > +		if (vma->vm_flags & VM_LOCKED)
> > > > +			continue;
> > > 
> > > Shouldn't there be VM_PFNMAP handling here?
> > 
> > What would be the reason to exclude them?
> > 
> 
> Not exclude them, but I would have expected untrack_pfn().

My understanding is that vm_normal_page will do the right thing for
those mappings - especially for CoW VM_PFNMAP which are normal pages
AFAIU. Wrt. to untrack_pfn I was relying that the victim will eventually
enter exit_mmap and do the remaining housekeeping. Maybe I am missing
something but untrack_pfn shouldn't lead to releasing a considerable
amount of memory. So is this really necessary or we can wait for
exit_mmap?

> > > I'm wondering why zap_page_range() for vma->vm_start to vma->vm_end wasn't 
> > > used here for simplicity?
> > 
> > I didn't use zap_page_range because I wanted to have full control over
> > what and how gets torn down. E.g. it is much easier to skip over
> > hugetlb pages than relying on i_mmap_lock_write which might be blocked
> > and the whole oom_reaper will get stuck.
> > 
> 
> Let me be clear that I think the implementation is fine, minus the missing 
> handling for VM_PFNMAP.  However, I think this implementation is better 
> placed into mm/memory.c to do the iteration, selection criteria, and then 
> unmap_page_range().  I don't think we should be exposing 
> unmap_page_range() globally, but rather add a new function to do the 
> iteration in mm/memory.c with the others.

I do not have any objections to moving the code, but I felt this is a
single-purpose thingy which doesn't need wider exposure. The exclusion
criteria are tightly coupled to what the oom reaper is allowed to do. In
other words, such a function wouldn't be reusable for, say, MADV_DONTNEED
because it has different criteria. Having all the selection criteria
close to __oom_reap_task on the other hand makes it easier to evaluate
their relevance. So I am not really convinced. I can move it if you feel
strongly about that, though.

> > [...]
> > > > +static void wake_oom_reaper(struct mm_struct *mm)
> > > > +{
> > > > +	struct mm_struct *old_mm;
> > > > +
> > > > +	if (!oom_reaper_th)
> > > > +		return;
> > > > +
> > > > +	/*
> > > > +	 * Pin the given mm. Use mm_count instead of mm_users because
> > > > +	 * we do not want to delay the address space tear down.
> > > > +	 */
> > > > +	atomic_inc(&mm->mm_count);
> > > > +
> > > > +	/*
> > > > +	 * Make sure that only a single mm is ever queued for the reaper
> > > > +	 * because multiple are not necessary and the operation might be
> > > > +	 * disruptive so better reduce it to the bare minimum.
> > > > +	 */
> > > > +	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
> > > > +	if (!old_mm)
> > > > +		wake_up(&oom_reaper_wait);
> > > > +	else
> > > > +		mmdrop(mm);
> > > 
> > > This behavior is probably the only really significant concern I have about 
> > > the patch: we just drop the mm and don't try any reaping if there is 
> > > already reaping in progress.
> > 
> > This is based on the assumption that OOM killer will not select another
> > task to kill until the previous one drops its TIF_MEMDIE. Should this
> > change in the future we will have to come up with a queuing mechanism. I
> > didn't want to do it right away to make the change as simple as
> > possible.
> > 
> 
> The problem is that this is racy and quite easy to trigger: imagine if 
> __oom_reap_vmas() finds mm->mm_users == 0, because the memory of the 
> victim has been freed, and then another system-wide oom condition occurs 
> before the oom reaper's mm_to_reap has been set to NULL.

Yes I realize this is potentially racy. I just didn't consider the race
important enough to justify task queuing in the first submission. Tetsuo
was pushing for this already and I tried to push back for simplicity in
the first submission. But ohh well... I will queue up a patch to do this
on top. I plan to repost the full patchset shortly.

> No synchronization prevents that from happening (not sure what the
> reference to TIF_MEMDIE is about).

Now that I am reading my response again I see how it could be
misleading. I was referring to the possibility of choosing multiple oom
victims, which was discussed recently. I didn't mean that TIF_MEMDIE
provides oom reaper vs. exit exclusion.

> In this case, the oom reaper has ignored the next victim and doesn't do 
> anything; the simple race has prevented it from zapping memory and does 
> not reduce the livelock probability.
> 
> This can be solved either by queueing mm's to reap or involving the oom 
> reaper into the oom killer synchronization itself.

as we have already discussed previously oom reaper is really tricky to
be called from the direct OOM context. I will go with queuing. 
 
> > > > +static int __init oom_init(void)
> > > > +{
> > > > +	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> > > > +	if (IS_ERR(oom_reaper_th)) {
> > > > +		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
> > > > +				PTR_ERR(oom_reaper_th));
> > > > +		oom_reaper_th = NULL;
> > > > +	} else {
> > > > +		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
> > > > +
> > > > +		/*
> > > > +		 * Make sure our oom reaper thread will get scheduled when
> > > > +		 * ASAP and that it won't get preempted by malicious userspace.
> > > > +		 */
> > > > +		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
> > > 
> > > Eeek, do you really show this is necessary?  I would imagine that we would 
> > > want to limit high priority processes system-wide and that we wouldn't 
> > > want to be interferred with by memcg oom conditions that trigger the oom 
> > > reaper, for example.
> > 
> > The idea was that we do not want to allow a high priority userspace to
> > preempt this important operation. I do understand your concern about the
> > memcg oom interference but I find it more important that oom_reaper is
> > runnable when needed. I guess that memcg oom heavy loads can change the
> > priority from userspace if necessary?
> > 
> 
> I'm baffled by any reference to "memcg oom heavy loads", I don't 
> understand this paragraph, sorry.  If a memcg is oom, we shouldn't be
> disrupting the global runqueue by running oom_reaper at a high priority.  
> The disruption itself is not only in first wakeup but also in how long the 
> reaper can run and when it is rescheduled: for a lot of memory this is 
> potentially long.  The reaper is best-effort, as the changelog indicates, 
> and we shouldn't have a reliance on this high priority: oom kill exiting 
> can't possibly be expected to be immediate.  This high priority should be 
> removed so memcg oom conditions are isolated and don't affect other loads.

If this is a concern then I would be tempted to simply disable oom
reaper for memcg oom altogether. For me it is much more important that
the reaper, even though a best effort, is guaranteed to schedule if
something goes terribly wrong on the machine.

Is this acceptable?

Thanks
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
@ 2016-02-02  8:57           ` Michal Hocko
  0 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2016-02-02  8:57 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Mon 01-02-16 19:02:06, David Rientjes wrote:
> On Thu, 28 Jan 2016, Michal Hocko wrote:
> 
> > [...]
> > > > +static bool __oom_reap_vmas(struct mm_struct *mm)
> > > > +{
> > > > +	struct mmu_gather tlb;
> > > > +	struct vm_area_struct *vma;
> > > > +	struct zap_details details = {.check_swap_entries = true,
> > > > +				      .ignore_dirty = true};
> > > > +	bool ret = true;
> > > > +
> > > > +	/* We might have raced with exit path */
> > > > +	if (!atomic_inc_not_zero(&mm->mm_users))
> > > > +		return true;
> > > > +
> > > > +	if (!down_read_trylock(&mm->mmap_sem)) {
> > > > +		ret = false;
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	tlb_gather_mmu(&tlb, mm, 0, -1);
> > > > +	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > > > +		if (is_vm_hugetlb_page(vma))
> > > > +			continue;
> > > > +
> > > > +		/*
> > > > +		 * mlocked VMAs require explicit munlocking before unmap.
> > > > +		 * Let's keep it simple here and skip such VMAs.
> > > > +		 */
> > > > +		if (vma->vm_flags & VM_LOCKED)
> > > > +			continue;
> > > 
> > > Shouldn't there be VM_PFNMAP handling here?
> > 
> > What would be the reason to exclude them?
> > 
> 
> Not exclude them, but I would have expected untrack_pfn().

My understanding is that vm_normal_page will do the right thing for
those mappings - especially for CoW VM_PFNMAP which are normal pages
AFAIU. Wrt. to untrack_pfn I was relying that the victim will eventually
enter exit_mmap and do the remaining house keepining. Maybe I am missing
something but untrack_pfn shouldn't lead to releasing a considerable
amount of memory. So is this really necessary or we can wait for
exit_mmap?

> > > I'm wondering why zap_page_range() for vma->vm_start to vma->vm_end wasn't 
> > > used here for simplicity?
> > 
> > I didn't use zap_page_range because I wanted to have a full control over
> > what and how gets torn down. E.g. it is much more easier to skip over
> > hugetlb pages than relying on i_mmap_lock_write which might be blocked
> > and the whole oom_reaper will get stuck.
> > 
> 
> Let me be clear that I think the implementation is fine, minus the missing 
> handling for VM_PFNMAP.  However, I think this implementation is better 
> placed into mm/memory.c to do the iteration, selection criteria, and then 
> unmap_page_range().  I don't think we should be exposing 
> unmap_page_range() globally, but rather add a new function to do the 
> iteration in mm/memory.c with the others.

I do not have any objections to moving the code but I felt this is a
single purpose thingy which doesn't need a wider exposure. The exclusion
criteria is tightly coupled to what oom reaper is allowed to do. In
other words such a function wouldn't be reusable for say MADV_DONTNEED
because it has different criteria. Having all the selection criteria
close to __oom_reap_task on the other hand makes it easier to evaluate
their relevance. So I am not really convinced. I can move it if you feel
strongly about that, though.

> > [...]
> > > > +static void wake_oom_reaper(struct mm_struct *mm)
> > > > +{
> > > > +	struct mm_struct *old_mm;
> > > > +
> > > > +	if (!oom_reaper_th)
> > > > +		return;
> > > > +
> > > > +	/*
> > > > +	 * Pin the given mm. Use mm_count instead of mm_users because
> > > > +	 * we do not want to delay the address space tear down.
> > > > +	 */
> > > > +	atomic_inc(&mm->mm_count);
> > > > +
> > > > +	/*
> > > > +	 * Make sure that only a single mm is ever queued for the reaper
> > > > +	 * because multiple are not necessary and the operation might be
> > > > +	 * disruptive so better reduce it to the bare minimum.
> > > > +	 */
> > > > +	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
> > > > +	if (!old_mm)
> > > > +		wake_up(&oom_reaper_wait);
> > > > +	else
> > > > +		mmdrop(mm);
> > > 
> > > This behavior is probably the only really significant concern I have about 
> > > the patch: we just drop the mm and don't try any reaping if there is 
> > > already reaping in progress.
> > 
> > This is based on the assumption that OOM killer will not select another
> > task to kill until the previous one drops its TIF_MEMDIE. Should this
> > change in the future we will have to come up with a queuing mechanism. I
> > didn't want to do it right away to make the change as simple as
> > possible.
> > 
> 
> The problem is that this is racy and quite easy to trigger: imagine if 
> __oom_reap_vmas() finds mm->mm_users == 0, because the memory of the 
> victim has been freed, and then another system-wide oom condition occurs 
> before the oom reaper's mm_to_reap has been set to NULL.

Yes, I realize this is potentially racy. I just didn't consider the race
important enough to justify task queuing in the first submission. Tetsuo
was already pushing for this and I tried to push back for simplicity.
But oh well... I will queue up a patch to do this on top. I plan to
repost the full patchset shortly.

> No synchronization prevents that from happening (not sure what the
> reference to TIF_MEMDIE is about).

Now that I am reading my response again I see how it could be
misleading. I was referring to the possibility of choosing multiple oom
victims, which was discussed recently. I didn't mean that TIF_MEMDIE
provides any exclusion between the oom reaper and the exit path.

> In this case, the oom reaper has ignored the next victim and doesn't do 
> anything; the simple race has prevented it from zapping memory and does 
> not reduce the livelock probability.
> 
> This can be solved either by queueing mm's to reap or involving the oom 
> reaper into the oom killer synchronization itself.

As we have already discussed previously, the oom reaper is really tricky
to call from the direct OOM context. I will go with queuing.
 
> > > > +static int __init oom_init(void)
> > > > +{
> > > > +	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> > > > +	if (IS_ERR(oom_reaper_th)) {
> > > > +		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
> > > > +				PTR_ERR(oom_reaper_th));
> > > > +		oom_reaper_th = NULL;
> > > > +	} else {
> > > > +		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
> > > > +
> > > > +		/*
> > > > +		 * Make sure our oom reaper thread will get scheduled when
> > > > +		 * ASAP and that it won't get preempted by malicious userspace.
> > > > +		 */
> > > > +		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
> > > 
> > > Eeek, do you really show this is necessary?  I would imagine that we would 
> > > want to limit high priority processes system-wide and that we wouldn't 
> > > want to be interferred with by memcg oom conditions that trigger the oom 
> > > reaper, for example.
> > 
> > The idea was that we do not want to allow a high priority userspace to
> > preempt this important operation. I do understand your concern about the
> > memcg oom interference but I find it more important that oom_reaper is
> > runnable when needed. I guess that memcg oom heavy loads can change the
> > priority from userspace if necessary?
> > 
> 
> I'm baffled by any reference to "memcg oom heavy loads", I don't 
> understand this paragraph, sorry.  If a memcg is oom, we shouldn't be
> disrupting the global runqueue by running oom_reaper at a high priority.  
> The disruption itself is not only in first wakeup but also in how long the 
> reaper can run and when it is rescheduled: for a lot of memory this is 
> potentially long.  The reaper is best-effort, as the changelog indicates, 
> and we shouldn't have a reliance on this high priority: oom kill exiting 
> can't possibly be expected to be immediate.  This high priority should be 
> removed so memcg oom conditions are isolated and don't affect other loads.

If this is a concern then I would be tempted to simply disable the oom
reaper for memcg OOM altogether. For me it is much more important that
the reaper, even though best effort, is guaranteed to schedule if
something goes terribly wrong on the machine.

Is this acceptable?

Thanks
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-01-28 21:42       ` Michal Hocko
@ 2016-02-02  3:02         ` David Rientjes
  -1 siblings, 0 replies; 78+ messages in thread
From: David Rientjes @ 2016-02-02  3:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Thu, 28 Jan 2016, Michal Hocko wrote:

> [...]
> > > +static bool __oom_reap_vmas(struct mm_struct *mm)
> > > +{
> > > +	struct mmu_gather tlb;
> > > +	struct vm_area_struct *vma;
> > > +	struct zap_details details = {.check_swap_entries = true,
> > > +				      .ignore_dirty = true};
> > > +	bool ret = true;
> > > +
> > > +	/* We might have raced with exit path */
> > > +	if (!atomic_inc_not_zero(&mm->mm_users))
> > > +		return true;
> > > +
> > > +	if (!down_read_trylock(&mm->mmap_sem)) {
> > > +		ret = false;
> > > +		goto out;
> > > +	}
> > > +
> > > +	tlb_gather_mmu(&tlb, mm, 0, -1);
> > > +	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > > +		if (is_vm_hugetlb_page(vma))
> > > +			continue;
> > > +
> > > +		/*
> > > +		 * mlocked VMAs require explicit munlocking before unmap.
> > > +		 * Let's keep it simple here and skip such VMAs.
> > > +		 */
> > > +		if (vma->vm_flags & VM_LOCKED)
> > > +			continue;
> > 
> > Shouldn't there be VM_PFNMAP handling here?
> 
> What would be the reason to exclude them?
> 

Not exclude them, but I would have expected untrack_pfn().

> > I'm wondering why zap_page_range() for vma->vm_start to vma->vm_end wasn't 
> > used here for simplicity?
> 
> I didn't use zap_page_range because I wanted to have a full control over
> what and how gets torn down. E.g. it is much more easier to skip over
> hugetlb pages than relying on i_mmap_lock_write which might be blocked
> and the whole oom_reaper will get stuck.
> 

Let me be clear that I think the implementation is fine, minus the missing 
handling for VM_PFNMAP.  However, I think this implementation is better 
placed into mm/memory.c to do the iteration, selection criteria, and then 
unmap_page_range().  I don't think we should be exposing 
unmap_page_range() globally, but rather add a new function to do the 
iteration in mm/memory.c with the others.

> [...]
> > > +static void wake_oom_reaper(struct mm_struct *mm)
> > > +{
> > > +	struct mm_struct *old_mm;
> > > +
> > > +	if (!oom_reaper_th)
> > > +		return;
> > > +
> > > +	/*
> > > +	 * Pin the given mm. Use mm_count instead of mm_users because
> > > +	 * we do not want to delay the address space tear down.
> > > +	 */
> > > +	atomic_inc(&mm->mm_count);
> > > +
> > > +	/*
> > > +	 * Make sure that only a single mm is ever queued for the reaper
> > > +	 * because multiple are not necessary and the operation might be
> > > +	 * disruptive so better reduce it to the bare minimum.
> > > +	 */
> > > +	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
> > > +	if (!old_mm)
> > > +		wake_up(&oom_reaper_wait);
> > > +	else
> > > +		mmdrop(mm);
> > 
> > This behavior is probably the only really significant concern I have about 
> > the patch: we just drop the mm and don't try any reaping if there is 
> > already reaping in progress.
> 
> This is based on the assumption that OOM killer will not select another
> task to kill until the previous one drops its TIF_MEMDIE. Should this
> change in the future we will have to come up with a queuing mechanism. I
> didn't want to do it right away to make the change as simple as
> possible.
> 

The problem is that this is racy and quite easy to trigger: imagine if 
__oom_reap_vmas() finds mm->mm_users == 0, because the memory of the 
victim has been freed, and then another system-wide oom condition occurs 
before the oom reaper's mm_to_reap has been set to NULL.  No 
synchronization prevents that from happening (not sure what the reference 
to TIF_MEMDIE is about).

In this case, the oom reaper has ignored the next victim and doesn't do 
anything; the simple race has prevented it from zapping memory and does 
not reduce the livelock probability.

This can be solved either by queueing mm's to reap or involving the oom 
reaper into the oom killer synchronization itself.

> > > +static int __init oom_init(void)
> > > +{
> > > +	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> > > +	if (IS_ERR(oom_reaper_th)) {
> > > +		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
> > > +				PTR_ERR(oom_reaper_th));
> > > +		oom_reaper_th = NULL;
> > > +	} else {
> > > +		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
> > > +
> > > +		/*
> > > +		 * Make sure our oom reaper thread will get scheduled when
> > > +		 * ASAP and that it won't get preempted by malicious userspace.
> > > +		 */
> > > +		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
> > 
> > Eeek, do you really show this is necessary?  I would imagine that we would 
> > want to limit high priority processes system-wide and that we wouldn't 
> > want to be interferred with by memcg oom conditions that trigger the oom 
> > reaper, for example.
> 
> The idea was that we do not want to allow a high priority userspace to
> preempt this important operation. I do understand your concern about the
> memcg oom interference but I find it more important that oom_reaper is
> runnable when needed. I guess that memcg oom heavy loads can change the
> priority from userspace if necessary?
> 

I'm baffled by any reference to "memcg oom heavy loads", I don't 
understand this paragraph, sorry.  If a memcg is oom, we shouldn't be
disrupting the global runqueue by running oom_reaper at a high priority.  
The disruption itself is not only in first wakeup but also in how long the 
reaper can run and when it is rescheduled: for a lot of memory this is 
potentially long.  The reaper is best-effort, as the changelog indicates, 
and we shouldn't have a reliance on this high priority: oom kill exiting 
can't possibly be expected to be immediate.  This high priority should be 
removed so memcg oom conditions are isolated and don't affect other loads.

"Memcg oom heavy loads" cannot always be determined and the suggested fix 
cannot possibly be to adjust the priority of a global resource.  ??

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-01-28  1:28     ` David Rientjes
@ 2016-01-28 21:42       ` Michal Hocko
  -1 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2016-01-28 21:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Wed 27-01-16 17:28:10, David Rientjes wrote:
> On Wed, 6 Jan 2016, Michal Hocko wrote:
> 
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> > independently brought up by Oleg Nesterov.
> > 
> 
> Suggested-bys?

Sure, why not.
 
> > The OOM killer currently allows to kill only a single task in a good
> > hope that the task will terminate in a reasonable time and frees up its
> > memory.  Such a task (oom victim) will get an access to memory reserves
> > via mark_oom_victim to allow a forward progress should there be a need
> > for additional memory during exit path.
> > 
> > It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
> > construct workloads which break the core assumption mentioned above and
> > the OOM victim might take unbounded amount of time to exit because it
> > might be blocked in the uninterruptible state waiting for on an event
> > (e.g. lock) which is blocked by another task looping in the page
> > allocator.
> > 
> 
> s/for on/for/

fixed
 
> I think it would be good to note in either of the two paragraphs above 
> that each victim is per-memcg hierarchy or system-wide and the oom reaper 
> is used for memcg oom conditions as well.  Otherwise, there's no mention 
> of the memcg usecase.

I didn't mention the memcg usecase because it doesn't suffer from the
deadlock issue: there the OOM killer is invoked from a lockless context.
I think mentioning it would just make the wording more confusing.

[...]
> > +static bool __oom_reap_vmas(struct mm_struct *mm)
> > +{
> > +	struct mmu_gather tlb;
> > +	struct vm_area_struct *vma;
> > +	struct zap_details details = {.check_swap_entries = true,
> > +				      .ignore_dirty = true};
> > +	bool ret = true;
> > +
> > +	/* We might have raced with exit path */
> > +	if (!atomic_inc_not_zero(&mm->mm_users))
> > +		return true;
> > +
> > +	if (!down_read_trylock(&mm->mmap_sem)) {
> > +		ret = false;
> > +		goto out;
> > +	}
> > +
> > +	tlb_gather_mmu(&tlb, mm, 0, -1);
> > +	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > +		if (is_vm_hugetlb_page(vma))
> > +			continue;
> > +
> > +		/*
> > +		 * mlocked VMAs require explicit munlocking before unmap.
> > +		 * Let's keep it simple here and skip such VMAs.
> > +		 */
> > +		if (vma->vm_flags & VM_LOCKED)
> > +			continue;
> 
> Shouldn't there be VM_PFNMAP handling here?

What would be the reason to exclude them?

> I'm wondering why zap_page_range() for vma->vm_start to vma->vm_end wasn't 
> used here for simplicity?

I didn't use zap_page_range because I wanted to have full control over
what gets torn down and how. E.g. it is much easier to skip over
hugetlb pages than to rely on i_mmap_lock_write, which might be blocked
so that the whole oom_reaper would get stuck.

[...]
> > +static void wake_oom_reaper(struct mm_struct *mm)
> > +{
> > +	struct mm_struct *old_mm;
> > +
> > +	if (!oom_reaper_th)
> > +		return;
> > +
> > +	/*
> > +	 * Pin the given mm. Use mm_count instead of mm_users because
> > +	 * we do not want to delay the address space tear down.
> > +	 */
> > +	atomic_inc(&mm->mm_count);
> > +
> > +	/*
> > +	 * Make sure that only a single mm is ever queued for the reaper
> > +	 * because multiple are not necessary and the operation might be
> > +	 * disruptive so better reduce it to the bare minimum.
> > +	 */
> > +	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
> > +	if (!old_mm)
> > +		wake_up(&oom_reaper_wait);
> > +	else
> > +		mmdrop(mm);
> 
> This behavior is probably the only really significant concern I have about 
> the patch: we just drop the mm and don't try any reaping if there is 
> already reaping in progress.

This is based on the assumption that OOM killer will not select another
task to kill until the previous one drops its TIF_MEMDIE. Should this
change in the future we will have to come up with a queuing mechanism. I
didn't want to do it right away to make the change as simple as
possible.

> We don't always have control over the amount of memory that can be reaped 
> from the victim, either because of oom kill prioritization through 
> /proc/pid/oom_score_adj or because the memory of the victim is not 
> eligible.
> 
> I'm imagining a scenario where the oom reaper has raced with a follow-up 
> oom kill before mm_to_reap has been set to NULL so there's no subsequent 
> reaping.  It's also possible that oom reaping of the first victim actually 
> freed little memory.
> 
> Would it really be difficult to queue mm's to reap from?  If memory has 
> already been freed before the reaper can get to it, the 
> find_lock_task_mm() should just fail and we're done.  I'm not sure why 
> this is being limited to a single mm system-wide.

It is not that complicated, but I believe we can implement it on top
once we see it is really needed. So unless this is a strong requirement
I would rather go with the simpler way.

> > +}
> > +
> > +static int __init oom_init(void)
> > +{
> > +	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> > +	if (IS_ERR(oom_reaper_th)) {
> > +		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
> > +				PTR_ERR(oom_reaper_th));
> > +		oom_reaper_th = NULL;
> > +	} else {
> > +		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
> > +
> > +		/*
> > +		 * Make sure our oom reaper thread will get scheduled when
> > +		 * ASAP and that it won't get preempted by malicious userspace.
> > +		 */
> > +		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
> 
> Eeek, do you really show this is necessary?  I would imagine that we would 
> want to limit high priority processes system-wide and that we wouldn't 
> want to be interferred with by memcg oom conditions that trigger the oom 
> reaper, for example.

The idea was that we do not want to allow a high priority userspace to
preempt this important operation. I do understand your concern about the
memcg oom interference but I find it more important that oom_reaper is
runnable when needed. I guess that memcg oom heavy loads can change the
priority from userspace if necessary?

[...]
> > @@ -607,17 +748,25 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  			continue;
> >  		if (same_thread_group(p, victim))
> >  			continue;
> > -		if (unlikely(p->flags & PF_KTHREAD))
> > -			continue;
> >  		if (is_global_init(p))
> >  			continue;
> > -		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> > +		if (unlikely(p->flags & PF_KTHREAD) ||
> > +		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> > +			/*
> > +			 * We cannot use oom_reaper for the mm shared by this
> > +			 * process because it wouldn't get killed and so the
> > +			 * memory might be still used.
> > +			 */
> > +			can_oom_reap = false;
> >  			continue;
> > -
> > +		}
> >  		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> 
> Is it possible to just do wake_oom_reaper(mm) here and eliminate 
> can_oom_reap with a little bit of moving around?

I am not sure what you mean. We have to check all processes before
we can tell that reaping is safe. Care to elaborate some more? I am all
for making the code easier to follow and understand.

> 
> >  	}
> >  	rcu_read_unlock();
> >  
> > +	if (can_oom_reap)
> > +		wake_oom_reaper(mm);
> > +
> >  	mmdrop(mm);
> >  	put_task_struct(victim);
> >  }

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
@ 2016-01-28 21:42       ` Michal Hocko
  0 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2016-01-28 21:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Wed 27-01-16 17:28:10, David Rientjes wrote:
> On Wed, 6 Jan 2016, Michal Hocko wrote:
> 
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> > independently brought up by Oleg Nesterov.
> > 
> 
> Suggested-bys?

Sure, why not.
 
> > The OOM killer currently allows to kill only a single task in a good
> > hope that the task will terminate in a reasonable time and frees up its
> > memory.  Such a task (oom victim) will get an access to memory reserves
> > via mark_oom_victim to allow a forward progress should there be a need
> > for additional memory during exit path.
> > 
> > It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
> > construct workloads which break the core assumption mentioned above and
> > the OOM victim might take unbounded amount of time to exit because it
> > might be blocked in the uninterruptible state waiting for on an event
> > (e.g. lock) which is blocked by another task looping in the page
> > allocator.
> > 
> 
> s/for on/for/

fixed
 
> I think it would be good to note in either of the two paragraphs above 
> that each victim is per-memcg hierarchy or system-wide and the oom reaper 
> is used for memcg oom conditions as well.  Otherwise, there's no mention 
> of the memcg usecase.

I didn't mention memcg usecase because that doesn't suffer from the
deadlock issue because the OOM is invoked from the lockless context. I
think this would just make the wording more confusing.

[...]
> > +static bool __oom_reap_vmas(struct mm_struct *mm)
> > +{
> > +	struct mmu_gather tlb;
> > +	struct vm_area_struct *vma;
> > +	struct zap_details details = {.check_swap_entries = true,
> > +				      .ignore_dirty = true};
> > +	bool ret = true;
> > +
> > +	/* We might have raced with exit path */
> > +	if (!atomic_inc_not_zero(&mm->mm_users))
> > +		return true;
> > +
> > +	if (!down_read_trylock(&mm->mmap_sem)) {
> > +		ret = false;
> > +		goto out;
> > +	}
> > +
> > +	tlb_gather_mmu(&tlb, mm, 0, -1);
> > +	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > +		if (is_vm_hugetlb_page(vma))
> > +			continue;
> > +
> > +		/*
> > +		 * mlocked VMAs require explicit munlocking before unmap.
> > +		 * Let's keep it simple here and skip such VMAs.
> > +		 */
> > +		if (vma->vm_flags & VM_LOCKED)
> > +			continue;
> 
> Shouldn't there be VM_PFNMAP handling here?

What would be the reason to exclude them?

> I'm wondering why zap_page_range() for vma->vm_start to vma->vm_end wasn't 
> used here for simplicity?

I didn't use zap_page_range because I wanted to have a full control over
what and how gets torn down. E.g. it is much more easier to skip over
hugetlb pages than relying on i_mmap_lock_write which might be blocked
and the whole oom_reaper will get stuck.

[...]
> > +static void wake_oom_reaper(struct mm_struct *mm)
> > +{
> > +	struct mm_struct *old_mm;
> > +
> > +	if (!oom_reaper_th)
> > +		return;
> > +
> > +	/*
> > +	 * Pin the given mm. Use mm_count instead of mm_users because
> > +	 * we do not want to delay the address space tear down.
> > +	 */
> > +	atomic_inc(&mm->mm_count);
> > +
> > +	/*
> > +	 * Make sure that only a single mm is ever queued for the reaper
> > +	 * because multiple are not necessary and the operation might be
> > +	 * disruptive so better reduce it to the bare minimum.
> > +	 */
> > +	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
> > +	if (!old_mm)
> > +		wake_up(&oom_reaper_wait);
> > +	else
> > +		mmdrop(mm);
> 
> This behavior is probably the only really significant concern I have about 
> the patch: we just drop the mm and don't try any reaping if there is 
> already reaping in progress.

This is based on the assumption that OOM killer will not select another
task to kill until the previous one drops its TIF_MEMDIE. Should this
change in the future we will have to come up with a queuing mechanism. I
didn't want to do it right away to make the change as simple as
possible.

> We don't always have control over the amount of memory that can be reaped 
> from the victim, either because of oom kill prioritization through 
> /proc/pid/oom_score_adj or because the memory of the victim is not 
> eligible.
> 
> I'm imagining a scenario where the oom reaper has raced with a follow-up 
> oom kill before mm_to_reap has been set to NULL so there's no subsequent 
> reaping.  It's also possible that oom reaping of the first victim actually 
> freed little memory.
> 
> Would it really be difficult to queue mm's to reap from?  If memory has 
> already been freed before the reaper can get to it, the 
> find_lock_task_mm() should just fail and we're done.  I'm not sure why 
> this is being limited to a single mm system-wide.

It is not that complicated but I believe we can implement it on top once
we see this is really needed. So unless this is a strong requirement I
would rather go with a simpler way.

> > +}
> > +
> > +static int __init oom_init(void)
> > +{
> > +	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> > +	if (IS_ERR(oom_reaper_th)) {
> > +		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
> > +				PTR_ERR(oom_reaper_th));
> > +		oom_reaper_th = NULL;
> > +	} else {
> > +		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
> > +
> > +		/*
> > +		 * Make sure our oom reaper thread will get scheduled when
> > +		 * ASAP and that it won't get preempted by malicious userspace.
> > +		 */
> > +		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
> 
> Eeek, do you really show this is necessary?  I would imagine that we would 
> want to limit high priority processes system-wide and that we wouldn't 
> want to be interferred with by memcg oom conditions that trigger the oom 
> reaper, for example.

The idea was that we do not want to allow a high priority userspace to
preempt this important operation. I do understand your concern about the
memcg oom interference but I find it more important that oom_reaper is
runnable when needed. I guess that memcg oom heavy loads can change the
priority from userspace if necessary?

[...]
> > @@ -607,17 +748,25 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  			continue;
> >  		if (same_thread_group(p, victim))
> >  			continue;
> > -		if (unlikely(p->flags & PF_KTHREAD))
> > -			continue;
> >  		if (is_global_init(p))
> >  			continue;
> > -		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> > +		if (unlikely(p->flags & PF_KTHREAD) ||
> > +		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> > +			/*
> > +			 * We cannot use oom_reaper for the mm shared by this
> > +			 * process because it wouldn't get killed and so the
> > +			 * memory might be still used.
> > +			 */
> > +			can_oom_reap = false;
> >  			continue;
> > -
> > +		}
> >  		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> 
> Is it possible to just do wake_oom_reaper(mm) here and eliminate 
> can_oom_reap with a little bit of moving around?

I am not sure how do you mean it. We have to check all processes before
we can tell that reaping is safe. Care to elaborate some more? I am all
for making the code easier to follow and understand.

> 
> >  	}
> >  	rcu_read_unlock();
> >  
> > +	if (can_oom_reap)
> > +		wake_oom_reaper(mm);
> > +
> >  	mmdrop(mm);
> >  	put_task_struct(victim);
> >  }

Thanks!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-01-06 15:42   ` Michal Hocko
@ 2016-01-28  1:28     ` David Rientjes
  -1 siblings, 0 replies; 78+ messages in thread
From: David Rientjes @ 2016-01-28  1:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

On Wed, 6 Jan 2016, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> independently brought up by Oleg Nesterov.
> 

Suggested-bys?

> The OOM killer currently allows to kill only a single task in a good
> hope that the task will terminate in a reasonable time and frees up its
> memory.  Such a task (oom victim) will get an access to memory reserves
> via mark_oom_victim to allow a forward progress should there be a need
> for additional memory during exit path.
> 
> It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
> construct workloads which break the core assumption mentioned above and
> the OOM victim might take unbounded amount of time to exit because it
> might be blocked in the uninterruptible state waiting for on an event
> (e.g. lock) which is blocked by another task looping in the page
> allocator.
> 

s/for on/for/

I think it would be good to note in either of the two paragraphs above 
that each victim is per-memcg hierarchy or system-wide and the oom reaper 
is used for memcg oom conditions as well.  Otherwise, there's no mention 
of the memcg usecase.

> This patch reduces the probability of such a lockup by introducing a
> specialized kernel thread (oom_reaper) which tries to reclaim additional
> memory by preemptively reaping the anonymous or swapped out memory
> owned by the oom victim under an assumption that such a memory won't
> be needed when its owner is killed and kicked from the userspace anyway.
> There is one notable exception to this, though, if the OOM victim was
> in the process of coredumping the result would be incomplete. This is
> considered a reasonable constraint because the overall system health is
> more important than debuggability of a particular application.
> 
> A kernel thread has been chosen because we need a reliable way of
> invocation so workqueue context is not appropriate because all the
> workers might be busy (e.g. allocating memory). Kswapd which sounds
> like another good fit is not appropriate as well because it might get
> blocked on locks during reclaim as well.
> 

Very good points.  And I think this makes the case clear that oom_reaper 
is really a best-effort solution.

> oom_reaper has to take mmap_sem on the target task for reading so the
> solution is not 100% because the semaphore might be held or blocked for
> write but the probability is reduced considerably wrt. basically any
> lock blocking forward progress as described above. In order to prevent
> from blocking on the lock without any forward progress we are using only
> a trylock and retry 10 times with a short sleep in between.
> Users of mmap_sem which need it for write should be carefully reviewed
> to use _killable waiting as much as possible and reduce allocations
> requests done with the lock held to absolute minimum to reduce the risk
> even further.
> 
> The API between oom killer and oom reaper is quite trivial. wake_oom_reaper
> updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition
> and oom_reaper clear this atomically once it is done with the work. This
> means that only a single mm_struct can be reaped at the time. As the
> operation is potentially disruptive we are trying to limit it to the
> necessary minimum and the reaper blocks any updates while it operates on
> an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a
> race is detected by atomic_inc_not_zero(mm_users).
> 
> Changes since v3
> - many style/compile fixups by Andrew
> - unmap_mapping_range_tree needs full initialization of zap_details
>   to prevent from missing unmaps and follow up BUG_ON during truncate
>   resp. misaccounting - Kirill/Andrew
> - exclude mlocked pages because they need an explicit munlock by Kirill
> - use subsys_initcall instead of module_init - Paul Gortmaker
> Changes since v2
> - fix mm_count reference leak reported by Tetsuo
> - make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
> - use wait_event_freezable rather than open coded wait loop - suggested
>   by Tetsuo
> Changes since v1
> - fix the screwed up detail->check_swap_entries - Johannes
> - do not use kthread_should_stop because that would need a cleanup
>   and we do not have anybody to stop us - Tetsuo
> - move wake_oom_reaper to oom_kill_process because we have to wait
>   for all tasks sharing the same mm to get killed - Tetsuo
> - do not reap mm structs which are shared with unkillable tasks - Tetsuo
> 
> Acked-by: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/mm.h |   2 +
>  mm/internal.h      |   5 ++
>  mm/memory.c        |  17 +++---
>  mm/oom_kill.c      | 157 +++++++++++++++++++++++++++++++++++++++++++++++++++--
>  4 files changed, 170 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 25cdec395f2c..d1ce03569942 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1061,6 +1061,8 @@ struct zap_details {
>  	struct address_space *check_mapping;	/* Check page->mapping if set */
>  	pgoff_t	first_index;			/* Lowest page->index to unmap */
>  	pgoff_t last_index;			/* Highest page->index to unmap */
> +	bool ignore_dirty;			/* Ignore dirty pages */
> +	bool check_swap_entries;		/* Check also swap entries */
>  };
>  
>  struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> diff --git a/mm/internal.h b/mm/internal.h
> index 4ae7b7c7462b..9006ce1960ff 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -41,6 +41,11 @@ extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>  		unsigned long floor, unsigned long ceiling);
>  
> +void unmap_page_range(struct mmu_gather *tlb,
> +			     struct vm_area_struct *vma,
> +			     unsigned long addr, unsigned long end,
> +			     struct zap_details *details);
> +
>  static inline void set_page_count(struct page *page, int v)
>  {
>  	atomic_set(&page->_count, v);
> diff --git a/mm/memory.c b/mm/memory.c
> index f5b8e8c9f4c3..f60c6d6aa633 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1104,6 +1104,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  
>  			if (!PageAnon(page)) {
>  				if (pte_dirty(ptent)) {
> +					/*
> +					 * oom_reaper cannot tear down dirty
> +					 * pages
> +					 */
> +					if (unlikely(details && details->ignore_dirty))
> +						continue;
>  					force_flush = 1;
>  					set_page_dirty(page);
>  				}
> @@ -1122,8 +1128,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  			}
>  			continue;
>  		}
> -		/* If details->check_mapping, we leave swap entries. */
> -		if (unlikely(details))
> +		/* only check swap_entries if explicitly asked for in details */
> +		if (unlikely(details && !details->check_swap_entries))
>  			continue;
>  
>  		entry = pte_to_swp_entry(ptent);
> @@ -1228,7 +1234,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
>  	return addr;
>  }
>  
> -static void unmap_page_range(struct mmu_gather *tlb,
> +void unmap_page_range(struct mmu_gather *tlb,
>  			     struct vm_area_struct *vma,
>  			     unsigned long addr, unsigned long end,
>  			     struct zap_details *details)
> @@ -1236,9 +1242,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
>  	pgd_t *pgd;
>  	unsigned long next;
>  
> -	if (details && !details->check_mapping)
> -		details = NULL;
> -
>  	BUG_ON(addr >= end);
>  	tlb_start_vma(tlb, vma);
>  	pgd = pgd_offset(vma->vm_mm, addr);
> @@ -2393,7 +2396,7 @@ static inline void unmap_mapping_range_tree(struct rb_root *root,
>  void unmap_mapping_range(struct address_space *mapping,
>  		loff_t const holebegin, loff_t const holelen, int even_cows)
>  {
> -	struct zap_details details;
> +	struct zap_details details = { };
>  	pgoff_t hba = holebegin >> PAGE_SHIFT;
>  	pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index dc490c06941b..1ece40b94725 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -35,6 +35,11 @@
>  #include <linux/freezer.h>
>  #include <linux/ftrace.h>
>  #include <linux/ratelimit.h>
> +#include <linux/kthread.h>
> +#include <linux/init.h>
> +
> +#include <asm/tlb.h>
> +#include "internal.h"
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/oom.h>
> @@ -408,6 +413,141 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
>  
>  bool oom_killer_disabled __read_mostly;
>  
> +#ifdef CONFIG_MMU
> +/*
> + * OOM Reaper kernel thread which tries to reap the memory used by the OOM
> + * victim (if that is possible) to help the OOM killer to move on.
> + */
> +static struct task_struct *oom_reaper_th;
> +static struct mm_struct *mm_to_reap;
> +static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
> +
> +static bool __oom_reap_vmas(struct mm_struct *mm)
> +{
> +	struct mmu_gather tlb;
> +	struct vm_area_struct *vma;
> +	struct zap_details details = {.check_swap_entries = true,
> +				      .ignore_dirty = true};
> +	bool ret = true;
> +
> +	/* We might have raced with exit path */
> +	if (!atomic_inc_not_zero(&mm->mm_users))
> +		return true;
> +
> +	if (!down_read_trylock(&mm->mmap_sem)) {
> +		ret = false;
> +		goto out;
> +	}
> +
> +	tlb_gather_mmu(&tlb, mm, 0, -1);
> +	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> +		if (is_vm_hugetlb_page(vma))
> +			continue;
> +
> +		/*
> +		 * mlocked VMAs require explicit munlocking before unmap.
> +		 * Let's keep it simple here and skip such VMAs.
> +		 */
> +		if (vma->vm_flags & VM_LOCKED)
> +			continue;

Shouldn't there be VM_PFNMAP handling here?

I'm wondering why zap_page_range() for vma->vm_start to vma->vm_end wasn't 
used here for simplicity?  It appears as though what you're doing is an 
MADV_DONTNEED over the length of all anonymous vmas that aren't shared, so 
why not have such an implementation in a single place so any changes don't 
have to be made in two different spots for things such as VM_PFNMAP?

> +
> +		/*
> +		 * Only anonymous pages have a good chance to be dropped
> +		 * without additional steps which we cannot afford as we
> +		 * are OOM already.
> +		 *
> +		 * We do not even care about fs backed pages because all
> +		 * which are reclaimable have already been reclaimed and
> +		 * we do not want to block exit_mmap by keeping mm ref
> +		 * count elevated without a good reason.
> +		 */
> +		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> +			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
> +					 &details);
> +	}
> +	tlb_finish_mmu(&tlb, 0, -1);
> +	up_read(&mm->mmap_sem);
> +out:
> +	mmput(mm);
> +	return ret;
> +}
> +
> +static void oom_reap_vmas(struct mm_struct *mm)
> +{
> +	int attempts = 0;
> +
> +	/* Retry the down_read_trylock(mmap_sem) a few times */
> +	while (attempts++ < 10 && !__oom_reap_vmas(mm))
> +		schedule_timeout_idle(HZ/10);
> +
> +	/* Drop a reference taken by wake_oom_reaper */
> +	mmdrop(mm);
> +}
> +
> +static int oom_reaper(void *unused)
> +{
> +	while (true) {
> +		struct mm_struct *mm;
> +
> +		wait_event_freezable(oom_reaper_wait,
> +				     (mm = READ_ONCE(mm_to_reap)));
> +		oom_reap_vmas(mm);
> +		WRITE_ONCE(mm_to_reap, NULL);
> +	}
> +
> +	return 0;
> +}
> +
> +static void wake_oom_reaper(struct mm_struct *mm)
> +{
> +	struct mm_struct *old_mm;
> +
> +	if (!oom_reaper_th)
> +		return;
> +
> +	/*
> +	 * Pin the given mm. Use mm_count instead of mm_users because
> +	 * we do not want to delay the address space tear down.
> +	 */
> +	atomic_inc(&mm->mm_count);
> +
> +	/*
> +	 * Make sure that only a single mm is ever queued for the reaper
> +	 * because multiple are not necessary and the operation might be
> +	 * disruptive so better reduce it to the bare minimum.
> +	 */
> +	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
> +	if (!old_mm)
> +		wake_up(&oom_reaper_wait);
> +	else
> +		mmdrop(mm);

This behavior is probably the only really significant concern I have about 
the patch: we just drop the mm and don't try any reaping if there is 
already reaping in progress.

We don't always have control over the amount of memory that can be reaped 
from the victim, either because of oom kill prioritization through 
/proc/pid/oom_score_adj or because the memory of the victim is not 
eligible.

I'm imagining a scenario where the oom reaper has raced with a follow-up 
oom kill before mm_to_reap has been set to NULL so there's no subsequent 
reaping.  It's also possible that oom reaping of the first victim actually 
freed little memory.

Would it really be difficult to queue mm's to reap from?  If memory has 
already been freed before the reaper can get to it, the 
find_lock_task_mm() should just fail and we're done.  I'm not sure why 
this is being limited to a single mm system-wide.

> +}
> +
> +static int __init oom_init(void)
> +{
> +	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> +	if (IS_ERR(oom_reaper_th)) {
> +		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
> +				PTR_ERR(oom_reaper_th));
> +		oom_reaper_th = NULL;
> +	} else {
> +		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
> +
> +		/*
> +		 * Make sure our oom reaper thread will get scheduled when
> +		 * ASAP and that it won't get preempted by malicious userspace.
> +		 */
> +		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);

Eeek, do you really show this is necessary?  I would imagine that we would 
want to limit high priority processes system-wide and that we wouldn't 
want to be interfered with by memcg oom conditions that trigger the oom 
reaper, for example.

> +	}
> +	return 0;
> +}
> +subsys_initcall(oom_init)
> +#else
> +static void wake_oom_reaper(struct mm_struct *mm)
> +{
> +}
> +#endif
> +
>  /**
>   * mark_oom_victim - mark the given task as OOM victim
>   * @tsk: task to mark
> @@ -517,6 +657,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	unsigned int victim_points = 0;
>  	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>  					      DEFAULT_RATELIMIT_BURST);
> +	bool can_oom_reap = true;
>  
>  	/*
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
> @@ -607,17 +748,25 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			continue;
>  		if (same_thread_group(p, victim))
>  			continue;
> -		if (unlikely(p->flags & PF_KTHREAD))
> -			continue;
>  		if (is_global_init(p))
>  			continue;
> -		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> +		if (unlikely(p->flags & PF_KTHREAD) ||
> +		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> +			/*
> +			 * We cannot use oom_reaper for the mm shared by this
> +			 * process because it wouldn't get killed and so the
> +			 * memory might be still used.
> +			 */
> +			can_oom_reap = false;
>  			continue;
> -
> +		}
>  		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);

Is it possible to just do wake_oom_reaper(mm) here and eliminate 
can_oom_reap with a little bit of moving around?

>  	}
>  	rcu_read_unlock();
>  
> +	if (can_oom_reap)
> +		wake_oom_reaper(mm);
> +
>  	mmdrop(mm);
>  	put_task_struct(victim);
>  }


* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-01-11 22:54     ` Andrew Morton
@ 2016-01-12  8:16       ` Michal Hocko
  0 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2016-01-12  8:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Tetsuo Handa, David Rientjes, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Mon 11-01-16 14:54:55, Andrew Morton wrote:
> On Wed,  6 Jan 2016 16:42:54 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > - use subsys_initcall instead of module_init - Paul Gortmaker
> 
> That's pretty much the only change between what-i-have and
> what-you-sent, so I'll just do this as a delta:

Yeah that should be the case, thanks for double checking!
 
> --- a/mm/oom_kill.c~mm-oom-introduce-oom-reaper-v4
> +++ a/mm/oom_kill.c
> @@ -32,12 +32,11 @@
>  #include <linux/mempolicy.h>
>  #include <linux/security.h>
>  #include <linux/ptrace.h>
> -#include <linux/delay.h>
>  #include <linux/freezer.h>
>  #include <linux/ftrace.h>
>  #include <linux/ratelimit.h>
>  #include <linux/kthread.h>
> -#include <linux/module.h>
> +#include <linux/init.h>
>  
>  #include <asm/tlb.h>
>  #include "internal.h"
> @@ -542,7 +541,7 @@ static int __init oom_init(void)
>  	}
>  	return 0;
>  }
> -module_init(oom_init)
> +subsys_initcall(oom_init)
>  #else
>  static void wake_oom_reaper(struct mm_struct *mm)
>  {
> _

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-01-06 15:42   ` Michal Hocko
@ 2016-01-11 22:54     ` Andrew Morton
  0 siblings, 0 replies; 78+ messages in thread
From: Andrew Morton @ 2016-01-11 22:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Tetsuo Handa, David Rientjes, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

On Wed,  6 Jan 2016 16:42:54 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> - use subsys_initcall instead of module_init - Paul Gortmaker

That's pretty much the only change between what-i-have and
what-you-sent, so I'll just do this as a delta:


--- a/mm/oom_kill.c~mm-oom-introduce-oom-reaper-v4
+++ a/mm/oom_kill.c
@@ -32,12 +32,11 @@
 #include <linux/mempolicy.h>
 #include <linux/security.h>
 #include <linux/ptrace.h>
-#include <linux/delay.h>
 #include <linux/freezer.h>
 #include <linux/ftrace.h>
 #include <linux/ratelimit.h>
 #include <linux/kthread.h>
-#include <linux/module.h>
+#include <linux/init.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -542,7 +541,7 @@ static int __init oom_init(void)
 	}
 	return 0;
 }
-module_init(oom_init)
+subsys_initcall(oom_init)
 #else
 static void wake_oom_reaper(struct mm_struct *mm)
 {
_

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-01-07 11:23     ` Tetsuo Handa
@ 2016-01-07 12:30       ` Michal Hocko
  0 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2016-01-07 12:30 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, mgorman, rientjes, torvalds, oleg, hughd, andrea, riel,
	linux-mm, linux-kernel

On Thu 07-01-16 20:23:04, Tetsuo Handa wrote:
[...]
> According to commit a2b829d95958da20 ("mm/oom_kill.c: avoid attempting
> to kill init sharing same memory"), the patch below is needed to avoid
> killing the init process with SIGSEGV.
> 
> ----------
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9548dce..9832f3f 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -784,9 +784,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>   			continue;
>   		if (same_thread_group(p, victim))
>   			continue;
> -		if (is_global_init(p))
> -			continue;
> -		if (unlikely(p->flags & PF_KTHREAD) ||
> +		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
>   		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
>   			/*
>   			 * We cannot use oom_reaper for the mm shared by this
[...]
> [    3.132836] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
> [    3.137232] [   98]     0    98   279607   244400     489       5        0             0 init
> [    3.141664] Out of memory: Kill process 98 (init) score 940 or sacrifice child
> [    3.145346] Killed process 98 (init) total-vm:1118428kB, anon-rss:977464kB, file-rss:136kB, shmem-rss:0kB
> [    3.416105] init[1]: segfault at 0 ip           (null) sp 00007ffd484cf5f0 error 14 in init[400000+1000]
> [    3.439074] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> [    3.439074]
> [    3.450193] Kernel Offset: disabled
> [    3.456259] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> [    3.456259]

Ouch. You are right. The reaper will tear down the shared mm and the
global init will blow up. Very well spotted! The system would blow up
later anyway, I would guess, because killing the victim wouldn't release
much memory, which stays pinned by the global init. So a panic sounds
inevitable. The scenario is really insane but what you are proposing is
correct.

Updated patch below.
--- 
>From 71c6f4135fe4a8d448d63d4904ba514787dea008 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 23 Nov 2015 18:20:57 +0100
Subject: [PATCH] mm, oom: introduce oom reaper

This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
independently brought up by Oleg Nesterov.

The OOM killer currently kills only a single task in the hope that the
task will terminate in a reasonable time and free up its memory.  Such
a task (the oom victim) gets access to memory reserves via
mark_oom_victim to allow forward progress should additional memory be
needed on the exit path.

It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above:
the OOM victim might take an unbounded amount of time to exit because
it might be blocked in the uninterruptible state waiting for an event
(e.g. a lock) held by another task looping in the page allocator.

This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped out memory
owned by the oom victim, under the assumption that such memory won't
be needed once its owner is killed and kicked out of userspace anyway.
There is one notable exception, though: if the OOM victim was in the
process of coredumping, the result would be incomplete. This is
considered a reasonable constraint because overall system health is
more important than debuggability of a particular application.

A kernel thread has been chosen because we need a reliable way of
invocation: a workqueue context is not appropriate because all the
workers might be busy (e.g. allocating memory), and kswapd, which
sounds like another good fit, is not appropriate either because it
might get blocked on locks during reclaim as well.

oom_reaper has to take mmap_sem on the target task for reading, so the
solution is not 100% reliable because the semaphore might be held or
blocked for write, but the probability is reduced considerably wrt.
basically any lock blocking forward progress as described above. In
order to prevent blocking on the lock without any forward progress we
use only a trylock and retry 10 times with a short sleep in between.
Users of mmap_sem which need it for write should be carefully reviewed
to use _killable waiting as much as possible and to reduce allocation
requests done with the lock held to the absolute minimum, reducing the
risk even further.

The API between the oom killer and the oom reaper is quite trivial:
wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only the
NULL->mm transition, and oom_reaper clears it atomically once it is
done with the work. This means that only a single mm_struct can be
reaped at a time. As the operation is potentially disruptive we try to
limit it to the necessary minimum, and the reaper blocks any updates
while it operates on an mm. The mm_struct is pinned by mm_count to
allow a parallel exit_mmap, and a race is detected by
atomic_inc_not_zero(mm_users).

Changes since v3
- many style/compile fixups by Andrew
- unmap_mapping_range_tree needs full initialization of zap_details
  to prevent from missing unmaps and follow up BUG_ON during truncate
  resp. misaccounting - Kirill/Andrew
- exclude mlocked pages because they need an explicit munlock by Kirill
- use subsys_initcall instead of module_init - Paul Gortmaker
- do not tear down mm if it is shared with the global init because this
  could lead to SEGV and panic - Tetsuo
Changes since v2
- fix mm_count reference leak reported by Tetsuo
- make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
- use wait_event_freezable rather than open coded wait loop - suggested
  by Tetsuo
Changes since v1
- fix the screwed up detail->check_swap_entries - Johannes
- do not use kthread_should_stop because that would need a cleanup
  and we do not have anybody to stop us - Tetsuo
- move wake_oom_reaper to oom_kill_process because we have to wait
  for all tasks sharing the same mm to get killed - Tetsuo
- do not reap mm structs which are shared with unkillable tasks - Tetsuo

Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h |   2 +
 mm/internal.h      |   5 ++
 mm/memory.c        |  17 +++---
 mm/oom_kill.c      | 159 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 170 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 25cdec395f2c..d1ce03569942 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1061,6 +1061,8 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
+	bool ignore_dirty;			/* Ignore dirty pages */
+	bool check_swap_entries;		/* Check also swap entries */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/internal.h b/mm/internal.h
index 4ae7b7c7462b..9006ce1960ff 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -41,6 +41,11 @@ extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
+void unmap_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end,
+			     struct zap_details *details);
+
 static inline void set_page_count(struct page *page, int v)
 {
 	atomic_set(&page->_count, v);
diff --git a/mm/memory.c b/mm/memory.c
index f5b8e8c9f4c3..f60c6d6aa633 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1104,6 +1104,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			if (!PageAnon(page)) {
 				if (pte_dirty(ptent)) {
+					/*
+					 * oom_reaper cannot tear down dirty
+					 * pages
+					 */
+					if (unlikely(details && details->ignore_dirty))
+						continue;
 					force_flush = 1;
 					set_page_dirty(page);
 				}
@@ -1122,8 +1128,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			}
 			continue;
 		}
-		/* If details->check_mapping, we leave swap entries. */
-		if (unlikely(details))
+		/* only check swap_entries if explicitly asked for in details */
+		if (unlikely(details && !details->check_swap_entries))
 			continue;
 
 		entry = pte_to_swp_entry(ptent);
@@ -1228,7 +1234,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 	return addr;
 }
 
-static void unmap_page_range(struct mmu_gather *tlb,
+void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
 			     struct zap_details *details)
@@ -1236,9 +1242,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
 	pgd_t *pgd;
 	unsigned long next;
 
-	if (details && !details->check_mapping)
-		details = NULL;
-
 	BUG_ON(addr >= end);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
@@ -2393,7 +2396,7 @@ static inline void unmap_mapping_range_tree(struct rb_root *root,
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows)
 {
-	struct zap_details details;
+	struct zap_details details = { };
 	pgoff_t hba = holebegin >> PAGE_SHIFT;
 	pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dc490c06941b..95ce1602744b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,6 +35,11 @@
 #include <linux/freezer.h>
 #include <linux/ftrace.h>
 #include <linux/ratelimit.h>
+#include <linux/kthread.h>
+#include <linux/init.h>
+
+#include <asm/tlb.h>
+#include "internal.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/oom.h>
@@ -408,6 +413,141 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
 bool oom_killer_disabled __read_mostly;
 
+#ifdef CONFIG_MMU
+/*
+ * OOM Reaper kernel thread which tries to reap the memory used by the OOM
+ * victim (if that is possible) to help the OOM killer to move on.
+ */
+static struct task_struct *oom_reaper_th;
+static struct mm_struct *mm_to_reap;
+static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
+
+static bool __oom_reap_vmas(struct mm_struct *mm)
+{
+	struct mmu_gather tlb;
+	struct vm_area_struct *vma;
+	struct zap_details details = {.check_swap_entries = true,
+				      .ignore_dirty = true};
+	bool ret = true;
+
+	/* We might have raced with exit path */
+	if (!atomic_inc_not_zero(&mm->mm_users))
+		return true;
+
+	if (!down_read_trylock(&mm->mmap_sem)) {
+		ret = false;
+		goto out;
+	}
+
+	tlb_gather_mmu(&tlb, mm, 0, -1);
+	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+		if (is_vm_hugetlb_page(vma))
+			continue;
+
+		/*
+		 * mlocked VMAs require explicit munlocking before unmap.
+		 * Let's keep it simple here and skip such VMAs.
+		 */
+		if (vma->vm_flags & VM_LOCKED)
+			continue;
+
+		/*
+		 * Only anonymous pages have a good chance to be dropped
+		 * without additional steps which we cannot afford as we
+		 * are OOM already.
+		 *
+		 * We do not even care about fs backed pages because all
+		 * which are reclaimable have already been reclaimed and
+		 * we do not want to block exit_mmap by keeping mm ref
+		 * count elevated without a good reason.
+		 */
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
+					 &details);
+	}
+	tlb_finish_mmu(&tlb, 0, -1);
+	up_read(&mm->mmap_sem);
+out:
+	mmput(mm);
+	return ret;
+}
+
+static void oom_reap_vmas(struct mm_struct *mm)
+{
+	int attempts = 0;
+
+	/* Retry the down_read_trylock(mmap_sem) a few times */
+	while (attempts++ < 10 && !__oom_reap_vmas(mm))
+		schedule_timeout_idle(HZ/10);
+
+	/* Drop a reference taken by wake_oom_reaper */
+	mmdrop(mm);
+}
+
+static int oom_reaper(void *unused)
+{
+	while (true) {
+		struct mm_struct *mm;
+
+		wait_event_freezable(oom_reaper_wait,
+				     (mm = READ_ONCE(mm_to_reap)));
+		oom_reap_vmas(mm);
+		WRITE_ONCE(mm_to_reap, NULL);
+	}
+
+	return 0;
+}
+
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+	struct mm_struct *old_mm;
+
+	if (!oom_reaper_th)
+		return;
+
+	/*
+	 * Pin the given mm. Use mm_count instead of mm_users because
+	 * we do not want to delay the address space tear down.
+	 */
+	atomic_inc(&mm->mm_count);
+
+	/*
+	 * Make sure that only a single mm is ever queued for the reaper
+	 * because multiple are not necessary and the operation might be
+	 * disruptive so better reduce it to the bare minimum.
+	 */
+	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
+	if (!old_mm)
+		wake_up(&oom_reaper_wait);
+	else
+		mmdrop(mm);
+}
+
+static int __init oom_init(void)
+{
+	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+	if (IS_ERR(oom_reaper_th)) {
+		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
+				PTR_ERR(oom_reaper_th));
+		oom_reaper_th = NULL;
+	} else {
+		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+
+		/*
+		 * Make sure our oom reaper thread will get scheduled
+		 * ASAP and that it won't get preempted by malicious userspace.
+		 */
+		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
+	}
+	return 0;
+}
+subsys_initcall(oom_init)
+#else
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+}
+#endif
+
 /**
  * mark_oom_victim - mark the given task as OOM victim
  * @tsk: task to mark
@@ -517,6 +657,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
+	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -607,17 +748,23 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD))
-			continue;
-		if (is_global_init(p))
-			continue;
-		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+			/*
+			 * We cannot use oom_reaper for the mm shared by this
+			 * process because it wouldn't get killed and so the
+			 * memory might still be used.
+			 */
+			can_oom_reap = false;
 			continue;
-
+		}
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
+	if (can_oom_reap)
+		wake_oom_reaper(mm);
+
 	mmdrop(mm);
 	put_task_struct(victim);
 }
-- 
2.6.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
  2016-01-06 15:42   ` Michal Hocko
@ 2016-01-07 11:23     ` Tetsuo Handa
  0 siblings, 0 replies; 78+ messages in thread
From: Tetsuo Handa @ 2016-01-07 11:23 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: mgorman, rientjes, torvalds, oleg, hughd, andrea, riel, linux-mm,
	linux-kernel, mhocko

Michal Hocko wrote:
> @@ -607,17 +748,25 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			continue;
>  		if (same_thread_group(p, victim))
>  			continue;
> -		if (unlikely(p->flags & PF_KTHREAD))
> -			continue;
>  		if (is_global_init(p))
>  			continue;
> -		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> +		if (unlikely(p->flags & PF_KTHREAD) ||
> +		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> +			/*
> +			 * We cannot use oom_reaper for the mm shared by this
> +			 * process because it wouldn't get killed and so the
> +			 * memory might be still used.
> +			 */
> +			can_oom_reap = false;
>  			continue;
> -
> +		}
>  		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
>  	}
>  	rcu_read_unlock();

According to commit a2b829d95958da20 ("mm/oom_kill.c: avoid attempting
to kill init sharing same memory"), the patch below is needed to avoid
killing the init process with SIGSEGV.

----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9548dce..9832f3f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -784,9 +784,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
  			continue;
  		if (same_thread_group(p, victim))
  			continue;
-		if (is_global_init(p))
-			continue;
-		if (unlikely(p->flags & PF_KTHREAD) ||
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
  		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
  			/*
  			 * We cannot use oom_reaper for the mm shared by this
----------

----------
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>

static int child(void *unused)
{
	char *buf = NULL;
	unsigned long i;
	unsigned long size = 0;
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	for (i = 0; i < size; i += 4096)
		buf[i] = '\0'; /* Will cause OOM due to overcommit */
	return 0;
}

int main(int argc, char *argv[])
{
	char *cp = malloc(8192);
	if (cp && clone(child, cp + 8192, CLONE_VM, NULL) > 0)
		while (1) {
			sleep(1);
			write(1, cp, 1);
		}
	return 0;
}
----------
[    2.954212] init invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO)
[    2.959697] init cpuset=/ mems_allowed=0
[    2.961927] CPU: 0 PID: 98 Comm: init Not tainted 4.4.0-rc8-next-20160106+ #28
[    2.965738] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[    2.971239]  0000000000000000 0000000075c7a38e ffffffff812ab8c4 ffff88003bd6fd48
[    2.975461]  ffffffff8117eb58 0000000000000000 ffff88003bd6fd48 0000000000000000
[    2.979572]  ffffffff810c5630 0000000000000003 0000000000000202 0000000000000549
[    2.983525] Call Trace:
[    2.984813]  [<ffffffff812ab8c4>] ? dump_stack+0x40/0x5c
[    2.987497]  [<ffffffff8117eb58>] ? dump_header+0x58/0x1ed
[    2.990285]  [<ffffffff810c5630>] ? ktime_get+0x30/0x90
[    2.992963]  [<ffffffff810fd225>] ? delayacct_end+0x35/0x60
[    2.995884]  [<ffffffff81113dc3>] ? oom_kill_process+0x323/0x460
[    2.998944]  [<ffffffff81114060>] ? out_of_memory+0x110/0x480
[    3.001833]  [<ffffffff811197ad>] ? __alloc_pages_nodemask+0xbbd/0xd60
[    3.005400]  [<ffffffff8115d951>] ? alloc_pages_vma+0xb1/0x220
[    3.008391]  [<ffffffff811780ac>] ? mem_cgroup_commit_charge+0x7c/0xf0
[    3.011668]  [<ffffffff8113ce86>] ? handle_mm_fault+0x1036/0x1460
[    3.014782]  [<ffffffff81056c97>] ? __do_page_fault+0x177/0x430
[    3.017770]  [<ffffffff81056f7b>] ? do_page_fault+0x2b/0x70
[    3.020615]  [<ffffffff815a9198>] ? page_fault+0x28/0x30
[    3.023359] Mem-Info:
[    3.024575] active_anon:244334 inactive_anon:0 isolated_anon:0
[    3.024575]  active_file:0 inactive_file:0 isolated_file:0
[    3.024575]  unevictable:561 dirty:0 writeback:0 unstable:0
[    3.024575]  slab_reclaimable:94 slab_unreclaimable:2386
[    3.024575]  mapped:275 shmem:0 pagetables:477 bounce:0
[    3.024575]  free:1924 free_pcp:304 free_cma:0
[    3.040715] Node 0 DMA free:3936kB min:60kB low:72kB high:88kB active_anon:11260kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4kB slab_unreclaimable:64kB kernel_stack:0kB pagetables:564kB unstable:0kB bounce:0kB 
free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[    3.062251] lowmem_reserve[]: 0 969 969 969
[    3.064752] Node 0 DMA32 free:3760kB min:3812kB low:4764kB high:5716kB active_anon:966076kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:2244kB isolated(anon):0kB 
isolated(file):0kB present:1032064kB managed:994872kB mlocked:0kB dirty:0kB writeback:0kB mapped:1100kB shmem:0kB slab_reclaimable:372kB slab_unreclaimable:9480kB kernel_stack:2192kB pagetables:1344kB 
unstable:0kB bounce:0kB free_pcp:1216kB local_pcp:244kB free_cma:0kB writeback_tmp:0kB pages_scanned:2244 all_unreclaimable? yes
[    3.087299] lowmem_reserve[]: 0 0 0 0
[    3.089437] Node 0 DMA: 2*4kB (ME) 1*8kB (E) 3*16kB (UME) 3*32kB (UME) 3*64kB (UME) 2*128kB (ME) 3*256kB (UME) 3*512kB (UME) 1*1024kB (E) 0*2048kB 0*4096kB = 3936kB
[    3.098058] Node 0 DMA32: 4*4kB (UME) 4*8kB (UME) 2*16kB (UE) 1*32kB (M) 1*64kB (M) 2*128kB (UE) 1*256kB (E) 0*512kB 3*1024kB (UME) 0*2048kB 0*4096kB = 3760kB
[    3.106371] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[    3.110846] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[    3.115169] 561 total pagecache pages
[    3.117051] 0 pages in swap cache
[    3.118764] Swap cache stats: add 0, delete 0, find 0/0
[    3.121414] Free swap  = 0kB
[    3.122958] Total swap = 0kB
[    3.124468] 262013 pages RAM
[    3.125962] 0 pages HighMem/MovableOnly
[    3.127932] 9319 pages reserved
[    3.129597] 0 pages cma reserved
[    3.131258] 0 pages hwpoisoned
[    3.132836] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[    3.137232] [   98]     0    98   279607   244400     489       5        0             0 init
[    3.141664] Out of memory: Kill process 98 (init) score 940 or sacrifice child
[    3.145346] Killed process 98 (init) total-vm:1118428kB, anon-rss:977464kB, file-rss:136kB, shmem-rss:0kB
[    3.416105] init[1]: segfault at 0 ip           (null) sp 00007ffd484cf5f0 error 14 in init[400000+1000]
[    3.439074] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    3.439074]
[    3.450193] Kernel Offset: disabled
[    3.456259] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    3.456259]
----------

Guessing from commit 1e99bad0d9c12a4a ("oom: kill all threads sharing oom
killed task's mm"), the

	if (same_thread_group(p, victim))
		continue;

test is for avoiding "Kill process %d (%s) sharing same memory\n" on the
victim's mm, but that printk() was already removed. Thus, I think we have
nothing to do (or can remove it if we don't mind sending SIGKILL twice).

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH 1/2] mm, oom: introduce oom reaper
@ 2016-01-07 11:23     ` Tetsuo Handa
  0 siblings, 0 replies; 78+ messages in thread
From: Tetsuo Handa @ 2016-01-07 11:23 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: mgorman, rientjes, torvalds, oleg, hughd, andrea, riel, linux-mm,
	linux-kernel, mhocko

Michal Hocko wrote:
> @@ -607,17 +748,25 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			continue;
>  		if (same_thread_group(p, victim))
>  			continue;
> -		if (unlikely(p->flags & PF_KTHREAD))
> -			continue;
>  		if (is_global_init(p))
>  			continue;
> -		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> +		if (unlikely(p->flags & PF_KTHREAD) ||
> +		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> +			/*
> +			 * We cannot use oom_reaper for the mm shared by this
> +			 * process because it wouldn't get killed and so the
> +			 * memory might be still used.
> +			 */
> +			can_oom_reap = false;
>  			continue;
> -
> +		}
>  		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
>  	}
>  	rcu_read_unlock();

According to commit a2b829d95958da20 ("mm/oom_kill.c: avoid attempting
to kill init sharing same memory"), the patch below is needed to avoid
killing the init process with SIGSEGV.

----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9548dce..9832f3f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -784,9 +784,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
  			continue;
  		if (same_thread_group(p, victim))
  			continue;
-		if (is_global_init(p))
-			continue;
-		if (unlikely(p->flags & PF_KTHREAD) ||
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
  		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
  			/*
  			 * We cannot use oom_reaper for the mm shared by this
----------

----------
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>

static int child(void *unused)
{
	char *buf = NULL;
	unsigned long i;
	unsigned long size = 0;
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	for (i = 0; i < size; i += 4096)
		buf[i] = '\0'; /* Will cause OOM due to overcommit */
	return 0;
}

int main(int argc, char *argv[])
{
	char *cp = malloc(8192);
	/* The child shares this mm via CLONE_VM; the parent keeps running
	 * so killing only one of the two tasks does not free the memory. */
	if (cp && clone(child, cp + 8192, CLONE_VM, NULL) > 0)
		while (1) {
			sleep(1);
			write(1, cp, 1);
		}
	return 0;
}
----------
[    2.954212] init invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO)
[    2.959697] init cpuset=/ mems_allowed=0
[    2.961927] CPU: 0 PID: 98 Comm: init Not tainted 4.4.0-rc8-next-20160106+ #28
[    2.965738] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[    2.971239]  0000000000000000 0000000075c7a38e ffffffff812ab8c4 ffff88003bd6fd48
[    2.975461]  ffffffff8117eb58 0000000000000000 ffff88003bd6fd48 0000000000000000
[    2.979572]  ffffffff810c5630 0000000000000003 0000000000000202 0000000000000549
[    2.983525] Call Trace:
[    2.984813]  [<ffffffff812ab8c4>] ? dump_stack+0x40/0x5c
[    2.987497]  [<ffffffff8117eb58>] ? dump_header+0x58/0x1ed
[    2.990285]  [<ffffffff810c5630>] ? ktime_get+0x30/0x90
[    2.992963]  [<ffffffff810fd225>] ? delayacct_end+0x35/0x60
[    2.995884]  [<ffffffff81113dc3>] ? oom_kill_process+0x323/0x460
[    2.998944]  [<ffffffff81114060>] ? out_of_memory+0x110/0x480
[    3.001833]  [<ffffffff811197ad>] ? __alloc_pages_nodemask+0xbbd/0xd60
[    3.005400]  [<ffffffff8115d951>] ? alloc_pages_vma+0xb1/0x220
[    3.008391]  [<ffffffff811780ac>] ? mem_cgroup_commit_charge+0x7c/0xf0
[    3.011668]  [<ffffffff8113ce86>] ? handle_mm_fault+0x1036/0x1460
[    3.014782]  [<ffffffff81056c97>] ? __do_page_fault+0x177/0x430
[    3.017770]  [<ffffffff81056f7b>] ? do_page_fault+0x2b/0x70
[    3.020615]  [<ffffffff815a9198>] ? page_fault+0x28/0x30
[    3.023359] Mem-Info:
[    3.024575] active_anon:244334 inactive_anon:0 isolated_anon:0
[    3.024575]  active_file:0 inactive_file:0 isolated_file:0
[    3.024575]  unevictable:561 dirty:0 writeback:0 unstable:0
[    3.024575]  slab_reclaimable:94 slab_unreclaimable:2386
[    3.024575]  mapped:275 shmem:0 pagetables:477 bounce:0
[    3.024575]  free:1924 free_pcp:304 free_cma:0
[    3.040715] Node 0 DMA free:3936kB min:60kB low:72kB high:88kB active_anon:11260kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4kB slab_unreclaimable:64kB kernel_stack:0kB pagetables:564kB unstable:0kB bounce:0kB 
free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[    3.062251] lowmem_reserve[]: 0 969 969 969
[    3.064752] Node 0 DMA32 free:3760kB min:3812kB low:4764kB high:5716kB active_anon:966076kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:2244kB isolated(anon):0kB 
isolated(file):0kB present:1032064kB managed:994872kB mlocked:0kB dirty:0kB writeback:0kB mapped:1100kB shmem:0kB slab_reclaimable:372kB slab_unreclaimable:9480kB kernel_stack:2192kB pagetables:1344kB 
unstable:0kB bounce:0kB free_pcp:1216kB local_pcp:244kB free_cma:0kB writeback_tmp:0kB pages_scanned:2244 all_unreclaimable? yes
[    3.087299] lowmem_reserve[]: 0 0 0 0
[    3.089437] Node 0 DMA: 2*4kB (ME) 1*8kB (E) 3*16kB (UME) 3*32kB (UME) 3*64kB (UME) 2*128kB (ME) 3*256kB (UME) 3*512kB (UME) 1*1024kB (E) 0*2048kB 0*4096kB = 3936kB
[    3.098058] Node 0 DMA32: 4*4kB (UME) 4*8kB (UME) 2*16kB (UE) 1*32kB (M) 1*64kB (M) 2*128kB (UE) 1*256kB (E) 0*512kB 3*1024kB (UME) 0*2048kB 0*4096kB = 3760kB
[    3.106371] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[    3.110846] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[    3.115169] 561 total pagecache pages
[    3.117051] 0 pages in swap cache
[    3.118764] Swap cache stats: add 0, delete 0, find 0/0
[    3.121414] Free swap  = 0kB
[    3.122958] Total swap = 0kB
[    3.124468] 262013 pages RAM
[    3.125962] 0 pages HighMem/MovableOnly
[    3.127932] 9319 pages reserved
[    3.129597] 0 pages cma reserved
[    3.131258] 0 pages hwpoisoned
[    3.132836] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[    3.137232] [   98]     0    98   279607   244400     489       5        0             0 init
[    3.141664] Out of memory: Kill process 98 (init) score 940 or sacrifice child
[    3.145346] Killed process 98 (init) total-vm:1118428kB, anon-rss:977464kB, file-rss:136kB, shmem-rss:0kB
[    3.416105] init[1]: segfault at 0 ip           (null) sp 00007ffd484cf5f0 error 14 in init[400000+1000]
[    3.439074] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    3.439074]
[    3.450193] Kernel Offset: disabled
[    3.456259] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    3.456259]
----------

Guessing from commit 1e99bad0d9c12a4a ("oom: kill all threads sharing oom
killed task's mm"), the

	if (same_thread_group(p, victim))
		continue;

test is there to avoid the "Kill process %d (%s) sharing same memory\n"
message for the victim's mm, but that printk() has already been removed.
Thus, I think there is nothing to do (or we can remove the test if we don't
mind sending SIGKILL twice).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 1/2] mm, oom: introduce oom reaper
  2016-01-06 15:42 [PATCH 0/2 -mm] oom reaper v4 Michal Hocko
@ 2016-01-06 15:42   ` Michal Hocko
  0 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2016-01-06 15:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Tetsuo Handa, David Rientjes, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
independently brought up by Oleg Nesterov.

The OOM killer currently kills only a single task, in the hope that the
task will terminate in a reasonable time and free up its memory. Such a
task (the OOM victim) gets access to memory reserves via mark_oom_victim
to allow forward progress should additional memory be needed on the exit
path.

It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above: the
OOM victim might take an unbounded amount of time to exit because it might
be blocked in the uninterruptible state, waiting for an event (e.g. a lock)
held by another task that is itself looping in the page allocator.

This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped-out memory owned
by the OOM victim, under the assumption that such memory won't be needed
once its owner has been killed and evicted from userspace anyway. There is
one notable exception, though: if the OOM victim is in the process of core
dumping, the result will be incomplete. This is considered a reasonable
constraint because the overall system health is more important than the
debuggability of a particular application.

A kernel thread has been chosen because we need a reliable way of invoking
it; a workqueue context is not appropriate because all the workers might be
busy (e.g. allocating memory). Kswapd, which sounds like another good fit,
is not appropriate either because it might get blocked on locks during
reclaim as well.

The oom_reaper has to take mmap_sem on the target task for reading, so the
solution is not 100% reliable because the semaphore might be held or blocked
for write, but the probability is reduced considerably compared to basically
any lock blocking forward progress as described above. To avoid blocking on
the lock without making forward progress, we use only a trylock and retry it
10 times with a short sleep in between.
Users of mmap_sem which need it for write should be carefully reviewed to
use _killable waiting as much as possible and to reduce allocation requests
done with the lock held to the absolute minimum, to reduce the risk even
further.

The API between the OOM killer and the oom_reaper is quite trivial:
wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only the
NULL->mm transition, and oom_reaper clears it atomically once it is done
with the work. This means that only a single mm_struct can be reaped at a
time. As the operation is potentially disruptive, we try to limit it to the
necessary minimum, and the reaper blocks any updates while it operates on
an mm. The mm_struct is pinned by mm_count to allow a parallel exit_mmap,
and the race is detected by atomic_inc_not_zero(mm_users).

Changes since v3
- many style/compile fixups by Andrew
- unmap_mapping_range_tree needs full initialization of zap_details
  to prevent missed unmaps and a follow-up BUG_ON during truncate,
  resp. misaccounting - Kirill/Andrew
- exclude mlocked pages because they need an explicit munlock by Kirill
- use subsys_initcall instead of module_init - Paul Gortmaker
Changes since v2
- fix mm_count reference leak reported by Tetsuo
- make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
- use wait_event_freezable rather than open coded wait loop - suggested
  by Tetsuo
Changes since v1
- fix the screwed up detail->check_swap_entries - Johannes
- do not use kthread_should_stop because that would need a cleanup
  and we do not have anybody to stop us - Tetsuo
- move wake_oom_reaper to oom_kill_process because we have to wait
  for all tasks sharing the same mm to get killed - Tetsuo
- do not reap mm structs which are shared with unkillable tasks - Tetsuo

Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h |   2 +
 mm/internal.h      |   5 ++
 mm/memory.c        |  17 +++---
 mm/oom_kill.c      | 157 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 170 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 25cdec395f2c..d1ce03569942 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1061,6 +1061,8 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
+	bool ignore_dirty;			/* Ignore dirty pages */
+	bool check_swap_entries;		/* Check also swap entries */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/internal.h b/mm/internal.h
index 4ae7b7c7462b..9006ce1960ff 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -41,6 +41,11 @@ extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
+void unmap_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end,
+			     struct zap_details *details);
+
 static inline void set_page_count(struct page *page, int v)
 {
 	atomic_set(&page->_count, v);
diff --git a/mm/memory.c b/mm/memory.c
index f5b8e8c9f4c3..f60c6d6aa633 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1104,6 +1104,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			if (!PageAnon(page)) {
 				if (pte_dirty(ptent)) {
+					/*
+					 * oom_reaper cannot tear down dirty
+					 * pages
+					 */
+					if (unlikely(details && details->ignore_dirty))
+						continue;
 					force_flush = 1;
 					set_page_dirty(page);
 				}
@@ -1122,8 +1128,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			}
 			continue;
 		}
-		/* If details->check_mapping, we leave swap entries. */
-		if (unlikely(details))
+		/* only check swap_entries if explicitly asked for in details */
+		if (unlikely(details && !details->check_swap_entries))
 			continue;
 
 		entry = pte_to_swp_entry(ptent);
@@ -1228,7 +1234,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 	return addr;
 }
 
-static void unmap_page_range(struct mmu_gather *tlb,
+void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
 			     struct zap_details *details)
@@ -1236,9 +1242,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
 	pgd_t *pgd;
 	unsigned long next;
 
-	if (details && !details->check_mapping)
-		details = NULL;
-
 	BUG_ON(addr >= end);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
@@ -2393,7 +2396,7 @@ static inline void unmap_mapping_range_tree(struct rb_root *root,
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows)
 {
-	struct zap_details details;
+	struct zap_details details = { };
 	pgoff_t hba = holebegin >> PAGE_SHIFT;
 	pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dc490c06941b..1ece40b94725 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,6 +35,11 @@
 #include <linux/freezer.h>
 #include <linux/ftrace.h>
 #include <linux/ratelimit.h>
+#include <linux/kthread.h>
+#include <linux/init.h>
+
+#include <asm/tlb.h>
+#include "internal.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/oom.h>
@@ -408,6 +413,141 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
 bool oom_killer_disabled __read_mostly;
 
+#ifdef CONFIG_MMU
+/*
+ * OOM Reaper kernel thread which tries to reap the memory used by the OOM
+ * victim (if that is possible) to help the OOM killer to move on.
+ */
+static struct task_struct *oom_reaper_th;
+static struct mm_struct *mm_to_reap;
+static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
+
+static bool __oom_reap_vmas(struct mm_struct *mm)
+{
+	struct mmu_gather tlb;
+	struct vm_area_struct *vma;
+	struct zap_details details = {.check_swap_entries = true,
+				      .ignore_dirty = true};
+	bool ret = true;
+
+	/* We might have raced with exit path */
+	if (!atomic_inc_not_zero(&mm->mm_users))
+		return true;
+
+	if (!down_read_trylock(&mm->mmap_sem)) {
+		ret = false;
+		goto out;
+	}
+
+	tlb_gather_mmu(&tlb, mm, 0, -1);
+	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+		if (is_vm_hugetlb_page(vma))
+			continue;
+
+		/*
+		 * mlocked VMAs require explicit munlocking before unmap.
+		 * Let's keep it simple here and skip such VMAs.
+		 */
+		if (vma->vm_flags & VM_LOCKED)
+			continue;
+
+		/*
+		 * Only anonymous pages have a good chance to be dropped
+		 * without additional steps which we cannot afford as we
+		 * are OOM already.
+		 *
+		 * We do not even care about fs backed pages because all
+		 * which are reclaimable have already been reclaimed and
+		 * we do not want to block exit_mmap by keeping mm ref
+		 * count elevated without a good reason.
+		 */
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
+					 &details);
+	}
+	tlb_finish_mmu(&tlb, 0, -1);
+	up_read(&mm->mmap_sem);
+out:
+	mmput(mm);
+	return ret;
+}
+
+static void oom_reap_vmas(struct mm_struct *mm)
+{
+	int attempts = 0;
+
+	/* Retry the down_read_trylock(mmap_sem) a few times */
+	while (attempts++ < 10 && !__oom_reap_vmas(mm))
+		schedule_timeout_idle(HZ/10);
+
+	/* Drop a reference taken by wake_oom_reaper */
+	mmdrop(mm);
+}
+
+static int oom_reaper(void *unused)
+{
+	while (true) {
+		struct mm_struct *mm;
+
+		wait_event_freezable(oom_reaper_wait,
+				     (mm = READ_ONCE(mm_to_reap)));
+		oom_reap_vmas(mm);
+		WRITE_ONCE(mm_to_reap, NULL);
+	}
+
+	return 0;
+}
+
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+	struct mm_struct *old_mm;
+
+	if (!oom_reaper_th)
+		return;
+
+	/*
+	 * Pin the given mm. Use mm_count instead of mm_users because
+	 * we do not want to delay the address space tear down.
+	 */
+	atomic_inc(&mm->mm_count);
+
+	/*
+	 * Make sure that only a single mm is ever queued for the reaper
+	 * because multiple are not necessary and the operation might be
+	 * disruptive so better reduce it to the bare minimum.
+	 */
+	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
+	if (!old_mm)
+		wake_up(&oom_reaper_wait);
+	else
+		mmdrop(mm);
+}
+
+static int __init oom_init(void)
+{
+	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+	if (IS_ERR(oom_reaper_th)) {
+		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
+				PTR_ERR(oom_reaper_th));
+		oom_reaper_th = NULL;
+	} else {
+		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+
+		/*
+		 * Make sure our oom reaper thread will get scheduled
+		 * ASAP and that it won't get preempted by malicious userspace.
+		 */
+		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
+	}
+	return 0;
+}
+subsys_initcall(oom_init)
+#else
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+}
+#endif
+
 /**
  * mark_oom_victim - mark the given task as OOM victim
  * @tsk: task to mark
@@ -517,6 +657,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
+	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -607,17 +748,25 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD))
-			continue;
 		if (is_global_init(p))
 			continue;
-		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		if (unlikely(p->flags & PF_KTHREAD) ||
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+			/*
+			 * We cannot use oom_reaper for the mm shared by this
+			 * process because it wouldn't get killed and so the
+			 * memory might still be used.
+			 */
+			can_oom_reap = false;
 			continue;
-
+		}
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
+	if (can_oom_reap)
+		wake_oom_reaper(mm);
+
 	mmdrop(mm);
 	put_task_struct(victim);
 }
-- 
2.6.4


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH 1/2] mm, oom: introduce oom reaper
@ 2016-01-06 15:42   ` Michal Hocko
  0 siblings, 0 replies; 78+ messages in thread
From: Michal Hocko @ 2016-01-06 15:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Tetsuo Handa, David Rientjes, Linus Torvalds,
	Oleg Nesterov, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
independently brought up by Oleg Nesterov.

The OOM killer currently allows to kill only a single task in a good
hope that the task will terminate in a reasonable time and frees up its
memory.  Such a task (oom victim) will get an access to memory reserves
via mark_oom_victim to allow a forward progress should there be a need
for additional memory during exit path.

It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above and
the OOM victim might take unbounded amount of time to exit because it
might be blocked in the uninterruptible state waiting for on an event
(e.g. lock) which is blocked by another task looping in the page
allocator.

This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped out memory
owned by the oom victim under an assumption that such a memory won't
be needed when its owner is killed and kicked from the userspace anyway.
There is one notable exception to this, though, if the OOM victim was
in the process of coredumping the result would be incomplete. This is
considered a reasonable constrain because the overall system health is
more important than debugability of a particular application.

A kernel thread has been chosen because we need a reliable way of
invocation so workqueue context is not appropriate because all the
workers might be busy (e.g. allocating memory). Kswapd which sounds
like another good fit is not appropriate as well because it might get
blocked on locks during reclaim as well.

oom_reaper has to take mmap_sem on the target task for reading so the
solution is not 100% because the semaphore might be held or blocked for
write but the probability is reduced considerably wrt. basically any
lock blocking forward progress as described above. In order to prevent
from blocking on the lock without any forward progress we are using only
a trylock and retry 10 times with a short sleep in between.
Users of mmap_sem which need it for write should be carefully reviewed
to use _killable waiting as much as possible and reduce allocations
requests done with the lock held to absolute minimum to reduce the risk
even further.

The API between oom killer and oom reaper is quite trivial. wake_oom_reaper
updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition
and oom_reaper clear this atomically once it is done with the work. This
means that only a single mm_struct can be reaped at the time. As the
operation is potentially disruptive we are trying to limit it to the
ncessary minimum and the reaper blocks any updates while it operates on
an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a
race is detected by atomic_inc_not_zero(mm_users).

Changes since v3
- many style/compile fixups by Andrew
- unmap_mapping_range_tree needs full initialization of zap_details
  to prevent from missing unmaps and follow up BUG_ON during truncate
  resp. misaccounting - Kirill/Andrew
- exclude mlocked pages because they need an explicit munlock by Kirill
- use subsys_initcall instead of module_init - Paul Gortmaker
Changes since v2
- fix mm_count refernce leak reported by Tetsuo
- make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
- use wait_event_freezable rather than open coded wait loop - suggested
  by Tetsuo
Changes since v1
- fix the screwed up detail->check_swap_entries - Johannes
- do not use kthread_should_stop because that would need a cleanup
  and we do not have anybody to stop us - Tetsuo
- move wake_oom_reaper to oom_kill_process because we have to wait
  for all tasks sharing the same mm to get killed - Tetsuo
- do not reap mm structs which are shared with unkillable tasks - Tetsuo

Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h |   2 +
 mm/internal.h      |   5 ++
 mm/memory.c        |  17 +++---
 mm/oom_kill.c      | 157 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 170 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 25cdec395f2c..d1ce03569942 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1061,6 +1061,8 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
+	bool ignore_dirty;			/* Ignore dirty pages */
+	bool check_swap_entries;		/* Check also swap entries */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/internal.h b/mm/internal.h
index 4ae7b7c7462b..9006ce1960ff 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -41,6 +41,11 @@ extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
+void unmap_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end,
+			     struct zap_details *details);
+
 static inline void set_page_count(struct page *page, int v)
 {
 	atomic_set(&page->_count, v);
diff --git a/mm/memory.c b/mm/memory.c
index f5b8e8c9f4c3..f60c6d6aa633 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1104,6 +1104,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			if (!PageAnon(page)) {
 				if (pte_dirty(ptent)) {
+					/*
+					 * oom_reaper cannot tear down dirty
+					 * pages
+					 */
+					if (unlikely(details && details->ignore_dirty))
+						continue;
 					force_flush = 1;
 					set_page_dirty(page);
 				}
@@ -1122,8 +1128,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			}
 			continue;
 		}
-		/* If details->check_mapping, we leave swap entries. */
-		if (unlikely(details))
+		/* only check swap_entries if explicitly asked for in details */
+		if (unlikely(details && !details->check_swap_entries))
 			continue;
 
 		entry = pte_to_swp_entry(ptent);
@@ -1228,7 +1234,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 	return addr;
 }
 
-static void unmap_page_range(struct mmu_gather *tlb,
+void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
 			     struct zap_details *details)
@@ -1236,9 +1242,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
 	pgd_t *pgd;
 	unsigned long next;
 
-	if (details && !details->check_mapping)
-		details = NULL;
-
 	BUG_ON(addr >= end);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
@@ -2393,7 +2396,7 @@ static inline void unmap_mapping_range_tree(struct rb_root *root,
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows)
 {
-	struct zap_details details;
+	struct zap_details details = { };
 	pgoff_t hba = holebegin >> PAGE_SHIFT;
 	pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dc490c06941b..1ece40b94725 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,6 +35,11 @@
 #include <linux/freezer.h>
 #include <linux/ftrace.h>
 #include <linux/ratelimit.h>
+#include <linux/kthread.h>
+#include <linux/init.h>
+
+#include <asm/tlb.h>
+#include "internal.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/oom.h>
@@ -408,6 +413,141 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
 bool oom_killer_disabled __read_mostly;
 
+#ifdef CONFIG_MMU
+/*
+ * OOM Reaper kernel thread which tries to reap the memory used by the OOM
+ * victim (if that is possible) to help the OOM killer to move on.
+ */
+static struct task_struct *oom_reaper_th;
+static struct mm_struct *mm_to_reap;
+static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
+
+static bool __oom_reap_vmas(struct mm_struct *mm)
+{
+	struct mmu_gather tlb;
+	struct vm_area_struct *vma;
+	struct zap_details details = {.check_swap_entries = true,
+				      .ignore_dirty = true};
+	bool ret = true;
+
+	/* We might have raced with exit path */
+	if (!atomic_inc_not_zero(&mm->mm_users))
+		return true;
+
+	if (!down_read_trylock(&mm->mmap_sem)) {
+		ret = false;
+		goto out;
+	}
+
+	tlb_gather_mmu(&tlb, mm, 0, -1);
+	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+		if (is_vm_hugetlb_page(vma))
+			continue;
+
+		/*
+		 * mlocked VMAs require explicit munlocking before unmap.
+		 * Let's keep it simple here and skip such VMAs.
+		 */
+		if (vma->vm_flags & VM_LOCKED)
+			continue;
+
+		/*
+		 * Only anonymous pages have a good chance to be dropped
+		 * without additional steps which we cannot afford as we
+		 * are OOM already.
+		 *
+		 * We do not even care about fs backed pages because all
+		 * which are reclaimable have already been reclaimed and
+		 * we do not want to block exit_mmap by keeping mm ref
+		 * count elevated without a good reason.
+		 */
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
+					 &details);
+	}
+	tlb_finish_mmu(&tlb, 0, -1);
+	up_read(&mm->mmap_sem);
+out:
+	mmput(mm);
+	return ret;
+}
+
+static void oom_reap_vmas(struct mm_struct *mm)
+{
+	int attempts = 0;
+
+	/* Retry the down_read_trylock(mmap_sem) a few times */
+	while (attempts++ < 10 && !__oom_reap_vmas(mm))
+		schedule_timeout_idle(HZ/10);
+
+	/* Drop a reference taken by wake_oom_reaper */
+	mmdrop(mm);
+}
+
+static int oom_reaper(void *unused)
+{
+	while (true) {
+		struct mm_struct *mm;
+
+		wait_event_freezable(oom_reaper_wait,
+				     (mm = READ_ONCE(mm_to_reap)));
+		oom_reap_vmas(mm);
+		WRITE_ONCE(mm_to_reap, NULL);
+	}
+
+	return 0;
+}
+
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+	struct mm_struct *old_mm;
+
+	if (!oom_reaper_th)
+		return;
+
+	/*
+	 * Pin the given mm. Use mm_count instead of mm_users because
+	 * we do not want to delay the address space tear down.
+	 */
+	atomic_inc(&mm->mm_count);
+
+	/*
+	 * Make sure that only a single mm is ever queued for the reaper
+	 * because multiple are not necessary and the operation might be
+	 * disruptive so better reduce it to the bare minimum.
+	 */
+	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
+	if (!old_mm)
+		wake_up(&oom_reaper_wait);
+	else
+		mmdrop(mm);
+}
+
+static int __init oom_init(void)
+{
+	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+	if (IS_ERR(oom_reaper_th)) {
+		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
+				PTR_ERR(oom_reaper_th));
+		oom_reaper_th = NULL;
+	} else {
+		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+
+		/*
+		 * Make sure our oom reaper thread will get scheduled when
+		 * ASAP and that it won't get preempted by malicious userspace.
+		 */
+		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
+	}
+	return 0;
+}
+subsys_initcall(oom_init)
+#else
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+}
+#endif
+
 /**
  * mark_oom_victim - mark the given task as OOM victim
  * @tsk: task to mark
@@ -517,6 +657,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
+	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -607,17 +748,25 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD))
-			continue;
 		if (is_global_init(p))
 			continue;
-		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		if (unlikely(p->flags & PF_KTHREAD) ||
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+			/*
+			 * We cannot use oom_reaper for the mm shared by this
+			 * process because it wouldn't get killed and so its
+			 * memory might still be in use.
+			 */
+			can_oom_reap = false;
 			continue;
-
+		}
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
+	if (can_oom_reap)
+		wake_oom_reaper(mm);
+
 	mmdrop(mm);
 	put_task_struct(victim);
 }
-- 
2.6.4


Thread overview:
2015-12-15 18:36 [PATCH 1/2] mm, oom: introduce oom reaper Michal Hocko
2015-12-17  0:50 ` Andrew Morton
2015-12-17 13:02   ` Michal Hocko
2015-12-17 19:55     ` Linus Torvalds
2015-12-17 20:00       ` Andrew Morton
2015-12-18 11:54         ` Michal Hocko
2015-12-18 21:14           ` Andrew Morton
2015-12-21  8:38             ` Michal Hocko
2015-12-17 21:13     ` Andrew Morton
2015-12-18 12:11       ` Michal Hocko
2015-12-18 12:10     ` Tetsuo Handa
2015-12-20  7:14       ` Tetsuo Handa
2015-12-18  0:15 ` Andrew Morton
2015-12-18 11:48   ` Michal Hocko
2015-12-21 20:38 ` Paul Gortmaker
2016-01-06  9:10   ` Michal Hocko
2016-01-06 14:26     ` Paul Gortmaker
2016-01-06 15:00       ` Michal Hocko
2015-12-23 23:00 ` Ross Zwisler
2015-12-24  9:47   ` Michal Hocko
2015-12-24 11:06     ` Tetsuo Handa
2015-12-24 20:39       ` Ross Zwisler
2015-12-25 11:41       ` Michal Hocko
2015-12-24 20:44     ` Ross Zwisler
2015-12-25 11:35       ` Michal Hocko
2015-12-25 11:44         ` Michal Hocko
2016-01-06 15:42 [PATCH 0/2 -mm] oom reaper v4 Michal Hocko
2016-01-06 15:42 ` [PATCH 1/2] mm, oom: introduce oom reaper Michal Hocko
2016-01-07 11:23   ` Tetsuo Handa
2016-01-07 12:30     ` Michal Hocko
2016-01-11 22:54   ` Andrew Morton
2016-01-12  8:16     ` Michal Hocko
2016-01-28  1:28   ` David Rientjes
2016-01-28 21:42     ` Michal Hocko
2016-02-02  3:02       ` David Rientjes
2016-02-02  8:57         ` Michal Hocko
2016-02-02 11:48           ` Tetsuo Handa
2016-02-02 22:55             ` David Rientjes
2016-02-02 22:51           ` David Rientjes
2016-02-03 10:31             ` Tetsuo Handa
