* [PATCH 0/9] oom reaper v6
@ 2016-03-22 11:00 ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-03-22 11:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, LKML, Tetsuo Handa, David Rientjes, Ingo Molnar,
	Johannes Weiner, Mel Gorman, Michal Hocko, Oleg Nesterov,
	Peter Zijlstra, Vladimir Davydov

Hi,
I am reposting the whole patchset on top of the current Linus tree,
which should already contain the big pile of Andrew's mm patches. This
should make review easier and I also hope that this core part of the
work can go into 4.6.

The previous version was posted here [1]. Hugh and David suggested
dropping [2] because the munlock path currently depends on the page
lock and it is better if the initial version is conservative and
prevents any potential lockups, even though it is not clear whether
they are real - nobody has seen oom_reaper stuck on the page lock
AFAICK. Hugh or I will have a look and try to make the munlock path
not depend on the page lock as follow-up work.

Apart from that, the feedback revealed one bug for a very unusual
configuration (sysctl_oom_kill_allocating_task), which has been fixed
by patch 8, and one potential mis-interaction with the pm freezer,
fixed by patch 7.

I think the current code base is already very useful for many
situations. The rest of the feedback was mostly about potential
enhancements of the current code, which I would really prefer to build
on top of this series. I plan to finish my work on making mmap_sem
killable for write in the upcoming release cycle and hopefully have it
merged in the next merge window. I believe more extensions will follow.

This code has been sitting in the mmotm tree (and thus linux-next) for
a while. Are there any fundamental objections to having this part
merged in this merge window?

Thanks!

[1] http://lkml.kernel.org/r/1454505240-23446-1-git-send-email-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1454505240-23446-3-git-send-email-mhocko@kernel.org

* [PATCH 1/9] sched: add schedule_timeout_idle()
  2016-03-22 11:00 ` Michal Hocko
@ 2016-03-22 11:00   ` Michal Hocko
  -1 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-03-22 11:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, LKML, Tetsuo Handa, David Rientjes, Ingo Molnar,
	Peter Zijlstra

From: Andrew Morton <akpm@linux-foundation.org>

This will be needed in the patch "mm, oom: introduce oom reaper".
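For illustration only, a minimal sketch of the kind of caller the new
helper targets (hypothetical names, not part of the patch): a kthread
that polls periodically but should not inflate the load average the
way an uninterruptible sleep would.

	static int background_poll(void *unused)
	{
		while (!kthread_should_stop()) {
			do_poll_work();			/* hypothetical work item */
			schedule_timeout_idle(HZ);	/* ~1s sleep in TASK_IDLE */
		}
		return 0;
	}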

Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/sched.h |  1 +
 kernel/time/timer.c   | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 084ed9fba620..9cf5731472fe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -425,6 +425,7 @@ extern signed long schedule_timeout(signed long timeout);
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
+extern signed long schedule_timeout_idle(signed long timeout);
 asmlinkage void schedule(void);
 extern void schedule_preempt_disabled(void);
 
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index d1798fa0c743..73164c3aa56b 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1566,6 +1566,17 @@ signed long __sched schedule_timeout_uninterruptible(signed long timeout)
 }
 EXPORT_SYMBOL(schedule_timeout_uninterruptible);
 
+/*
+ * Like schedule_timeout_uninterruptible(), except this task will not contribute
+ * to load average.
+ */
+signed long __sched schedule_timeout_idle(signed long timeout)
+{
+	__set_current_state(TASK_IDLE);
+	return schedule_timeout(timeout);
+}
+EXPORT_SYMBOL(schedule_timeout_idle);
+
 #ifdef CONFIG_HOTPLUG_CPU
 static void migrate_timer_list(struct tvec_base *new_base, struct hlist_head *head)
 {
-- 
2.7.0

* [PATCH 2/9] mm, oom: introduce oom reaper
  2016-03-22 11:00 ` Michal Hocko
@ 2016-03-22 11:00   ` Michal Hocko
  -1 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-03-22 11:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, LKML, Tetsuo Handa, David Rientjes, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

This is based on an idea from Mel Gorman discussed during LSFMM 2015
and independently brought up by Oleg Nesterov.

The OOM killer currently allows killing only a single task, in the
hope that the task will terminate in a reasonable time and free up its
memory.  Such a task (the oom victim) gets access to memory reserves
via mark_oom_victim to allow forward progress should there be a need
for additional memory during the exit path.

It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above:
the OOM victim might take an unbounded amount of time to exit because
it might be blocked in the uninterruptible state waiting for an event
(e.g. a lock) held by another task looping in the page allocator.
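For illustration, one possible interleaving of such a lockup (the lock
name is hypothetical):

task A					oom victim
  mutex_lock(&lock)	// succeeds
  alloc_page()		// loops, waiting for
			// memory to be freed
					  mutex_lock(&lock)
					  // sleeps uninterruptibly;
					  // cannot exit and free
					  // its memory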

This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim
additional memory by preemptively reaping the anonymous or swapped out
memory owned by the oom victim, under the assumption that such memory
won't be needed once its owner is killed and kicked out of userspace
anyway. There is one notable exception, though: if the OOM victim was
in the process of coredumping, the result would be incomplete. This is
considered a reasonable constraint because the overall system health is
more important than the debuggability of a particular application.

A kernel thread has been chosen because we need a reliable way of
invocation; a workqueue context is not appropriate because all the
workers might be busy (e.g. allocating memory). Kswapd, which sounds
like another good fit, is not appropriate either because it might get
blocked on locks during reclaim.

oom_reaper has to take mmap_sem on the target task for reading, so the
solution is not 100% reliable because the semaphore might be held (or
waited for) for write, but the probability is reduced considerably
compared to basically any lock blocking forward progress as described
above. In order to prevent blocking on the lock without any forward
progress, only a trylock is used, retried 10 times with a short sleep
in between. Users of mmap_sem which need it for write should be
carefully reviewed to use _killable waiting as much as possible, and
allocation requests done with the lock held should be reduced to the
absolute minimum to lower the risk even further (see the sketch
below).
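A minimal sketch of that intended write-side pattern - note that
down_write_killable landed in mainline only after this series, so this
illustrates the direction being described, not code from these patches:

	/* a write locker that an OOM-killed task can still escape */
	if (down_write_killable(&mm->mmap_sem))
		return -EINTR;	/* fatal signal (e.g. SIGKILL) pending */
	/* ... minimal work, ideally no allocations under the lock ... */
	up_write(&mm->mmap_sem);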

The API between the oom killer and the oom reaper is quite trivial:
wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only the
NULL->mm transition, and oom_reaper clears it atomically once it is
done with the work. This means that only a single mm_struct can be
reaped at a time. As the operation is potentially disruptive we try to
limit it to the necessary minimum, and the reaper blocks any updates
while it operates on an mm. The mm_struct is pinned by mm_count to
allow a parallel exit_mmap, and a race is detected by
atomic_inc_not_zero(mm_users).
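A condensed restatement of that pinning scheme (both calls appear in
the patch below):

	atomic_inc(&mm->mm_count);	/* keeps the mm_struct allocated
					 * but does not block exit_mmap */
	...
	if (!atomic_inc_not_zero(&mm->mm_users))
		return true;		/* exit path won the race; the
					 * address space is going away */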

Changes since v4
- drop MAX_RT_PRIO-1 as per David - memcg/cpuset/mempolicy OOM killing
  might interfere with the rest of the system
Changes since v3
- many style/compile fixups by Andrew
- unmap_mapping_range_tree needs full initialization of zap_details
  to prevent missing unmaps and a follow-up BUG_ON during truncate
  resp. misaccounting - Kirill/Andrew
- exclude mlocked pages because they need an explicit munlock by Kirill
- use subsys_initcall instead of module_init - Paul Gortmaker
- do not tear down mm if it is shared with the global init because this
  could lead to SEGV and panic - Tetsuo
Changes since v2
- fix mm_count reference leak reported by Tetsuo
- make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
- use wait_event_freezable rather than open coded wait loop - suggested
  by Tetsuo
Changes since v1
- fix the screwed up detail->check_swap_entries - Johannes
- do not use kthread_should_stop because that would need a cleanup
  and we do not have anybody to stop us - Tetsuo
- move wake_oom_reaper to oom_kill_process because we have to wait
  for all tasks sharing the same mm to get killed - Tetsuo
- do not reap mm structs which are shared with unkillable tasks - Tetsuo

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h |   2 +
 mm/internal.h      |   5 ++
 mm/memory.c        |  17 +++---
 mm/oom_kill.c      | 151 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 4 files changed, 162 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 450fc977ed02..ed6407d1b7b5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1132,6 +1132,8 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
+	bool ignore_dirty;			/* Ignore dirty pages */
+	bool check_swap_entries;		/* Check also swap entries */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/internal.h b/mm/internal.h
index 7449392c6faa..b79abb6721cf 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -38,6 +38,11 @@
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
+void unmap_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end,
+			     struct zap_details *details);
+
 extern int __do_page_cache_readahead(struct address_space *mapping,
 		struct file *filp, pgoff_t offset, unsigned long nr_to_read,
 		unsigned long lookahead_size);
diff --git a/mm/memory.c b/mm/memory.c
index 81dca0083fcd..098f00d05461 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1102,6 +1102,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			if (!PageAnon(page)) {
 				if (pte_dirty(ptent)) {
+					/*
+					 * oom_reaper cannot tear down dirty
+					 * pages
+					 */
+					if (unlikely(details && details->ignore_dirty))
+						continue;
 					force_flush = 1;
 					set_page_dirty(page);
 				}
@@ -1120,8 +1126,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			}
 			continue;
 		}
-		/* If details->check_mapping, we leave swap entries. */
-		if (unlikely(details))
+		/* only check swap_entries if explicitly asked for in details */
+		if (unlikely(details && !details->check_swap_entries))
 			continue;
 
 		entry = pte_to_swp_entry(ptent);
@@ -1226,7 +1232,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 	return addr;
 }
 
-static void unmap_page_range(struct mmu_gather *tlb,
+void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
 			     struct zap_details *details)
@@ -1234,9 +1240,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
 	pgd_t *pgd;
 	unsigned long next;
 
-	if (details && !details->check_mapping)
-		details = NULL;
-
 	BUG_ON(addr >= end);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
@@ -2432,7 +2435,7 @@ static inline void unmap_mapping_range_tree(struct rb_root *root,
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows)
 {
-	struct zap_details details;
+	struct zap_details details = { };
 	pgoff_t hba = holebegin >> PAGE_SHIFT;
 	pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 06f7e1707847..f7ed6ece0719 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,6 +35,11 @@
 #include <linux/freezer.h>
 #include <linux/ftrace.h>
 #include <linux/ratelimit.h>
+#include <linux/kthread.h>
+#include <linux/init.h>
+
+#include <asm/tlb.h>
+#include "internal.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/oom.h>
@@ -405,6 +410,133 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
 bool oom_killer_disabled __read_mostly;
 
+#ifdef CONFIG_MMU
+/*
+ * OOM Reaper kernel thread which tries to reap the memory used by the OOM
+ * victim (if that is possible) to help the OOM killer to move on.
+ */
+static struct task_struct *oom_reaper_th;
+static struct mm_struct *mm_to_reap;
+static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
+
+static bool __oom_reap_vmas(struct mm_struct *mm)
+{
+	struct mmu_gather tlb;
+	struct vm_area_struct *vma;
+	struct zap_details details = {.check_swap_entries = true,
+				      .ignore_dirty = true};
+	bool ret = true;
+
+	/* We might have raced with exit path */
+	if (!atomic_inc_not_zero(&mm->mm_users))
+		return true;
+
+	if (!down_read_trylock(&mm->mmap_sem)) {
+		ret = false;
+		goto out;
+	}
+
+	tlb_gather_mmu(&tlb, mm, 0, -1);
+	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+		if (is_vm_hugetlb_page(vma))
+			continue;
+
+		/*
+		 * mlocked VMAs require explicit munlocking before unmap.
+		 * Let's keep it simple here and skip such VMAs.
+		 */
+		if (vma->vm_flags & VM_LOCKED)
+			continue;
+
+		/*
+		 * Only anonymous pages have a good chance to be dropped
+		 * without additional steps which we cannot afford as we
+		 * are OOM already.
+		 *
+		 * We do not even care about fs backed pages because all
+		 * which are reclaimable have already been reclaimed and
+		 * we do not want to block exit_mmap by keeping mm ref
+		 * count elevated without a good reason.
+		 */
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
+					 &details);
+	}
+	tlb_finish_mmu(&tlb, 0, -1);
+	up_read(&mm->mmap_sem);
+out:
+	mmput(mm);
+	return ret;
+}
+
+static void oom_reap_vmas(struct mm_struct *mm)
+{
+	int attempts = 0;
+
+	/* Retry the down_read_trylock(mmap_sem) a few times */
+	while (attempts++ < 10 && !__oom_reap_vmas(mm))
+		schedule_timeout_idle(HZ/10);
+
+	/* Drop a reference taken by wake_oom_reaper */
+	mmdrop(mm);
+}
+
+static int oom_reaper(void *unused)
+{
+	while (true) {
+		struct mm_struct *mm;
+
+		wait_event_freezable(oom_reaper_wait,
+				     (mm = READ_ONCE(mm_to_reap)));
+		oom_reap_vmas(mm);
+		WRITE_ONCE(mm_to_reap, NULL);
+	}
+
+	return 0;
+}
+
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+	struct mm_struct *old_mm;
+
+	if (!oom_reaper_th)
+		return;
+
+	/*
+	 * Pin the given mm. Use mm_count instead of mm_users because
+	 * we do not want to delay the address space tear down.
+	 */
+	atomic_inc(&mm->mm_count);
+
+	/*
+	 * Make sure that only a single mm is ever queued for the reaper
+	 * because multiple are not necessary and the operation might be
+	 * disruptive so better reduce it to the bare minimum.
+	 */
+	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
+	if (!old_mm)
+		wake_up(&oom_reaper_wait);
+	else
+		mmdrop(mm);
+}
+
+static int __init oom_init(void)
+{
+	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+	if (IS_ERR(oom_reaper_th)) {
+		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
+				PTR_ERR(oom_reaper_th));
+		oom_reaper_th = NULL;
+	}
+	return 0;
+}
+subsys_initcall(oom_init)
+#else
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+}
+#endif
+
 /**
  * mark_oom_victim - mark the given task as OOM victim
  * @tsk: task to mark
@@ -510,6 +642,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
+	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -600,17 +733,23 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD))
-			continue;
-		if (is_global_init(p))
-			continue;
-		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+			/*
+			 * We cannot use oom_reaper for the mm shared by this
+			 * process because it wouldn't get killed and so the
+			 * memory might be still used.
+			 */
+			can_oom_reap = false;
 			continue;
-
+		}
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
+	if (can_oom_reap)
+		wake_oom_reaper(mm);
+
 	mmdrop(mm);
 	put_task_struct(victim);
 }
-- 
2.7.0

* [PATCH 3/9] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-03-22 11:00 ` Michal Hocko
@ 2016-03-22 11:00   ` Michal Hocko
  -1 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-03-22 11:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, LKML, Tetsuo Handa, David Rientjes, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

When oom_reaper manages to unmap all the eligible vmas there shouldn't
be much of the freeable memory held by the oom victim left anymore, so
it makes sense to clear the TIF_MEMDIE flag for the victim and allow
the OOM killer to select another task.

The lack of TIF_MEMDIE also means that the victim cannot access memory
reserves anymore, but that shouldn't be a problem because it would get
the access again if it needed to allocate and hit the OOM killer again
due to the fatal_signal_pending resp. PF_EXITING check. We can safely
hide the task from the OOM killer because it is clearly not a good
candidate anymore, as everything reclaimable has been torn down
already.

This patch allows capping the time an OOM victim can keep TIF_MEMDIE
and thus hold off further global OOM killer actions, provided the oom
reaper is able to take mmap_sem for the associated mm struct. This is
not guaranteed now, but further steps should make sure that blocking
on mmap_sem for write is done in a killable state, which will help to
reduce such lock contention. That is not done by this patch.

Note that exit_oom_victim might now be called on a remote task from
__oom_reap_task, so we have to check and clear the flag atomically;
otherwise we might race and underflow oom_victims or wake up waiters
too early.
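For illustration, the interleaving the atomic test-and-clear prevents
(both paths calling exit_oom_victim on the same task):

exit_mm (victim)		__oom_reap_task (reaper)
  clear TIF_MEMDIE
				  clear TIF_MEMDIE // already clear
  atomic_dec(oom_victims)
				  atomic_dec(oom_victims)
				  // underflow, waiters woken early

With test_and_clear_tsk_thread_flag only the caller which actually
clears the flag goes on to decrement oom_victims.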

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h |  2 +-
 kernel/exit.c       |  2 +-
 mm/oom_kill.c       | 73 +++++++++++++++++++++++++++++++++++------------------
 3 files changed, 50 insertions(+), 27 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 03e6257321f0..45993b840ed6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -91,7 +91,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 
 extern bool out_of_memory(struct oom_control *oc);
 
-extern void exit_oom_victim(void);
+extern void exit_oom_victim(struct task_struct *tsk);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/kernel/exit.c b/kernel/exit.c
index 10e088237fed..ba3bd29d7e1d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -434,7 +434,7 @@ static void exit_mm(struct task_struct *tsk)
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim();
+		exit_oom_victim(tsk);
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f7ed6ece0719..2830b1c6483e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -416,20 +416,36 @@ bool oom_killer_disabled __read_mostly;
  * victim (if that is possible) to help the OOM killer to move on.
  */
 static struct task_struct *oom_reaper_th;
-static struct mm_struct *mm_to_reap;
+static struct task_struct *task_to_reap;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 
-static bool __oom_reap_vmas(struct mm_struct *mm)
+static bool __oom_reap_task(struct task_struct *tsk)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	struct task_struct *p;
 	struct zap_details details = {.check_swap_entries = true,
 				      .ignore_dirty = true};
 	bool ret = true;
 
-	/* We might have raced with exit path */
-	if (!atomic_inc_not_zero(&mm->mm_users))
+	/*
+	 * Make sure we find the associated mm_struct even when the particular
+	 * thread has already terminated and cleared its mm.
+	 * We might have race with exit path so consider our work done if there
+	 * is no mm.
+	 */
+	p = find_lock_task_mm(tsk);
+	if (!p)
+		return true;
+
+	mm = p->mm;
+	if (!atomic_inc_not_zero(&mm->mm_users)) {
+		task_unlock(p);
 		return true;
+	}
+
+	task_unlock(p);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
@@ -464,60 +480,66 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
 	up_read(&mm->mmap_sem);
+
+	/*
+	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
+	 * reasonably reclaimable memory anymore. OOM killer can continue
+	 * by selecting other victim if unmapping hasn't led to any
+	 * improvements. This also means that selecting this task doesn't
+	 * make any sense.
+	 */
+	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+	exit_oom_victim(tsk);
 out:
 	mmput(mm);
 	return ret;
 }
 
-static void oom_reap_vmas(struct mm_struct *mm)
+static void oom_reap_task(struct task_struct *tsk)
 {
 	int attempts = 0;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < 10 && !__oom_reap_vmas(mm))
+	while (attempts++ < 10 && !__oom_reap_task(tsk))
 		schedule_timeout_idle(HZ/10);
 
 	/* Drop a reference taken by wake_oom_reaper */
-	mmdrop(mm);
+	put_task_struct(tsk);
 }
 
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct mm_struct *mm;
+		struct task_struct *tsk;
 
 		wait_event_freezable(oom_reaper_wait,
-				     (mm = READ_ONCE(mm_to_reap)));
-		oom_reap_vmas(mm);
-		WRITE_ONCE(mm_to_reap, NULL);
+				     (tsk = READ_ONCE(task_to_reap)));
+		oom_reap_task(tsk);
+		WRITE_ONCE(task_to_reap, NULL);
 	}
 
 	return 0;
 }
 
-static void wake_oom_reaper(struct mm_struct *mm)
+static void wake_oom_reaper(struct task_struct *tsk)
 {
-	struct mm_struct *old_mm;
+	struct task_struct *old_tsk;
 
 	if (!oom_reaper_th)
 		return;
 
-	/*
-	 * Pin the given mm. Use mm_count instead of mm_users because
-	 * we do not want to delay the address space tear down.
-	 */
-	atomic_inc(&mm->mm_count);
+	get_task_struct(tsk);
 
 	/*
 	 * Make sure that only a single mm is ever queued for the reaper
 	 * because multiple are not necessary and the operation might be
 	 * disruptive so better reduce it to the bare minimum.
 	 */
-	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
-	if (!old_mm)
+	old_tsk = cmpxchg(&task_to_reap, NULL, tsk);
+	if (!old_tsk)
 		wake_up(&oom_reaper_wait);
 	else
-		mmdrop(mm);
+		put_task_struct(tsk);
 }
 
 static int __init oom_init(void)
@@ -532,7 +554,7 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static void wake_oom_reaper(struct mm_struct *mm)
+static void wake_oom_reaper(struct task_struct *tsk)
 {
 }
 #endif
@@ -563,9 +585,10 @@ void mark_oom_victim(struct task_struct *tsk)
 /**
  * exit_oom_victim - note the exit of an OOM victim
  */
-void exit_oom_victim(void)
+void exit_oom_victim(struct task_struct *tsk)
 {
-	clear_thread_flag(TIF_MEMDIE);
+	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
+		return;
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
@@ -748,7 +771,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	rcu_read_unlock();
 
 	if (can_oom_reap)
-		wake_oom_reaper(mm);
+		wake_oom_reaper(victim);
 
 	mmdrop(mm);
 	put_task_struct(victim);
-- 
2.7.0

* [PATCH 4/9] mm, oom_reaper: report success/failure
  2016-03-22 11:00 ` Michal Hocko
@ 2016-03-22 11:00   ` Michal Hocko
  -1 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-03-22 11:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, LKML, Tetsuo Handa, David Rientjes, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Inform about successful/failed oom_reaper attempts and dump all the
held locks to tell us more about who is blocking the progress.
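As an editorial side note: the K() macro this patch moves converts a
page count to kilobytes, e.g. with 4kB pages (PAGE_SHIFT == 12):

	#define K(x) ((x) << (PAGE_SHIFT-10))
	/* x << 2 == x * 4, so K(300) pages == 1200 kB */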

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 2830b1c6483e..e627ce235e38 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -419,6 +419,7 @@ static struct task_struct *oom_reaper_th;
 static struct task_struct *task_to_reap;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 
+#define K(x) ((x) << (PAGE_SHIFT-10))
 static bool __oom_reap_task(struct task_struct *tsk)
 {
 	struct mmu_gather tlb;
@@ -479,6 +480,11 @@ static bool __oom_reap_task(struct task_struct *tsk)
 					 &details);
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
+	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
+			task_pid_nr(tsk), tsk->comm,
+			K(get_mm_counter(mm, MM_ANONPAGES)),
+			K(get_mm_counter(mm, MM_FILEPAGES)),
+			K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	up_read(&mm->mmap_sem);
 
 	/*
@@ -495,14 +501,21 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	return ret;
 }
 
+#define MAX_OOM_REAP_RETRIES 10
 static void oom_reap_task(struct task_struct *tsk)
 {
 	int attempts = 0;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < 10 && !__oom_reap_task(tsk))
+	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
 		schedule_timeout_idle(HZ/10);
 
+	if (attempts > MAX_OOM_REAP_RETRIES) {
+		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+				task_pid_nr(tsk), tsk->comm);
+		debug_show_all_locks();
+	}
+
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
 }
@@ -649,7 +662,6 @@ static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 	return false;
 }
 
-#define K(x) ((x) << (PAGE_SHIFT-10))
 /*
  * Must be called while holding a reference to p, which will be released upon
  * returning.
-- 
2.7.0

* [PATCH 5/9] mm, oom_reaper: implement OOM victims queuing
  2016-03-22 11:00 ` Michal Hocko
@ 2016-03-22 11:00   ` Michal Hocko
  -1 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-03-22 11:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, LKML, Tetsuo Handa, David Rientjes, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

wake_oom_reaper has allowed only a single oom victim to be queued. The
main reason for that was simplicity, as other solutions would require
some way of queuing. The current approach is racy, and that was deemed
sufficient as the oom_reaper is considered a best-effort approach to
help with oom handling when the OOM victim cannot terminate in a
reasonable time. The race could lead to missing an oom victim which
can get stuck:

out_of_memory
  wake_oom_reaper
    cmpxchg // OK
    			oom_reaper
			  oom_reap_task
			    __oom_reap_task
oom_victim terminates
			      atomic_inc_not_zero // fail
out_of_memory
  wake_oom_reaper
    cmpxchg // fails
			  task_to_reap = NULL

This race requires 2 OOM invocations in a short time period, which is
not very likely but certainly not impossible. E.g. the original victim
might not have released a lot of memory for some reason.

The situation would improve considerably if wake_oom_reaper used more
robust queuing. This is what this patch implements: it adds an
oom_reaper_list list_head into task_struct (eating a hole before the
embedded thread_struct for that purpose) and an oom_reaper_lock
spinlock for queuing synchronization. wake_oom_reaper then adds the
task to the queue and oom_reaper dequeues it.
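For comparison, the same sequence with the list-based queue (sketch):
the second wakeup enqueues instead of failing a cmpxchg, so no victim
is missed.

out_of_memory
  wake_oom_reaper
    list_add // victim 1 queued
    			oom_reaper
			  oom_reap_task // reaps victim 1
out_of_memory
  wake_oom_reaper
    list_add // victim 2 queued, not lost
			  oom_reap_task // reaps victim 2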

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h |  3 +++
 mm/oom_kill.c         | 36 +++++++++++++++++++-----------------
 2 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9cf5731472fe..bc5867296f7b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1838,6 +1838,9 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
+#ifdef CONFIG_MMU
+	struct list_head oom_reaper_list;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e627ce235e38..8e0bd279135f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -416,8 +416,10 @@ bool oom_killer_disabled __read_mostly;
  * victim (if that is possible) to help the OOM killer to move on.
  */
 static struct task_struct *oom_reaper_th;
-static struct task_struct *task_to_reap;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
+static LIST_HEAD(oom_reaper_list);
+static DEFINE_SPINLOCK(oom_reaper_lock);
+
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
 static bool __oom_reap_task(struct task_struct *tsk)
@@ -523,12 +525,20 @@ static void oom_reap_task(struct task_struct *tsk)
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct task_struct *tsk;
+		struct task_struct *tsk = NULL;
 
 		wait_event_freezable(oom_reaper_wait,
-				     (tsk = READ_ONCE(task_to_reap)));
-		oom_reap_task(tsk);
-		WRITE_ONCE(task_to_reap, NULL);
+				     (!list_empty(&oom_reaper_list)));
+		spin_lock(&oom_reaper_lock);
+		if (!list_empty(&oom_reaper_list)) {
+			tsk = list_first_entry(&oom_reaper_list,
+					struct task_struct, oom_reaper_list);
+			list_del(&tsk->oom_reaper_list);
+		}
+		spin_unlock(&oom_reaper_lock);
+
+		if (tsk)
+			oom_reap_task(tsk);
 	}
 
 	return 0;
@@ -536,23 +546,15 @@ static int oom_reaper(void *unused)
 
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	struct task_struct *old_tsk;
-
 	if (!oom_reaper_th)
 		return;
 
 	get_task_struct(tsk);
 
-	/*
-	 * Make sure that only a single mm is ever queued for the reaper
-	 * because multiple are not necessary and the operation might be
-	 * disruptive so better reduce it to the bare minimum.
-	 */
-	old_tsk = cmpxchg(&task_to_reap, NULL, tsk);
-	if (!old_tsk)
-		wake_up(&oom_reaper_wait);
-	else
-		put_task_struct(tsk);
+	spin_lock(&oom_reaper_lock);
+	list_add(&tsk->oom_reaper_list, &oom_reaper_list);
+	spin_unlock(&oom_reaper_lock);
+	wake_up(&oom_reaper_wait);
 }
 
 static int __init oom_init(void)
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 6/9] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-03-22 11:00 ` Michal Hocko
@ 2016-03-22 11:00   ` Michal Hocko
  -1 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-03-22 11:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, LKML, Tetsuo Handa, David Rientjes, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Tetsuo has reported that oom_kill_allocating_task=1 will cause
oom_reaper_list corruption because oom_kill_process doesn't follow the
standard OOM exclusion (i.e. it ignores TIF_MEMDIE) and allows
enqueuing the same task multiple times - e.g. by sacrificing the same
child multiple times.

This patch fixes the issue by introducing a new MMF_OOM_KILLED mm flag
which is set atomically in oom_kill_process; the oom reaper is skipped
if the flag was already set.
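
A minimal sketch of the exclusion idiom (the mm is assumed to be
already pinned; see the diff below for the real placement):

	/* only the first setter of the bit may queue the mm for reaping */
	can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags);
	if (can_oom_reap)
		wake_oom_reaper(victim);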

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h | 2 ++
 mm/oom_kill.c         | 6 +++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index bc5867296f7b..acb480b581e3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -511,6 +511,8 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 
+#define MMF_OOM_KILLED		21	/* OOM killer has chosen this mm */
+
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
 struct sighand_struct {
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8e0bd279135f..b38a648558f9 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -679,7 +679,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
+	bool can_oom_reap;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -741,6 +741,10 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	/* Get a reference to safely compare mm after task_unlock(victim) */
 	mm = victim->mm;
 	atomic_inc(&mm->mm_count);
+
+	/* Make sure we do not try to oom reap the mm multiple times */
+	can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags);
+
 	/*
 	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
 	 * the OOM victim from depleting the memory reserves from the user
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 7/9] oom: make oom_reaper_list single linked
  2016-03-22 11:00 ` Michal Hocko
@ 2016-03-22 11:00   ` Michal Hocko
  -1 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-03-22 11:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, LKML, Tetsuo Handa, David Rientjes, Vladimir Davydov

From: Vladimir Davydov <vdavydov@virtuozzo.com>

Entries are only added to and removed from the head of oom_reaper_list,
so we can use a singly linked list and hence save a word in task_struct.
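
A sketch of the resulting intrusive LIFO using the names from the diff
below (both operations run under oom_reaper_lock):

	/* push: link the new task in front of the current head */
	tsk->oom_reaper_list = oom_reaper_list;
	oom_reaper_list = tsk;

	/* pop: take the head and advance to its successor */
	tsk = oom_reaper_list;
	oom_reaper_list = tsk->oom_reaper_list;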

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h |  2 +-
 mm/oom_kill.c         | 15 +++++++--------
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index acb480b581e3..d118445a332e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1841,7 +1841,7 @@ struct task_struct {
 #endif
 	int pagefault_disabled;
 #ifdef CONFIG_MMU
-	struct list_head oom_reaper_list;
+	struct task_struct *oom_reaper_list;
 #endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b38a648558f9..af75260f32c3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -417,7 +417,7 @@ bool oom_killer_disabled __read_mostly;
  */
 static struct task_struct *oom_reaper_th;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static LIST_HEAD(oom_reaper_list);
+static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
 
@@ -527,13 +527,11 @@ static int oom_reaper(void *unused)
 	while (true) {
 		struct task_struct *tsk = NULL;
 
-		wait_event_freezable(oom_reaper_wait,
-				     (!list_empty(&oom_reaper_list)));
+		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
 		spin_lock(&oom_reaper_lock);
-		if (!list_empty(&oom_reaper_list)) {
-			tsk = list_first_entry(&oom_reaper_list,
-					struct task_struct, oom_reaper_list);
-			list_del(&tsk->oom_reaper_list);
+		if (oom_reaper_list != NULL) {
+			tsk = oom_reaper_list;
+			oom_reaper_list = tsk->oom_reaper_list;
 		}
 		spin_unlock(&oom_reaper_lock);
 
@@ -552,7 +550,8 @@ static void wake_oom_reaper(struct task_struct *tsk)
 	get_task_struct(tsk);
 
 	spin_lock(&oom_reaper_lock);
-	list_add(&tsk->oom_reaper_list, &oom_reaper_list);
+	tsk->oom_reaper_list = oom_reaper_list;
+	oom_reaper_list = tsk;
 	spin_unlock(&oom_reaper_lock);
 	wake_up(&oom_reaper_wait);
 }
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 8/9] oom: make oom_reaper freezable
  2016-03-22 11:00 ` Michal Hocko
@ 2016-03-22 11:00   ` Michal Hocko
  -1 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-03-22 11:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, LKML, Tetsuo Handa, David Rientjes, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

After "oom: clear TIF_MEMDIE after oom_reaper managed to unmap the
address space" oom_reaper will call exit_oom_victim on the target
task once it is done. This might, however, race with the PM freezer:

CPU0				CPU1				CPU2
freeze_processes
  try_to_freeze_tasks
  				# Allocation request
				out_of_memory
  oom_killer_disable
				  wake_oom_reaper(P1)
				  				__oom_reap_task
								  exit_oom_victim(P1)
    wait_event(oom_victims==0)
[...]
    				do_exit(P1)
				  perform IO/interfere with the freezer

which breaks the oom_killer_disable semantics. We no longer have a
guarantee that the oom victim won't interfere with the freezer because
it might be anywhere on the way to do_exit while the freezer thinks the
task has already terminated. It might trigger IO or touch devices which
are already frozen.

In order to close this race, make the oom_reaper thread freezable; a
minimal sketch of the resulting loop follows the list below. This will
work because
	a) an already running oom_reaper will block the freezer from
	   entering the quiescent state
	b) wake_oom_reaper will not wake up the reaper after it has been
	   frozen
	c) the only way to call exit_oom_victim after try_to_freeze_tasks
	   is from the oom victim's context, when we know that further
	   interference shouldn't be possible
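
A minimal sketch of the resulting freezable kthread loop (simplified
from the code below; the dequeue/reap step is elided):

	static int oom_reaper(void *unused)
	{
		/* kthreads are not freezable by default; opt in explicitly */
		set_freezable();

		while (true) {
			/* freezes here instead of blocking the PM freezer */
			wait_event_freezable(oom_reaper_wait,
					     oom_reaper_list != NULL);
			/* dequeue and reap the next victim as before */
		}
		return 0;
	}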

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index af75260f32c3..bed2885d10b0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -524,6 +524,8 @@ static void oom_reap_task(struct task_struct *tsk)
 
 static int oom_reaper(void *unused)
 {
+	set_freezable();
+
 	while (true) {
 		struct task_struct *tsk = NULL;
 
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 9/9] oom, oom_reaper: protect oom_reaper_list using simpler way
  2016-03-22 11:00 ` Michal Hocko
@ 2016-03-22 11:00   ` Michal Hocko
  -1 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-03-22 11:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, LKML, Tetsuo Handa, David Rientjes

From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

"oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task" tried
to protect oom_reaper_list using the MMF_OOM_KILLED flag, but we can do
the same by simply checking tsk->oom_reaper_list != NULL.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h | 2 --
 mm/oom_kill.c         | 8 ++------
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d118445a332e..78434d4f85f2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -511,8 +511,6 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 
-#define MMF_OOM_KILLED		21	/* OOM killer has chosen this mm */
-
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
 struct sighand_struct {
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index bed2885d10b0..cfafb91ebcd9 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -546,7 +546,7 @@ static int oom_reaper(void *unused)
 
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	if (!oom_reaper_th)
+	if (!oom_reaper_th || tsk->oom_reaper_list)
 		return;
 
 	get_task_struct(tsk);
@@ -680,7 +680,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap;
+	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -742,10 +742,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	/* Get a reference to safely compare mm after task_unlock(victim) */
 	mm = victim->mm;
 	atomic_inc(&mm->mm_count);
-
-	/* Make sure we do not try to oom reap the mm multiple times */
-	can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags);
-
 	/*
 	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
 	 * the OOM victim from depleting the memory reserves from the user
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/9] sched: add schedule_timeout_idle()
  2016-03-22 11:00   ` Michal Hocko
@ 2016-03-22 12:23     ` Peter Zijlstra
  -1 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2016-03-22 12:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, Tetsuo Handa, David Rientjes, Ingo Molnar

On Tue, Mar 22, 2016 at 12:00:18PM +0100, Michal Hocko wrote:

>  extern signed long schedule_timeout_interruptible(signed long timeout);
>  extern signed long schedule_timeout_killable(signed long timeout);
>  extern signed long schedule_timeout_uninterruptible(signed long timeout);
> +extern signed long schedule_timeout_idle(signed long timeout);

> +/*
> + * Like schedule_timeout_uninterruptible(), except this task will not contribute
> + * to load average.
> + */
> +signed long __sched schedule_timeout_idle(signed long timeout)
> +{
> +	__set_current_state(TASK_IDLE);
> +	return schedule_timeout(timeout);
> +}
> +EXPORT_SYMBOL(schedule_timeout_idle);

Yes we have 3 such other wrappers, but I've gotta ask: why? They seem
pretty pointless.

Why not kill the lot?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/9] sched: add schedule_timeout_idle()
  2016-03-22 12:23     ` Peter Zijlstra
@ 2016-03-22 12:33       ` Michal Hocko
  -1 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-03-22 12:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, linux-mm, LKML, Tetsuo Handa, David Rientjes, Ingo Molnar

On Tue 22-03-16 13:23:45, Peter Zijlstra wrote:
> On Tue, Mar 22, 2016 at 12:00:18PM +0100, Michal Hocko wrote:
> 
> >  extern signed long schedule_timeout_interruptible(signed long timeout);
> >  extern signed long schedule_timeout_killable(signed long timeout);
> >  extern signed long schedule_timeout_uninterruptible(signed long timeout);
> > +extern signed long schedule_timeout_idle(signed long timeout);
> 
> > +/*
> > + * Like schedule_timeout_uninterruptible(), except this task will not contribute
> > + * to load average.
> > + */
> > +signed long __sched schedule_timeout_idle(signed long timeout)
> > +{
> > +	__set_current_state(TASK_IDLE);
> > +	return schedule_timeout(timeout);
> > +}
> > +EXPORT_SYMBOL(schedule_timeout_idle);
> 
> Yes we have 3 such other wrappers, but I've gotta ask: why? They seem
> pretty pointless.

It seems it is just too easy to miss the __set_current_state (I am
talking from my own experience). This also seems to be a pretty common
pattern so why not wrap it under a common call.
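
E.g. a hypothetical call site showing the mistake:

	/* buggy: the task stays TASK_RUNNING, so this barely sleeps */
	schedule_timeout(HZ);

	/* correct: set a sleeping state first */
	__set_current_state(TASK_UNINTERRUPTIBLE);
	schedule_timeout(HZ);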

> Why not kill the lot?

We have over 400 users, would it be much better if we open code all of
them? It doesn't sound like a huge win to me.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/9] sched: add schedule_timeout_idle()
  2016-03-22 12:33       ` Michal Hocko
@ 2016-03-22 12:51         ` Peter Zijlstra
  -1 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2016-03-22 12:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, Tetsuo Handa, David Rientjes, Ingo Molnar

On Tue, Mar 22, 2016 at 01:33:14PM +0100, Michal Hocko wrote:
> On Tue 22-03-16 13:23:45, Peter Zijlstra wrote:
> > On Tue, Mar 22, 2016 at 12:00:18PM +0100, Michal Hocko wrote:
> > 
> > >  extern signed long schedule_timeout_interruptible(signed long timeout);
> > >  extern signed long schedule_timeout_killable(signed long timeout);
> > >  extern signed long schedule_timeout_uninterruptible(signed long timeout);
> > > +extern signed long schedule_timeout_idle(signed long timeout);
> > 
> > > +/*
> > > + * Like schedule_timeout_uninterruptible(), except this task will not contribute
> > > + * to load average.
> > > + */
> > > +signed long __sched schedule_timeout_idle(signed long timeout)
> > > +{
> > > +	__set_current_state(TASK_IDLE);
> > > +	return schedule_timeout(timeout);
> > > +}
> > > +EXPORT_SYMBOL(schedule_timeout_idle);
> > 
> > Yes we have 3 such other wrappers, but I've gotta ask: why? They seem
> > pretty pointless.
> 
> It seems it is just too easy to miss the __set_current_state (I am
> talking from my own experience).

Well, that's what you get; if you call schedule() and forget to set a
blocking state you also don't block; where's the problem?

> This also seems to be a pretty common
> pattern so why not wrap it under a common call.

It just seems extremely silly to create a (out-of-line even) function
for a store and a call.

> > Why not kill the lot?
> 
> We have over 400 users, would it be much better if we open code all of
> them? It doesn't sound like a huge win to me.

Dunno, changing them around isn't much work, we've got coccinelle for
that.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/9] sched: add schedule_timeout_idle()
  2016-03-22 12:51         ` Peter Zijlstra
@ 2016-03-22 13:08           ` Michal Hocko
  -1 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-03-22 13:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, linux-mm, LKML, Tetsuo Handa, David Rientjes, Ingo Molnar

On Tue 22-03-16 13:51:13, Peter Zijlstra wrote:
> On Tue, Mar 22, 2016 at 01:33:14PM +0100, Michal Hocko wrote:
> > On Tue 22-03-16 13:23:45, Peter Zijlstra wrote:
> > > On Tue, Mar 22, 2016 at 12:00:18PM +0100, Michal Hocko wrote:
> > > 
> > > >  extern signed long schedule_timeout_interruptible(signed long timeout);
> > > >  extern signed long schedule_timeout_killable(signed long timeout);
> > > >  extern signed long schedule_timeout_uninterruptible(signed long timeout);
> > > > +extern signed long schedule_timeout_idle(signed long timeout);
> > > 
> > > > +/*
> > > > + * Like schedule_timeout_uninterruptible(), except this task will not contribute
> > > > + * to load average.
> > > > + */
> > > > +signed long __sched schedule_timeout_idle(signed long timeout)
> > > > +{
> > > > +	__set_current_state(TASK_IDLE);
> > > > +	return schedule_timeout(timeout);
> > > > +}
> > > > +EXPORT_SYMBOL(schedule_timeout_idle);
> > > 
> > > Yes we have 3 such other wrappers, but I've gotta ask: why? They seem
> > > pretty pointless.
> > 
> > It seems it is just too easy to miss the __set_current_state (I am
> > talking from my own experience).
> 
> Well, that's what you get; if you call schedule() and forget to set a
> blocking state you also don't block; where's the problem?

The error-prone nature of schedule_timeout usage was the reason to
introduce them in the first place IIRC, which makes me think this
mistake is not so uncommon.
 
[...]

> > > Why not kill the lot?
> > 
> > We have over 400 users, would it be much better if we open code all of
> > them? It doesn't sound like a huge win to me.
> 
> Dunno, changing them around isn't much work, we've got coccinelle for
> that.

If that sounds like a more appropriate plan I won't object. I can simply
change my patch to do __set_current_state and schedule_timeout.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/9] sched: add schedule_timeout_idle()
  2016-03-22 13:08           ` Michal Hocko
@ 2016-03-22 13:22             ` Peter Zijlstra
  -1 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2016-03-22 13:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, LKML, Tetsuo Handa, David Rientjes, Ingo Molnar

On Tue, Mar 22, 2016 at 02:08:23PM +0100, Michal Hocko wrote:
> On Tue 22-03-16 13:51:13, Peter Zijlstra wrote:
> If that sounds like a more appropriate plan I won't object. I can simply
> change my patch to do __set_current_state and schedule_timeout.

I dunno, I just think these wrappers are silly.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/9] sched: add schedule_timeout_idle()
  2016-03-22 13:22             ` Peter Zijlstra
@ 2016-03-22 17:56               ` Johannes Weiner
  -1 siblings, 0 replies; 48+ messages in thread
From: Johannes Weiner @ 2016-03-22 17:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michal Hocko, Andrew Morton, linux-mm, LKML, Tetsuo Handa,
	David Rientjes, Ingo Molnar

On Tue, Mar 22, 2016 at 02:22:49PM +0100, Peter Zijlstra wrote:
> On Tue, Mar 22, 2016 at 02:08:23PM +0100, Michal Hocko wrote:
> > On Tue 22-03-16 13:51:13, Peter Zijlstra wrote:
> > If that sounds like a more appropriate plan I won't object. I can simply
> > change my patch to do __set_current_state and schedule_timeout.
> 
> I dunno, I just think these wrappers are silly.

Adding out-of-line, exported wrappers for every single task state is
kind of silly. But it's still a common operation to wait in a certain
state, so having a single function for that makes sense. Kind of like
spin_lock_irqsave and friends.

Maybe this would be better?:

static inline long schedule_timeout_state(long timeout, long state)
{
	__set_current_state(state);
	return schedule_timeout(timeout);
}
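
Callers would then spell the state out explicitly, e.g. (hypothetical
call site):

	schedule_timeout_state(HZ/10, TASK_IDLE);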

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/9] sched: add schedule_timeout_idle()
  2016-03-22 17:56               ` Johannes Weiner
@ 2016-03-22 21:23                 ` Peter Zijlstra
  -1 siblings, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2016-03-22 21:23 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, linux-mm, LKML, Tetsuo Handa,
	David Rientjes, Ingo Molnar, Thomas Gleixner

On Tue, Mar 22, 2016 at 01:56:26PM -0400, Johannes Weiner wrote:
> On Tue, Mar 22, 2016 at 02:22:49PM +0100, Peter Zijlstra wrote:
> > On Tue, Mar 22, 2016 at 02:08:23PM +0100, Michal Hocko wrote:
> > > On Tue 22-03-16 13:51:13, Peter Zijlstra wrote:
> > > If that sounds like a more appropriate plan I won't object. I can simply
> > > change my patch to do __set_current_state and schedule_timeout.
> > 
> > I dunno, I just think these wrappers are silly.
> 
> Adding out-of-line, exported wrappers for every single task state is
> kind of silly. But it's still a common operation to wait in a certain
> state, so having a single function for that makes sense. Kind of like
> spin_lock_irqsave and friends.
> 
> Maybe this would be better?:
> 
> static inline long schedule_timeout_state(long timeout, long state)
> {
> 	__set_current_state(state);
> 	return schedule_timeout(timeout);
> }

Probably. However, with such semantics the schedule*() name is wrong
too; you cannot use these functions to build actual wait loops etc.

So maybe:

static inline long sleep_in_state(long timeout, long state)
{
	__set_current_state(state);
	return schedule_timeout(timeout);
}

might be an even better name; but at that point we look very much like
the msleep*() class of functions, so maybe we should do:

long sleep_in_state(long state, long timeout)
{
	while (timeout && !signal_pending_state(state, current)) {
		__set_current_state(state);
		timeout = schedule_timeout(timeout);
	}
	return timeout;
}
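
which a caller could then use like (hypothetical call site):

	/* sleep ~100ms without contributing to load average */
	sleep_in_state(TASK_IDLE, HZ/10);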

Hmm ?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/9] oom reaper v6
  2016-03-22 11:00 ` Michal Hocko
@ 2016-03-22 22:08   ` David Rientjes
  -1 siblings, 0 replies; 48+ messages in thread
From: David Rientjes @ 2016-03-22 22:08 UTC (permalink / raw)
  To: Michal Hocko, Tetsuo Handa
  Cc: Andrew Morton, linux-mm, LKML, Ingo Molnar, Johannes Weiner,
	Mel Gorman, Michal Hocko, Oleg Nesterov, Peter Zijlstra,
	Vladimir Davydov

On Tue, 22 Mar 2016, Michal Hocko wrote:

> Hi,
> I am reposting the whole patchset on top of the current Linus tree which should
> already contain big pile of Andrew's mm patches. This should serve an easier
> reviewability and I also hope that this core part of the work can go to 4.6.
> 
> The previous version was posted here [1] Hugh and David have suggested to
> drop [2] because the munlock path currently depends on the page lock and
> it is better if the initial version was conservative and prevent from
> any potential lockups even though it is not clear whether they are real
> - nobody has seen oom_reaper stuck on the page lock AFAICK. Me or Hugh
> will have a look and try to make the munlock path not depend on the page
> lock as a follow up work.
> 
> Apart from that the feedback revealed one bug for a very unusual
> configuration (sysctl_oom_kill_allocating_task) and that has been fixed
> by patch 8 and one potential mis interaction with the pm freezer fixed by
> patch 7.
> 
> I think the current code base is already very useful for many situations.
> The rest of the feedback was mostly about potential enhancements of the
> current code which I would really prefer to build on top of the current
> series. I plan to finish my mmap_sem killable for write in the upcoming
> release cycle and hopefully have it merged in the next merge window.
> I believe more extensions will follow.
> 
> This code has been sitting in the mmotm (thus linux-next) for a while.
> Are there any fundamental objections to have this part merged in this
> merge window?
> 

Tetsuo, have you been able to run your previous test cases on top of this 
version and do you have any concerns about it or possible extensions that 
could be made?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/9] mm, oom: introduce oom reaper
  2016-03-22 11:00   ` Michal Hocko
@ 2016-03-22 22:45     ` Andrew Morton
  -1 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2016-03-22 22:45 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, LKML, Tetsuo Handa, David Rientjes, Michal Hocko

On Tue, 22 Mar 2016 12:00:19 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> independently brought up by Oleg Nesterov.

What happened to oom-reaper-handle-mlocked-pages.patch?  I have it in
-mm but I don't see it in this v6.


From: Michal Hocko <mhocko@suse.com>
Subject: oom reaper: handle mlocked pages

__oom_reap_vmas currently skips over all mlocked vmas because they need
special treatment before they are unmapped.  This is primarily done for
simplicity.  There is no reason to skip over them and thereby reduce the
amount of reclaimed memory.  This is safe from the semantic point of
view because try_to_unmap_one during the rmap walk will keep telling
reclaim to cull the page back and mlock it again.

munlock_vma_pages_all is also safe to be called from the oom reaper
context because it doesn't take any locks other than mmap_sem (for
read).

Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/oom_kill.c |   12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff -puN mm/oom_kill.c~oom-reaper-handle-mlocked-pages mm/oom_kill.c
--- a/mm/oom_kill.c~oom-reaper-handle-mlocked-pages
+++ a/mm/oom_kill.c
@@ -442,13 +442,6 @@ static bool __oom_reap_vmas(struct mm_st
 			continue;
 
 		/*
-		 * mlocked VMAs require explicit munlocking before unmap.
-		 * Let's keep it simple here and skip such VMAs.
-		 */
-		if (vma->vm_flags & VM_LOCKED)
-			continue;
-
-		/*
 		 * Only anonymous pages have a good chance to be dropped
 		 * without additional steps which we cannot afford as we
 		 * are OOM already.
@@ -458,9 +451,12 @@ static bool __oom_reap_vmas(struct mm_st
 		 * we do not want to block exit_mmap by keeping mm ref
 		 * count elevated without a good reason.
 		 */
-		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
+			if (vma->vm_flags & VM_LOCKED)
+				munlock_vma_pages_all(vma);
 			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
 					 &details);
+		}
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
 	up_read(&mm->mmap_sem);
_
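
For reference, the safety argument in this changelog rests on the mlock
check in try_to_unmap_one(); roughly the following (a heavily abridged
sketch of the mm/rmap.c flow of that era, not the verbatim code):

	/*
	 * Sketch: when reclaim's rmap walk meets a VM_LOCKED vma it
	 * backs off, re-mlocks the page and returns SWAP_MLOCK so the
	 * page is culled back to the unevictable list instead of
	 * being reclaimed.
	 */
	if (!(flags & TTU_IGNORE_MLOCK)) {
		if (vma->vm_flags & VM_LOCKED)
			goto out_mlock;	/* mlock_vma_page(); ret = SWAP_MLOCK */
	}

So a page which is still mapped into some other VM_LOCKED vma cannot be
freed behind the mlocker's back even after the oom_reaper unmaps it here.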

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/9] mm, oom: introduce oom reaper
  2016-03-22 22:45     ` Andrew Morton
@ 2016-03-22 22:58       ` Hugh Dickins
  -1 siblings, 0 replies; 48+ messages in thread
From: Hugh Dickins @ 2016-03-22 22:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, linux-mm, LKML, Tetsuo Handa, David Rientjes, Michal Hocko

On Tue, 22 Mar 2016, Andrew Morton wrote:
> On Tue, 22 Mar 2016 12:00:19 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> > independently brought up by Oleg Nesterov.
> 
> What happened to oom-reaper-handle-mlocked-pages.patch?  I have it in
> -mm but I don't see it in this v6.

Michal is dropping it for now, explained in his 0/9:

The previous version was posted here [1] Hugh and David have suggested to
drop [2] because the munlock path currently depends on the page lock and
it is better if the initial version was conservative and prevent from
any potential lockups even though it is not clear whether they are real
- nobody has seen oom_reaper stuck on the page lock AFAICK. Me or Hugh
will have a look and try to make the munlock path not depend on the page
lock as a follow up work.

[1] http://lkml.kernel.org/r/1454505240-23446-1-git-send-email-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1454505240-23446-3-git-send-email-mhocko@kernel.org
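
The page lock dependency in question sits in the munlock path; in rough
outline (a simplified sketch of the munlock_vma_pages_range() flow,
abridged rather than verbatim):

	page = follow_page(vma, start, FOLL_GET | FOLL_DUMP);
	if (page && !IS_ERR(page)) {
		lock_page(page);	/* may block on the page lock */
		munlock_vma_page(page);	/* requires PageLocked(page) */
		unlock_page(page);
		put_page(page);
	}

Should anything hold that page lock while itself waiting for memory, the
oom_reaper could block here, which is why the conservative v6 keeps
skipping VM_LOCKED vmas for now.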

> 
> From: Michal Hocko <mhocko@suse.com>
> Subject: oom reaper: handle mlocked pages
> 
> __oom_reap_vmas currently skips over all mlocked vmas because they need
> special treatment before they are unmapped.  This is primarily done for
> simplicity.  There is no reason to skip over them and reduce the amount of
> reclaimed memory.  This is safe from the semantic point of view because
> try_to_unmap_one during the rmap walk would tell reclaim to cull the
> page back and mlock it again.
> 
> munlock_vma_pages_all is also safe to be called from the oom reaper
> context because it doesn't sit on any locks but mmap_sem (for read).
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Cc: Andrea Arcangeli <andrea@kernel.org>
> Acked-by: David Rientjes <rientjes@google.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  mm/oom_kill.c |   12 ++++--------
>  1 file changed, 4 insertions(+), 8 deletions(-)
> 
> diff -puN mm/oom_kill.c~oom-reaper-handle-mlocked-pages mm/oom_kill.c
> --- a/mm/oom_kill.c~oom-reaper-handle-mlocked-pages
> +++ a/mm/oom_kill.c
> @@ -442,13 +442,6 @@ static bool __oom_reap_vmas(struct mm_st
>  			continue;
>  
>  		/*
> -		 * mlocked VMAs require explicit munlocking before unmap.
> -		 * Let's keep it simple here and skip such VMAs.
> -		 */
> -		if (vma->vm_flags & VM_LOCKED)
> -			continue;
> -
> -		/*
>  		 * Only anonymous pages have a good chance to be dropped
>  		 * without additional steps which we cannot afford as we
>  		 * are OOM already.
> @@ -458,9 +451,12 @@ static bool __oom_reap_vmas(struct mm_st
>  		 * we do not want to block exit_mmap by keeping mm ref
>  		 * count elevated without a good reason.
>  		 */
> -		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> +		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
> +			if (vma->vm_flags & VM_LOCKED)
> +				munlock_vma_pages_all(vma);
>  			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
>  					 &details);
> +		}
>  	}
>  	tlb_finish_mmu(&tlb, 0, -1);
>  	up_read(&mm->mmap_sem);
> _

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/9] sched: add schedule_timeout_idle()
  2016-03-22 12:23     ` Peter Zijlstra
@ 2016-03-22 23:02       ` Andrew Morton
  -1 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2016-03-22 23:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michal Hocko, linux-mm, LKML, Tetsuo Handa, David Rientjes, Ingo Molnar

On Tue, 22 Mar 2016 13:23:45 +0100 Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Mar 22, 2016 at 12:00:18PM +0100, Michal Hocko wrote:
> 
> >  extern signed long schedule_timeout_interruptible(signed long timeout);
> >  extern signed long schedule_timeout_killable(signed long timeout);
> >  extern signed long schedule_timeout_uninterruptible(signed long timeout);
> > +extern signed long schedule_timeout_idle(signed long timeout);
> 
> > +/*
> > + * Like schedule_timeout_uninterruptible(), except this task will not contribute
> > + * to load average.
> > + */
> > +signed long __sched schedule_timeout_idle(signed long timeout)
> > +{
> > +	__set_current_state(TASK_IDLE);
> > +	return schedule_timeout(timeout);
> > +}
> > +EXPORT_SYMBOL(schedule_timeout_idle);
> 
> Yes we have 3 such other wrappers, but I've gotta ask: why? They seem
> pretty pointless.

I like the wrappers.  At least, more than having to read the open-coded
version.  The latter is just more stuff to interpret and to check
whereas I can look at "schedule_timeout_idle" and think "yup, I know
what that does".
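
For comparison, this is the open-coded version every caller would
otherwise carry (a minimal sketch; TASK_IDLE is TASK_UNINTERRUPTIBLE |
TASK_NOLOAD, so the sleeping task does not contribute to the load
average):

	/* open-coded equivalent of schedule_timeout_idle(timeout) */
	__set_current_state(TASK_IDLE);
	timeout = schedule_timeout(timeout);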

But whatever.  I'll probably be sending this series up for 4.6 and we can
worry about the schedule_timeout_foo() stuff later.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/9] sched: add schedule_timeout_idle()
  2016-03-22 21:23                 ` Peter Zijlstra
@ 2016-03-23 10:43                   ` Michal Hocko
  -1 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-03-23 10:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Johannes Weiner, Andrew Morton, linux-mm, LKML, Tetsuo Handa,
	David Rientjes, Ingo Molnar, Thomas Gleixner

On Tue 22-03-16 22:23:52, Peter Zijlstra wrote:
[...]
> Probably. However, with such semantics the schedule*() name is wrong
> too, you cannot use these functions to build actual wait loops etc.
> 
> So maybe:
> 
> static inline long sleep_in_state(long timeout, long state)
> {
> 	__set_current_state(state);
> 	return schedule_timeout(timeout);
> }
> 
> might be an even better name; but at that point we look very like the
> msleep*() class of function, so maybe we should do:
> 
> long sleep_in_state(long state, long timeout)
> {
> 	while (timeout && !signal_pending_state(state, current)) {
> 		__set_current_state(state);
> 		timeout = schedule_timeout(timeout);
> 	}
> 	return timeout;
> }
> 
> Hmm ?

I am not sure how many callers care about premature wake-ups (e.g. I
could find a use for it), but this indeed has better and cleaner
semantics.
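
To make the difference concrete, a killable sleep under the proposed
helper (not merged anywhere; return convention taken from the snippet
above) would be used like this — a usage sketch only:

	long left = sleep_in_state(TASK_KILLABLE, HZ);
	if (left)
		return -EINTR;	/* a fatal signal cut the sleep short */
	/* left == 0: full timeout elapsed, spurious wakeups retried inside */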

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/9] oom reaper v6
  2016-03-22 22:08   ` David Rientjes
@ 2016-03-23 11:11     ` Tetsuo Handa
  -1 siblings, 0 replies; 48+ messages in thread
From: Tetsuo Handa @ 2016-03-23 11:11 UTC (permalink / raw)
  To: rientjes, mhocko
  Cc: akpm, linux-mm, linux-kernel, mingo, hannes, mgorman, mhocko,
	oleg, a.p.zijlstra, vdavydov

David Rientjes wrote:
> On Tue, 22 Mar 2016, Michal Hocko wrote:
> 
> > Hi,
> > I am reposting the whole patchset on top of the current Linus tree which should
> > already contain big pile of Andrew's mm patches. This should serve an easier
> > reviewability and I also hope that this core part of the work can go to 4.6.
> > 
> > The previous version was posted here [1] Hugh and David have suggested to
> > drop [2] because the munlock path currently depends on the page lock and
> > it is better if the initial version was conservative and prevent from
> > any potential lockups even though it is not clear whether they are real
> > - nobody has seen oom_reaper stuck on the page lock AFAICK. Me or Hugh
> > will have a look and try to make the munlock path not depend on the page
> > lock as a follow up work.
> > 
> > Apart from that the feedback revealed one bug for a very unusual
> > configuration (sysctl_oom_kill_allocating_task) and that has been fixed
> > by patch 8 and one potential mis interaction with the pm freezer fixed by
> > patch 7.
> > 
> > I think the current code base is already very useful for many situations.
> > The rest of the feedback was mostly about potential enhancements of the
> > current code which I would really prefer to build on top of the current
> > series. I plan to finish my mmap_sem killable for write in the upcoming
> > release cycle and hopefully have it merged in the next merge window.
> > I believe more extensions will follow.
> > 
> > This code has been sitting in the mmotm (thus linux-next) for a while.
> > Are there any fundamental objections to have this part merged in this
> > merge window?
> > 
> 
> Tetsuo, have you been able to run your previous test cases on top of this 
> version and do you have any concerns about it or possible extensions that 
> could be made?
> 

I think [PATCH 3/9], [PATCH 4/9] and [PATCH 8/9] will be mostly reverted.
My concerns and possible extensions are explained in

    Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
    http://lkml.kernel.org/r/201603152015.JAE86937.VFOLtQFOFJOSHM@I-love.SAKURA.ne.jp

. Regarding "[PATCH 4/9] mm, oom_reaper: report success/failure",
debug_show_all_locks() may not be safe

    commit 856848737bd944c1 "lockdep: fix debug_show_all_locks()"
    commit 82a1fcb90287052a "softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks"

and showing traces might be more useful.
(A discussion for making printk() completely async is in progress.)
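
A minimal sketch of that alternative — dumping only the victim's trace
instead of system-wide lock state (sched_show_task() prints a single
task's backtrace; treat this as an illustration, not a tested patch):

	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
		task_pid_nr(tsk), tsk->comm);
	sched_show_task(tsk);	/* one task's stack, no global lock walk */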

But we don't have time to update this series before the merge window for 4.6 closes.
We want to send the current patchset as-is for now, don't we? So, please go ahead.



My other concerns about OOM handling:

  Change the TIF_MEMDIE strategy from per-thread to per-signal_struct.

    [PATCH] mm,oom: Set TIF_MEMDIE on all OOM-killed threads.
    http://lkml.kernel.org/r/1458529634-5951-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp

  Found a bug in the too_many_isolated() assumption.

    How to handle infinite too_many_isolated() loop (for OOM detection rework v4) ?
    http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp

  Waiting for a patch to be merged.

    [PATCH] mm,writeback: Don't use memory reserves for wb_start_writeback
    http://lkml.kernel.org/r/20160318134230.GC30225@dhcp22.suse.cz

  And, kmallocwd, __GFP_KILLABLE, and a timeout (or some other finite mechanism) for TIF_MEMDIE.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/9] oom reaper v6
  2016-03-23 11:11     ` Tetsuo Handa
@ 2016-03-23 12:07       ` Michal Hocko
  -1 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-03-23 12:07 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, akpm, linux-mm, linux-kernel, mingo, hannes, mgorman,
	oleg, a.p.zijlstra, vdavydov

On Wed 23-03-16 20:11:35, Tetsuo Handa wrote:
> David Rientjes wrote:
[...]
> > Tetsuo, have you been able to run your previous test cases on top of this 
> > version and do you have any concerns about it or possible extensions that 
> > could be made?
> > 
> 
> I think [PATCH 3/9] [PATCH 4/9] [PATCH 8/9] will be mostly reverted.
> My concerns and possible extensions are explained in
> 
>     Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
>     http://lkml.kernel.org/r/201603152015.JAE86937.VFOLtQFOFJOSHM@I-love.SAKURA.ne.jp

I believe the issues you have raised there are a matter for further
discussion, as they are potential improvements of the existing
functionality rather than fixes for a regression in the current code.

> . Regarding "[PATCH 4/9] mm, oom_reaper: report success/failure",
> debug_show_all_locks() may not be safe
> 
>     commit 856848737bd944c1 "lockdep: fix debug_show_all_locks()"
>     commit 82a1fcb90287052a "softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks"

Let me ask again. What exactly is unsafe about calling
debug_show_all_locks here? It is true that 856848737bd944c1 has
changed debug_show_all_locks to ignore running tasks, which limits
this functionality to some degree, but I still think it might be
useful. The proposed alternatives were way too verbose and complex on
their own. This is something to be discussed further as well, though.

> and showing traces might be more useful.
> (A discussion for making printk() completely async is in progress.)
> 
> But we don't have time to update this series before merge window for 4.6 closes.
> We want to send current patchset as is for now, don't we? So, please go ahead.

I am happy that we are on the same page here.

> My other concerns about OOM handling:

Let's stick to oom reaper here, please.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2016-03-23 12:07 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-22 11:00 [PATCH 0/9] oom reaper v6 Michal Hocko
2016-03-22 11:00 ` Michal Hocko
2016-03-22 11:00 ` [PATCH 1/9] sched: add schedule_timeout_idle() Michal Hocko
2016-03-22 11:00   ` Michal Hocko
2016-03-22 12:23   ` Peter Zijlstra
2016-03-22 12:23     ` Peter Zijlstra
2016-03-22 12:33     ` Michal Hocko
2016-03-22 12:33       ` Michal Hocko
2016-03-22 12:51       ` Peter Zijlstra
2016-03-22 12:51         ` Peter Zijlstra
2016-03-22 13:08         ` Michal Hocko
2016-03-22 13:08           ` Michal Hocko
2016-03-22 13:22           ` Peter Zijlstra
2016-03-22 13:22             ` Peter Zijlstra
2016-03-22 17:56             ` Johannes Weiner
2016-03-22 17:56               ` Johannes Weiner
2016-03-22 21:23               ` Peter Zijlstra
2016-03-22 21:23                 ` Peter Zijlstra
2016-03-23 10:43                 ` Michal Hocko
2016-03-23 10:43                   ` Michal Hocko
2016-03-22 23:02     ` Andrew Morton
2016-03-22 23:02       ` Andrew Morton
2016-03-22 11:00 ` [PATCH 2/9] mm, oom: introduce oom reaper Michal Hocko
2016-03-22 11:00   ` Michal Hocko
2016-03-22 22:45   ` Andrew Morton
2016-03-22 22:45     ` Andrew Morton
2016-03-22 22:58     ` Hugh Dickins
2016-03-22 22:58       ` Hugh Dickins
2016-03-22 11:00 ` [PATCH 3/9] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space Michal Hocko
2016-03-22 11:00   ` Michal Hocko
2016-03-22 11:00 ` [PATCH 4/9] mm, oom_reaper: report success/failure Michal Hocko
2016-03-22 11:00   ` Michal Hocko
2016-03-22 11:00 ` [PATCH 5/9] mm, oom_reaper: implement OOM victims queuing Michal Hocko
2016-03-22 11:00   ` Michal Hocko
2016-03-22 11:00 ` [PATCH 6/9] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task Michal Hocko
2016-03-22 11:00   ` Michal Hocko
2016-03-22 11:00 ` [PATCH 7/9] oom: make oom_reaper_list single linked Michal Hocko
2016-03-22 11:00   ` Michal Hocko
2016-03-22 11:00 ` [PATCH 8/9] oom: make oom_reaper freezable Michal Hocko
2016-03-22 11:00   ` Michal Hocko
2016-03-22 11:00 ` [PATCH 9/9] oom, oom_reaper: protect oom_reaper_list using simpler way Michal Hocko
2016-03-22 11:00   ` Michal Hocko
2016-03-22 22:08 ` [PATCH 0/9] oom reaper v6 David Rientjes
2016-03-22 22:08   ` David Rientjes
2016-03-23 11:11   ` Tetsuo Handa
2016-03-23 11:11     ` Tetsuo Handa
2016-03-23 12:07     ` Michal Hocko
2016-03-23 12:07       ` Michal Hocko
