* [PATCH 0/5] oom reaper v5
@ 2016-02-03 13:13 ` Michal Hocko
  0 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-03 13:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Arcangeli, Rik van Riel,
	linux-mm, LKML

Hi,
I am reposting the whole patchset on top of mmotm, with the previous
version of the patchset reverted, for easier review. The series
applies cleanly on top of the current Linus tree as well.

The previous version was posted at http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
I have tried to address most of the feedback. There was some push
for extending the current implementation further, but I do not feel
comfortable doing that right now. I believe that we should start
as simple as possible and add extensions on top. That shouldn't be
hard with the current architecture.

Wrt. the previous version, I have added patches 4 and 5. Patch 4 reports
success/failure of reaping a task, which is useful to see how the reaper
operates. Patch 5 implements a more robust API between the oom
killer and the oom reaper: more tasks can be queued for the
reaper at the same time, rather than the original single task mode.

Patch 1 also dropped the oom_reaper thread priority handling as per
David.  I ended up keeping the vma filtering code inside __oom_reap_task.
I still believe this is a better fit because the rules are specific
to the reaper; they cannot be shared with a larger code base.

In the meantime I have prepared a down_write_killable rw_semaphore
variant http://lkml.kernel.org/r/1454444369-2146-1-git-send-email-mhocko@kernel.org
and also have a tentative patch to convert some users of mmap_sem for
write to use the killable version. This needs more checking, though, but
I guess I will have something ready in 2 weeks or so (I will be on
vacation next week).
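
For an idea of what such a conversion would look like, here is a
minimal, hypothetical sketch (not part of this series; real call sites
will need their own error paths):

	/* before: waits uninterruptibly even for a killed task */
	down_write(&mm->mmap_sem);

	/* after: a fatal signal aborts the wait */
	if (down_write_killable(&mm->mmap_sem))
		return -EINTR;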

For the general description of the oom_reaper functionality, please
refer to Patch 1.

I would be really grateful if we could postpone any
functional/semantic enhancements for later discussion and focus on the
correctness of these particular patches, as there were no fundamental
objections to the current approach.

Thanks!


* [PATCH 1/5] mm, oom: introduce oom reaper
  2016-02-03 13:13 ` Michal Hocko
@ 2016-02-03 13:13   ` Michal Hocko
  0 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-03 13:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Arcangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
independently brought up by Oleg Nesterov.

The OOM killer currently allows killing only a single task in the hope
that the task will terminate in a reasonable time and free up its
memory.  Such a task (the oom victim) gets access to memory reserves
via mark_oom_victim to allow forward progress should there be a need
for additional memory during the exit path.
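
For context, TIF_MEMDIE is what grants that access: the page allocator
lifts the watermark checks for such a task. Roughly (a simplified
sketch of the existing allocator logic, not something added here):

	/* in gfp_to_alloc_flags(), simplified */
	if (!in_interrupt() &&
	    ((current->flags & PF_MEMALLOC) ||
	     unlikely(test_thread_flag(TIF_MEMDIE))))
		alloc_flags |= ALLOC_NO_WATERMARKS;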

It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above:
the OOM victim might take an unbounded amount of time to exit because
it might be blocked in the uninterruptible state waiting for an event
(e.g. a lock) held by another task looping in the page
allocator.
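
Schematically (a hypothetical interleaving illustrating the above):

    OOM victim                          another task
                                        mutex_lock(&lock);
                                        kmalloc();  // loops in the page
                                                    // allocator, waiting
                                                    // for memory to be freed
    mutex_lock(&lock);  // blocks uninterruptibly
    // selected as OOM victim, but
    // cannot exit and free its memory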

This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped out memory
owned by the oom victim, under the assumption that such memory won't
be needed when its owner is killed and kicked out of userspace anyway.
There is one notable exception, though: if the OOM victim was in the
process of coredumping, the result would be incomplete. This is
considered a reasonable constraint because the overall system health is
more important than the debuggability of a particular application.

A kernel thread has been chosen because we need a reliable way of
invocation. A workqueue context is not appropriate because all the
workers might be busy (e.g. allocating memory). Kswapd, which sounds
like another good fit, is not appropriate either because it might get
blocked on locks during reclaim as well.

oom_reaper has to take mmap_sem of the target task for reading, so the
solution is not 100% reliable because the semaphore might be held or
blocked for write, but the probability is reduced considerably compared
to basically any lock blocking forward progress as described above. In
order to avoid blocking on the lock without any forward progress we use
only a trylock and retry it 10 times with a short sleep in between.
Users of mmap_sem which need it for write should be carefully reviewed
to use _killable waiting as much as possible, and to reduce allocation
requests done with the lock held to the absolute minimum, to reduce the
risk even further.

The API between the oom killer and the oom reaper is quite trivial.
wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only a
NULL->mm transition, and oom_reaper clears it atomically once it is done
with the work. This means that only a single mm_struct can be reaped at
a time. As the operation is potentially disruptive we try to limit it to
the necessary minimum, and the reaper blocks any updates while it
operates on an mm. The mm_struct is pinned by mm_count to allow a
parallel exit_mmap, and a race is detected by
atomic_inc_not_zero(mm_users).

Changes since v4
- drop MAX_RT_PRIO-1 as per David - memcg/cpuset/mempolicy OOM killing
  might interfere with the rest of the system
Changes since v3
- many style/compile fixups by Andrew
- unmap_mapping_range_tree needs full initialization of zap_details
  to prevent missed unmaps and a follow-up BUG_ON during truncate,
  resp. misaccounting - Kirill/Andrew
- exclude mlocked pages because they need an explicit munlock by Kirill
- use subsys_initcall instead of module_init - Paul Gortmaker
- do not tear down mm if it is shared with the global init because this
  could lead to SEGV and panic - Tetsuo
Changes since v2
- fix mm_count reference leak reported by Tetsuo
- make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
- use wait_event_freezable rather than open coded wait loop - suggested
  by Tetsuo
Changes since v1
- fix the screwed up detail->check_swap_entries - Johannes
- do not use kthread_should_stop because that would need a cleanup
  and we do not have anybody to stop us - Tetsuo
- move wake_oom_reaper to oom_kill_process because we have to wait
  for all tasks sharing the same mm to get killed - Tetsuo
- do not reap mm structs which are shared with unkillable tasks - Tetsuo

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h |   2 +
 mm/internal.h      |   5 ++
 mm/memory.c        |  17 +++---
 mm/oom_kill.c      | 151 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 4 files changed, 162 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 05b9fbbceb01..8a67ea2a6323 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1111,6 +1111,8 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
+	bool ignore_dirty;			/* Ignore dirty pages */
+	bool check_swap_entries;		/* Check also swap entries */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/internal.h b/mm/internal.h
index ed90298c12db..cac6eb458727 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -42,6 +42,11 @@ extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
+void unmap_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end,
+			     struct zap_details *details);
+
 static inline void set_page_count(struct page *page, int v)
 {
 	atomic_set(&page->_count, v);
diff --git a/mm/memory.c b/mm/memory.c
index 42d5bec9bb91..c158dc53ca3d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1105,6 +1105,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			if (!PageAnon(page)) {
 				if (pte_dirty(ptent)) {
+					/*
+					 * oom_reaper cannot tear down dirty
+					 * pages
+					 */
+					if (unlikely(details && details->ignore_dirty))
+						continue;
 					force_flush = 1;
 					set_page_dirty(page);
 				}
@@ -1123,8 +1129,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			}
 			continue;
 		}
-		/* If details->check_mapping, we leave swap entries. */
-		if (unlikely(details))
+		/* only check swap_entries if explicitly asked for in details */
+		if (unlikely(details && !details->check_swap_entries))
 			continue;
 
 		entry = pte_to_swp_entry(ptent);
@@ -1229,7 +1235,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 	return addr;
 }
 
-static void unmap_page_range(struct mmu_gather *tlb,
+void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
 			     struct zap_details *details)
@@ -1237,9 +1243,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
 	pgd_t *pgd;
 	unsigned long next;
 
-	if (details && !details->check_mapping)
-		details = NULL;
-
 	BUG_ON(addr >= end);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
@@ -2419,7 +2422,7 @@ static inline void unmap_mapping_range_tree(struct rb_root *root,
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows)
 {
-	struct zap_details details;
+	struct zap_details details = { };
 	pgoff_t hba = holebegin >> PAGE_SHIFT;
 	pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e3ab892903ee..9a0e4e5f50b4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,6 +35,11 @@
 #include <linux/freezer.h>
 #include <linux/ftrace.h>
 #include <linux/ratelimit.h>
+#include <linux/kthread.h>
+#include <linux/init.h>
+
+#include <asm/tlb.h>
+#include "internal.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/oom.h>
@@ -406,6 +411,133 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
 bool oom_killer_disabled __read_mostly;
 
+#ifdef CONFIG_MMU
+/*
+ * OOM Reaper kernel thread which tries to reap the memory used by the OOM
+ * victim (if that is possible) to help the OOM killer to move on.
+ */
+static struct task_struct *oom_reaper_th;
+static struct mm_struct *mm_to_reap;
+static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
+
+static bool __oom_reap_vmas(struct mm_struct *mm)
+{
+	struct mmu_gather tlb;
+	struct vm_area_struct *vma;
+	struct zap_details details = {.check_swap_entries = true,
+				      .ignore_dirty = true};
+	bool ret = true;
+
+	/* We might have raced with exit path */
+	if (!atomic_inc_not_zero(&mm->mm_users))
+		return true;
+
+	if (!down_read_trylock(&mm->mmap_sem)) {
+		ret = false;
+		goto out;
+	}
+
+	tlb_gather_mmu(&tlb, mm, 0, -1);
+	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+		if (is_vm_hugetlb_page(vma))
+			continue;
+
+		/*
+		 * mlocked VMAs require explicit munlocking before unmap.
+		 * Let's keep it simple here and skip such VMAs.
+		 */
+		if (vma->vm_flags & VM_LOCKED)
+			continue;
+
+		/*
+		 * Only anonymous pages have a good chance to be dropped
+		 * without additional steps which we cannot afford as we
+		 * are OOM already.
+		 *
+		 * We do not even care about fs backed pages because all
+		 * which are reclaimable have already been reclaimed and
+		 * we do not want to block exit_mmap by keeping mm ref
+		 * count elevated without a good reason.
+		 */
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
+					 &details);
+	}
+	tlb_finish_mmu(&tlb, 0, -1);
+	up_read(&mm->mmap_sem);
+out:
+	mmput(mm);
+	return ret;
+}
+
+static void oom_reap_vmas(struct mm_struct *mm)
+{
+	int attempts = 0;
+
+	/* Retry the down_read_trylock(mmap_sem) a few times */
+	while (attempts++ < 10 && !__oom_reap_vmas(mm))
+		schedule_timeout_idle(HZ/10);
+
+	/* Drop a reference taken by wake_oom_reaper */
+	mmdrop(mm);
+}
+
+static int oom_reaper(void *unused)
+{
+	while (true) {
+		struct mm_struct *mm;
+
+		wait_event_freezable(oom_reaper_wait,
+				     (mm = READ_ONCE(mm_to_reap)));
+		oom_reap_vmas(mm);
+		WRITE_ONCE(mm_to_reap, NULL);
+	}
+
+	return 0;
+}
+
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+	struct mm_struct *old_mm;
+
+	if (!oom_reaper_th)
+		return;
+
+	/*
+	 * Pin the given mm. Use mm_count instead of mm_users because
+	 * we do not want to delay the address space tear down.
+	 */
+	atomic_inc(&mm->mm_count);
+
+	/*
+	 * Make sure that only a single mm is ever queued for the reaper
+	 * because multiple are not necessary and the operation might be
+	 * disruptive so better reduce it to the bare minimum.
+	 */
+	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
+	if (!old_mm)
+		wake_up(&oom_reaper_wait);
+	else
+		mmdrop(mm);
+}
+
+static int __init oom_init(void)
+{
+	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+	if (IS_ERR(oom_reaper_th)) {
+		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
+				PTR_ERR(oom_reaper_th));
+		oom_reaper_th = NULL;
+	}
+	return 0;
+}
+subsys_initcall(oom_init)
+#else
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+}
+#endif
+
 /**
  * mark_oom_victim - mark the given task as OOM victim
  * @tsk: task to mark
@@ -511,6 +643,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
+	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -601,17 +734,23 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD))
-			continue;
-		if (is_global_init(p))
-			continue;
-		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+			/*
+			 * We cannot use oom_reaper for the mm shared by this
+			 * process because it wouldn't get killed and so the
+			 * memory might be still used.
+			 */
+			can_oom_reap = false;
 			continue;
-
+		}
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
+	if (can_oom_reap)
+		wake_oom_reaper(mm);
+
 	mmdrop(mm);
 	put_task_struct(victim);
 }
-- 
2.7.0


* [PATCH 2/5] oom reaper: handle mlocked pages
  2016-02-03 13:13 ` Michal Hocko
@ 2016-02-03 13:13   ` Michal Hocko
  0 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-03 13:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Arcangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__oom_reap_vmas currently skips over all mlocked vmas because they need
special treatment before they are unmapped. This is primarily done for
simplicity. There is no reason to skip over them, though, and doing so
only reduces the amount of reclaimed memory. This is safe from the
semantic point of view because try_to_unmap_one during the rmap walk
would keep telling reclaim to cull the page and mlock it again.

munlock_vma_pages_all is also safe to be called from the oom reaper
context because it doesn't sit on any locks but mmap_sem (for read).
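
For reference, the safety argument relies on the VM_LOCKED handling at
the beginning of try_to_unmap_one; simplified from the existing rmap
code (not part of this patch):

	/* in try_to_unmap_one(), simplified */
	if (!(flags & TTU_IGNORE_MLOCK)) {
		if (vma->vm_flags & VM_LOCKED)
			/* reclaim will cull and re-mlock the page */
			goto out_mlock;
	}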

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9a0e4e5f50b4..840e03986497 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -443,13 +443,6 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
 			continue;
 
 		/*
-		 * mlocked VMAs require explicit munlocking before unmap.
-		 * Let's keep it simple here and skip such VMAs.
-		 */
-		if (vma->vm_flags & VM_LOCKED)
-			continue;
-
-		/*
 		 * Only anonymous pages have a good chance to be dropped
 		 * without additional steps which we cannot afford as we
 		 * are OOM already.
@@ -459,9 +452,12 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
 		 * we do not want to block exit_mmap by keeping mm ref
 		 * count elevated without a good reason.
 		 */
-		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
+			if (vma->vm_flags & VM_LOCKED)
+				munlock_vma_pages_all(vma);
 			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
 					 &details);
+		}
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
 	up_read(&mm->mmap_sem);
-- 
2.7.0


* [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-03 13:13 ` Michal Hocko
@ 2016-02-03 13:13   ` Michal Hocko
  0 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-03 13:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Arcangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

When oom_reaper manages to unmap all the eligible vmas there shouldn't
be much freeable memory held by the oom victim left anymore, so it
makes sense to clear the TIF_MEMDIE flag for the victim and allow the
OOM killer to select another task.

The lack of TIF_MEMDIE also means that the victim cannot access memory
reserves anymore, but that shouldn't be a problem because it would get
the access again if it needs to allocate and hits the OOM killer again
due to the fatal_signal_pending resp. PF_EXITING check. We can safely
hide the task from the OOM killer because it is clearly not a good
candidate anymore, as everything reclaimable has been torn down already.

This patch will allow capping the time an OOM victim can keep
TIF_MEMDIE, and thus hold off further global OOM killer actions, granted
the oom reaper is able to take mmap_sem for the associated mm struct.
This is not guaranteed now, but further steps should make sure that
mmap_sem for write is taken killable, which will help to reduce such
lock contention. This is not done by this patch.

Note that exit_oom_victim might be called on a remote task from
__oom_reap_task now, so we have to check and clear the flag atomically;
otherwise we might race and underflow oom_victims or wake up
waiters too early.
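
Without the atomic test-and-clear, both the exiting victim and the
reaper could observe TIF_MEMDIE set and decrement oom_victims twice
(a hypothetical interleaving):

    exiting victim                      __oom_reap_task
    test_thread_flag(TIF_MEMDIE) // set
                                        sees TIF_MEMDIE still set
    clear_thread_flag(TIF_MEMDIE)
    atomic_dec_return(&oom_victims)
                                        clear + atomic_dec_return
                                        // oom_victims underflows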

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h |  2 +-
 kernel/exit.c       |  2 +-
 mm/oom_kill.c       | 73 +++++++++++++++++++++++++++++++++++------------------
 3 files changed, 50 insertions(+), 27 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 03e6257321f0..45993b840ed6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -91,7 +91,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 
 extern bool out_of_memory(struct oom_control *oc);
 
-extern void exit_oom_victim(void);
+extern void exit_oom_victim(struct task_struct *tsk);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/kernel/exit.c b/kernel/exit.c
index 07110c6020a0..4b1c6e2658bd 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -436,7 +436,7 @@ static void exit_mm(struct task_struct *tsk)
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim();
+		exit_oom_victim(tsk);
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 840e03986497..8e345126d73e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -417,20 +417,36 @@ bool oom_killer_disabled __read_mostly;
  * victim (if that is possible) to help the OOM killer to move on.
  */
 static struct task_struct *oom_reaper_th;
-static struct mm_struct *mm_to_reap;
+static struct task_struct *task_to_reap;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 
-static bool __oom_reap_vmas(struct mm_struct *mm)
+static bool __oom_reap_task(struct task_struct *tsk)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	struct task_struct *p;
 	struct zap_details details = {.check_swap_entries = true,
 				      .ignore_dirty = true};
 	bool ret = true;
 
-	/* We might have raced with exit path */
-	if (!atomic_inc_not_zero(&mm->mm_users))
+	/*
+	 * Make sure we find the associated mm_struct even when the particular
+	 * thread has already terminated and cleared its mm.
+	 * We might have race with exit path so consider our work done if there
+	 * is no mm.
+	 */
+	p = find_lock_task_mm(tsk);
+	if (!p)
+		return true;
+
+	mm = p->mm;
+	if (!atomic_inc_not_zero(&mm->mm_users)) {
+		task_unlock(p);
 		return true;
+	}
+
+	task_unlock(p);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
@@ -461,60 +477,66 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
 	up_read(&mm->mmap_sem);
+
+	/*
+	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
+	 * reasonably reclaimable memory anymore. OOM killer can continue
+	 * by selecting other victim if unmapping hasn't led to any
+	 * improvements. This also means that selecting this task doesn't
+	 * make any sense.
+	 */
+	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+	exit_oom_victim(tsk);
 out:
 	mmput(mm);
 	return ret;
 }
 
-static void oom_reap_vmas(struct mm_struct *mm)
+static void oom_reap_task(struct task_struct *tsk)
 {
 	int attempts = 0;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < 10 && !__oom_reap_vmas(mm))
+	while (attempts++ < 10 && !__oom_reap_task(tsk))
 		schedule_timeout_idle(HZ/10);
 
 	/* Drop a reference taken by wake_oom_reaper */
-	mmdrop(mm);
+	put_task_struct(tsk);
 }
 
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct mm_struct *mm;
+		struct task_struct *tsk;
 
 		wait_event_freezable(oom_reaper_wait,
-				     (mm = READ_ONCE(mm_to_reap)));
-		oom_reap_vmas(mm);
-		WRITE_ONCE(mm_to_reap, NULL);
+				     (tsk = READ_ONCE(task_to_reap)));
+		oom_reap_task(tsk);
+		WRITE_ONCE(task_to_reap, NULL);
 	}
 
 	return 0;
 }
 
-static void wake_oom_reaper(struct mm_struct *mm)
+static void wake_oom_reaper(struct task_struct *tsk)
 {
-	struct mm_struct *old_mm;
+	struct task_struct *old_tsk;
 
 	if (!oom_reaper_th)
 		return;
 
-	/*
-	 * Pin the given mm. Use mm_count instead of mm_users because
-	 * we do not want to delay the address space tear down.
-	 */
-	atomic_inc(&mm->mm_count);
+	get_task_struct(tsk);
 
 	/*
 	 * Make sure that only a single mm is ever queued for the reaper
 	 * because multiple are not necessary and the operation might be
 	 * disruptive so better reduce it to the bare minimum.
 	 */
-	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
-	if (!old_mm)
+	old_tsk = cmpxchg(&task_to_reap, NULL, tsk);
+	if (!old_tsk)
 		wake_up(&oom_reaper_wait);
 	else
-		mmdrop(mm);
+		put_task_struct(tsk);
 }
 
 static int __init oom_init(void)
@@ -529,7 +551,7 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static void wake_oom_reaper(struct mm_struct *mm)
+static void wake_oom_reaper(struct task_struct *mm)
 {
 }
 #endif
@@ -560,9 +582,10 @@ void mark_oom_victim(struct task_struct *tsk)
 /**
  * exit_oom_victim - note the exit of an OOM victim
  */
-void exit_oom_victim(void)
+void exit_oom_victim(struct task_struct *tsk)
 {
-	clear_thread_flag(TIF_MEMDIE);
+	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
+		return;
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
@@ -745,7 +768,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	rcu_read_unlock();
 
 	if (can_oom_reap)
-		wake_oom_reaper(mm);
+		wake_oom_reaper(victim);
 
 	mmdrop(mm);
 	put_task_struct(victim);
-- 
2.7.0


* [PATCH 4/5] mm, oom_reaper: report success/failure
  2016-02-03 13:13 ` Michal Hocko
@ 2016-02-03 13:13   ` Michal Hocko
  0 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-03 13:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Arcangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Inform about successful/failed oom_reaper attempts and dump all the
held locks to tell us more about who is blocking the progress.
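
With the patch applied the output looks along these lines (the values
here are made up for illustration):

	oom_reaper: reaped process 1234 (memhog) anon-rss:2048kB, file-rss:0kB, shmem-rss:0kB

and when all retries fail:

	oom_reaper: unable to reap pid:1234 (memhog)

followed by the lock dump from debug_show_all_locks().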

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8e345126d73e..b87acdca2a41 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -420,6 +420,7 @@ static struct task_struct *oom_reaper_th;
 static struct task_struct *task_to_reap;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 
+#define K(x) ((x) << (PAGE_SHIFT-10))
 static bool __oom_reap_task(struct task_struct *tsk)
 {
 	struct mmu_gather tlb;
@@ -476,6 +477,11 @@ static bool __oom_reap_task(struct task_struct *tsk)
 		}
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
+	pr_info("oom_reaper: reaped process :%d (%s) anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lulB\n",
+			task_pid_nr(tsk), tsk->comm,
+			K(get_mm_counter(mm, MM_ANONPAGES)),
+			K(get_mm_counter(mm, MM_FILEPAGES)),
+			K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	up_read(&mm->mmap_sem);
 
 	/*
@@ -492,14 +498,21 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	return ret;
 }
 
+#define MAX_OOM_REAP_RETRIES 10
 static void oom_reap_task(struct task_struct *tsk)
 {
 	int attempts = 0;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < 10 && !__oom_reap_task(tsk))
+	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
 		schedule_timeout_idle(HZ/10);
 
+	if (attempts > MAX_OOM_REAP_RETRIES) {
+		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+				task_pid_nr(tsk), tsk->comm);
+		debug_show_all_locks();
+	}
+
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
 }
@@ -646,7 +659,6 @@ static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 	return false;
 }
 
-#define K(x) ((x) << (PAGE_SHIFT-10))
 /*
  * Must be called while holding a reference to p, which will be released upon
  * returning.
-- 
2.7.0


* [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing
  2016-02-03 13:13 ` Michal Hocko
@ 2016-02-03 13:14   ` Michal Hocko
  0 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-03 13:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Arcangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

wake_oom_reaper has allowed only one oom victim to be queued. The main
reason for that was simplicity; other solutions would require some way
of queuing. The current approach is racy, and that was deemed
sufficient as the oom_reaper is considered a best effort approach
to help with oom handling when the OOM victim cannot terminate in a
reasonable time. The race could lead to missing an oom victim which can
get stuck:

out_of_memory
  wake_oom_reaper
    cmpxchg // OK
    			oom_reaper
			  oom_reap_task
			    __oom_reap_task
oom_victim terminates
			      atomic_inc_not_zero // fail
out_of_memory
  wake_oom_reaper
    cmpxchg // fails
			  task_to_reap = NULL

This race requires 2 OOM invocations in a short time period, which is not
very likely but certainly not impossible, e.g. when the original victim
has not released much memory for some reason.

The situation would improve considerably if wake_oom_reaper used more
robust queuing. This is what this patch implements. This means adding an
oom_reaper_list list_head into task_struct (it eats a hole before the
embedded thread_struct for that purpose) and an oom_reaper_lock spinlock
for queuing synchronization. wake_oom_reaper will then add the task to
the queue and oom_reaper will dequeue it.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h |  3 +++
 mm/oom_kill.c         | 36 +++++++++++++++++++-----------------
 2 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a9cdd032b988..c25996c336de 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1814,6 +1814,9 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
+#ifdef CONFIG_MMU
+	struct list_head oom_reaper_list;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b87acdca2a41..87d644c97ac9 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -417,8 +417,10 @@ bool oom_killer_disabled __read_mostly;
  * victim (if that is possible) to help the OOM killer to move on.
  */
 static struct task_struct *oom_reaper_th;
-static struct task_struct *task_to_reap;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
+static LIST_HEAD(oom_reaper_list);
+static DEFINE_SPINLOCK(oom_reaper_lock);
+
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
 static bool __oom_reap_task(struct task_struct *tsk)
@@ -520,12 +522,20 @@ static void oom_reap_task(struct task_struct *tsk)
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct task_struct *tsk;
+		struct task_struct *tsk = NULL;
 
 		wait_event_freezable(oom_reaper_wait,
-				     (tsk = READ_ONCE(task_to_reap)));
-		oom_reap_task(tsk);
-		WRITE_ONCE(task_to_reap, NULL);
+				     (!list_empty(&oom_reaper_list)));
+		spin_lock(&oom_reaper_lock);
+		if (!list_empty(&oom_reaper_list)) {
+			tsk = list_first_entry(&oom_reaper_list,
+					struct task_struct, oom_reaper_list);
+			list_del(&tsk->oom_reaper_list);
+		}
+		spin_unlock(&oom_reaper_lock);
+
+		if (tsk)
+			oom_reap_task(tsk);
 	}
 
 	return 0;
@@ -533,23 +543,15 @@ static int oom_reaper(void *unused)
 
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	struct task_struct *old_tsk;
-
 	if (!oom_reaper_th)
 		return;
 
 	get_task_struct(tsk);
 
-	/*
-	 * Make sure that only a single mm is ever queued for the reaper
-	 * because multiple are not necessary and the operation might be
-	 * disruptive so better reduce it to the bare minimum.
-	 */
-	old_tsk = cmpxchg(&task_to_reap, NULL, tsk);
-	if (!old_tsk)
-		wake_up(&oom_reaper_wait);
-	else
-		put_task_struct(tsk);
+	spin_lock(&oom_reaper_lock);
+	list_add(&tsk->oom_reaper_list, &oom_reaper_list);
+	spin_unlock(&oom_reaper_lock);
+	wake_up(&oom_reaper_wait);
 }
 
 static int __init oom_init(void)
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH 4/5] mm, oom_reaper: report success/failure
  2016-02-03 13:13   ` Michal Hocko
@ 2016-02-03 23:10     ` David Rientjes
  -1 siblings, 0 replies; 112+ messages in thread
From: David Rientjes @ 2016-02-03 23:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

On Wed, 3 Feb 2016, Michal Hocko wrote:

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 8e345126d73e..b87acdca2a41 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -420,6 +420,7 @@ static struct task_struct *oom_reaper_th;
>  static struct task_struct *task_to_reap;
>  static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>  
> +#define K(x) ((x) << (PAGE_SHIFT-10))
>  static bool __oom_reap_task(struct task_struct *tsk)
>  {
>  	struct mmu_gather tlb;
> @@ -476,6 +477,11 @@ static bool __oom_reap_task(struct task_struct *tsk)
>  		}
>  	}
>  	tlb_finish_mmu(&tlb, 0, -1);
> +	pr_info("oom_reaper: reaped process :%d (%s) anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lulB\n",
> +			task_pid_nr(tsk), tsk->comm,
> +			K(get_mm_counter(mm, MM_ANONPAGES)),
> +			K(get_mm_counter(mm, MM_FILEPAGES)),
> +			K(get_mm_counter(mm, MM_SHMEMPAGES)));
>  	up_read(&mm->mmap_sem);
>  
>  	/*

This is a bit misleading: it would appear that the rss values are what
was reaped when in fact they are just the values of the mm being reaped.
We have already printed these values as an artifact in the kernel log.

I think it would be helpful to show anon-rss after reaping, however, so we 
can compare to the previous anon-rss that was reported.  And, I agree that 
leaving behind a message in the kernel log that reaping has been 
successful is worthwhile.  So this line should just show what anon-rss is 
after reaping and make it clear that this is not the memory reaped.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 1/5] mm, oom: introduce oom reaper
  2016-02-03 13:13   ` Michal Hocko
@ 2016-02-03 23:48     ` David Rientjes
  -1 siblings, 0 replies; 112+ messages in thread
From: David Rientjes @ 2016-02-03 23:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

On Wed, 3 Feb 2016, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> independently brought up by Oleg Nesterov.
> 
> The OOM killer currently allows killing only a single task, in the
> hope that the task will terminate in a reasonable time and free up its
> memory.  Such a task (oom victim) will get access to memory reserves
> via mark_oom_victim to allow forward progress should there be a need
> for additional memory during the exit path.
> 
> It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
> construct workloads which break the core assumption mentioned above, and
> then the OOM victim might take an unbounded amount of time to exit
> because it might be blocked in the uninterruptible state waiting for an
> event (e.g. a lock) which is held by another task looping in the page
> allocator.
> 
> This patch reduces the probability of such a lockup by introducing a
> specialized kernel thread (oom_reaper) which tries to reclaim additional
> memory by preemptively reaping the anonymous or swapped out memory
> owned by the oom victim, under the assumption that such memory won't
> be needed once its owner is killed and kicked out of userspace anyway.
> There is one notable exception to this, though: if the OOM victim was
> in the process of coredumping, the result would be incomplete. This is
> considered a reasonable constraint because the overall system health is
> more important than debuggability of a particular application.
> 
> A kernel thread has been chosen because we need a reliable way of
> invocation; a workqueue context is not appropriate because all the
> workers might be busy (e.g. allocating memory), and kswapd, which sounds
> like another good fit, is not appropriate either because it might get
> blocked on locks during reclaim as well.
> 
> oom_reaper has to take mmap_sem on the target task for reading, so the
> solution is not 100% reliable because the semaphore might be held or
> blocked for write, but the probability is reduced considerably wrt.
> basically any lock blocking forward progress as described above. In
> order to prevent blocking on the lock without any forward progress we
> use only a trylock and retry 10 times with a short sleep in between.
> Users of mmap_sem which need it for write should be carefully reviewed
> to use _killable waiting as much as possible and reduce allocation
> requests done with the lock held to the absolute minimum to reduce the
> risk even further.
> 
> The API between the oom killer and the oom reaper is quite trivial.
> wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only a
> NULL->mm transition and oom_reaper clears this atomically once it is
> done with the work. This means that only a single mm_struct can be
> reaped at a time. As the operation is potentially disruptive we try to
> limit it to the necessary minimum, and the reaper blocks any updates
> while it operates on an mm. mm_struct is pinned by mm_count to allow a
> parallel exit_mmap, and a race is detected by
> atomic_inc_not_zero(mm_users).
> 
> Changes since v4
> - drop MAX_RT_PRIO-1 as per David - memcg/cpuset/mempolicy OOM killing
>   might interfere with the rest of the system
> Changes since v3
> - many style/compile fixups by Andrew
> - unmap_mapping_range_tree needs full initialization of zap_details
>   to prevent from missing unmaps and follow up BUG_ON during truncate
>   resp. misaccounting - Kirill/Andrew
> - exclude mlocked pages because they need an explicit munlock by Kirill
> - use subsys_initcall instead of module_init - Paul Gortmaker
> - do not tear down mm if it is shared with the global init because this
>   could lead to SEGV and panic - Tetsuo
> Changes since v2
> - fix mm_count reference leak reported by Tetsuo
> - make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
> - use wait_event_freezable rather than open coded wait loop - suggested
>   by Tetsuo
> Changes since v1
> - fix the screwed up detail->check_swap_entries - Johannes
> - do not use kthread_should_stop because that would need a cleanup
>   and we do not have anybody to stop us - Tetsuo
> - move wake_oom_reaper to oom_kill_process because we have to wait
>   for all tasks sharing the same mm to get killed - Tetsuo
> - do not reap mm structs which are shared with unkillable tasks - Tetsuo
> 
> Suggested-by: Oleg Nesterov <oleg@redhat.com>
> Suggested-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: David Rientjes <rientjes@google.com>

I think all the patches could really have been squashed together because 
subsequent patches just overwrite already added code.  I was going to 
suggest not doing atomic_inc(&mm->mm_count) in wake_oom_reaper() and 
change oom_kill_process() to do

	if (can_oom_reap)
		wake_oom_reaper(mm);
	else
		mmdrop(mm);

but I see that we don't even touch mm->mm_count after the third patch.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 2/5] oom reaper: handle mlocked pages
  2016-02-03 13:13   ` Michal Hocko
@ 2016-02-03 23:57     ` David Rientjes
  -1 siblings, 0 replies; 112+ messages in thread
From: David Rientjes @ 2016-02-03 23:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

On Wed, 3 Feb 2016, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> __oom_reap_vmas currently skips over all mlocked vmas because they need
> special treatment before they are unmapped. This is primarily done for
> simplicity. There is no reason to skip over them and thereby reduce the
> amount of reclaimed memory. This is safe from the semantic point of view
> because try_to_unmap_one during the rmap walk would keep telling reclaim
> to cull the page back and mlock it again.
> 
> munlock_vma_pages_all is also safe to be called from the oom reaper
> context because it doesn't sit on any locks other than mmap_sem (for read).
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 1/5] mm, oom: introduce oom reaper
  2016-02-03 23:48     ` David Rientjes
@ 2016-02-04  6:41       ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-04  6:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Wed 03-02-16 15:48:18, David Rientjes wrote:
> On Wed, 3 Feb 2016, Michal Hocko wrote:
> 
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > [...]
> > 
> > Suggested-by: Oleg Nesterov <oleg@redhat.com>
> > Suggested-by: Mel Gorman <mgorman@suse.de>
> > Acked-by: Mel Gorman <mgorman@suse.de>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Acked-by: David Rientjes <rientjes@google.com>

Thanks!
 
> I think all the patches could really have been squashed together because 
> subsequent patches just overwrite already added code. 

The primary reason is better bisectability and the incremental nature of
the changes.

> I was going to 
> suggest not doing atomic_inc(&mm->mm_count) in wake_oom_reaper() and 
> change oom_kill_process() to do

I found it easier to follow the reference counting that way (pin the mm
at the place where I hand it over to the async thread).
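
Schematically, the pairing looks like this (just a sketch of the pattern
used by the series; hand_over_to_reaper is a hypothetical stand-in for
the queuing done by wake_oom_reaper):

	/* producer (oom killer context): pin the task before the handoff */
	get_task_struct(tsk);
	hand_over_to_reaper(tsk);

	/* consumer (oom_reaper kthread), possibly much later */
	oom_reap_task(tsk);	/* drops the reference via put_task_struct */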

> 
> 	if (can_oom_reap)
> 		wake_oom_reaper(mm);
> 	else
> 		mmdrop(mm);
> 
> but I see that we don't even touch mm->mm_count after the third patch.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 4/5] mm, oom_reaper: report success/failure
  2016-02-03 23:10     ` David Rientjes
@ 2016-02-04  6:46       ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-04  6:46 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Wed 03-02-16 15:10:57, David Rientjes wrote:
> On Wed, 3 Feb 2016, Michal Hocko wrote:
> 
> > [...]
> 
> This is a bit misleading: it would appear that the rss values are what
> was reaped when in fact they are just the values of the mm being reaped.
> We have already printed these values as an artifact in the kernel log.

Yes, and the idea was to provide the after state so that before and
after can be compared. That's why I have kept a similar format. I just
dropped the virtual memory size because that doesn't make any sense in
this context now.

> I think it would be helpful to show anon-rss after reaping, however, so we 
> can compare to the previous anon-rss that was reported.  And, I agree that 
> leaving behind a message in the kernel log that reaping has been 
> successful is worthwhile.  So this line should just show what anon-rss is 
> after reaping and make it clear that this is not the memory reaped.

Does
"oom_reaper: reaped process %d (%s) current memory anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB "

sound any better?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing
  2016-02-03 13:14   ` Michal Hocko
@ 2016-02-04 10:49     ` Tetsuo Handa
  -1 siblings, 0 replies; 112+ messages in thread
From: Tetsuo Handa @ 2016-02-04 10:49 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: rientjes, mgorman, oleg, torvalds, hughd, andrea, riel, linux-mm,
	linux-kernel, mhocko

Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> wake_oom_reaper has allowed only one oom victim to be queued at a time.
> The main reason for that was simplicity, as other solutions would require
> some way of queuing. The current approach is racy, and that was deemed
> sufficient because the oom_reaper is considered a best effort approach
> to help with oom handling when the OOM victim cannot terminate in a
> reasonable time. The race could lead to missing an oom victim which can
> then get stuck:
> 
> out_of_memory
>   wake_oom_reaper
>     cmpxchg // OK
>     			oom_reaper
> 			  oom_reap_task
> 			    __oom_reap_task
> oom_victim terminates
> 			      atomic_inc_not_zero // fail
> out_of_memory
>   wake_oom_reaper
>     cmpxchg // fails
> 			  task_to_reap = NULL
> 
> This race requires 2 OOM invocations in a short time period, which is not
> very likely but certainly not impossible, e.g. when the original victim
> has not released much memory for some reason.
> 
> The situation would improve considerably if wake_oom_reaper used more
> robust queuing. This is what this patch implements. This means adding an
> oom_reaper_list list_head into task_struct (it eats a hole before the
> embedded thread_struct for that purpose) and an oom_reaper_lock spinlock
> for queuing synchronization. wake_oom_reaper will then add the task to
> the queue and oom_reaper will dequeue it.
> 

I think we want to rewrite this patch's description from a different point
of view.

As of "[PATCH 1/5] mm, oom: introduce oom reaper", we assumed that we try to
manage OOM livelock caused by system-wide OOM events using the OOM reaper.
Therefore, the OOM reaper had high scheduling priority and we considered side
effect of the OOM reaper as a reasonable constraint.

But as the discussion went on, we started trying to manage OOM livelock
caused by non system-wide OOM events (e.g. memcg OOM) using the OOM reaper.
Therefore, the OOM reaper now has normal scheduling priority. For non
system-wide OOM events, the OOM reaper's side effects might not be a
reasonable constraint. Some administrators might expect that the OOM reaper
does not break coredumping unless the system is under a system-wide OOM
event.

The race described in this patch's description sounds as if the 2 OOM
invocations were system-wide OOM events. If we consider only system-wide
OOM events, there is no need to keep task_to_reap != NULL after the OOM
reaper has found a task to reap (shown below), because the existing victim
will prevent the OOM killer from calling wake_oom_reaper().

----------------------------------------
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 012dd6f..c919ddb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1835,9 +1835,6 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
-#ifdef CONFIG_MMU
-	struct list_head oom_reaper_list;
-#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b42c6bc..fa6a302 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -422,10 +422,8 @@ bool oom_killer_disabled __read_mostly;
  * victim (if that is possible) to help the OOM killer to move on.
  */
 static struct task_struct *oom_reaper_th;
+static struct task_struct *task_to_reap;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static LIST_HEAD(oom_reaper_list);
-static DEFINE_SPINLOCK(oom_reaper_lock);
-
 
 static bool __oom_reap_task(struct task_struct *tsk)
 {
@@ -526,20 +524,11 @@ static void oom_reap_task(struct task_struct *tsk)
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct task_struct *tsk = NULL;
+		struct task_struct *tsk;
 
 		wait_event_freezable(oom_reaper_wait,
-				     (!list_empty(&oom_reaper_list)));
-		spin_lock(&oom_reaper_lock);
-		if (!list_empty(&oom_reaper_list)) {
-			tsk = list_first_entry(&oom_reaper_list,
-					struct task_struct, oom_reaper_list);
-			list_del(&tsk->oom_reaper_list);
-		}
-		spin_unlock(&oom_reaper_lock);
-
-		if (tsk)
-			oom_reap_task(tsk);
+				     (tsk = xchg(&task_to_reap, NULL)));
+		oom_reap_task(tsk);
 	}
 
 	return 0;
@@ -551,11 +540,11 @@ static void wake_oom_reaper(struct task_struct *tsk)
 		return;
 
 	get_task_struct(tsk);
-
-	spin_lock(&oom_reaper_lock);
-	list_add(&tsk->oom_reaper_list, &oom_reaper_list);
-	spin_unlock(&oom_reaper_lock);
-	wake_up(&oom_reaper_wait);
+	tsk = xchg(&task_to_reap, tsk);
+	if (!tsk)
+		wake_up(&oom_reaper_wait);
+	else
+		put_task_struct(tsk);
 }
 
 static int __init oom_init(void)
----------------------------------------

But if we consider non system-wide OOM events, it is not that unlikely to
hit this race. This queue is useful for situations where memcg1 and memcg2
hit memcg OOM at the same time and victim1 in memcg1 cannot terminate
immediately.

I expect parallel reaping (shown below) because, if we implement this
queue, there is no need to serialize victim tasks (e.g. to wait for the
reaping of victim1 in memcg1, which can take up to 1 second to complete,
before starting to reap victim2 in memcg2).

----------------------------------------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b42c6bc..c2d6472 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -427,7 +427,7 @@ static LIST_HEAD(oom_reaper_list);
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
 
-static bool __oom_reap_task(struct task_struct *tsk)
+static bool oom_reap_task(struct task_struct *tsk)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
@@ -504,42 +504,42 @@ out:
 	return ret;
 }
 
-#define MAX_OOM_REAP_RETRIES 10
-static void oom_reap_task(struct task_struct *tsk)
-{
-	int attempts = 0;
-
-	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
-		schedule_timeout_idle(HZ/10);
-
-	if (attempts > MAX_OOM_REAP_RETRIES) {
-		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-				task_pid_nr(tsk), tsk->comm);
-		debug_show_all_locks();
-	}
-
-	/* Drop a reference taken by wake_oom_reaper */
-	put_task_struct(tsk);
-}
-
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct task_struct *tsk = NULL;
-
+		struct task_struct *tsk;
+		struct task_struct *t;
+		LIST_HEAD(list);
+		int i;
+
 		wait_event_freezable(oom_reaper_wait,
 				     (!list_empty(&oom_reaper_list)));
 		spin_lock(&oom_reaper_lock);
-		if (!list_empty(&oom_reaper_list)) {
-			tsk = list_first_entry(&oom_reaper_list,
-					struct task_struct, oom_reaper_list);
-			list_del(&tsk->oom_reaper_list);
-		}
+		list_splice(&oom_reaper_list, &list);
+		INIT_LIST_HEAD(&oom_reaper_list);
 		spin_unlock(&oom_reaper_lock);
-
-		if (tsk)
-			oom_reap_task(tsk);
+		/* Retry the down_read_trylock(mmap_sem) a few times */
+		for (i = 0; i < 10; i++) {
+			list_for_each_entry_safe(tsk, t, &list,
+						 oom_reaper_list) {
+				if (!oom_reap_task(tsk))
+					continue;
+				list_del(&tsk->oom_reaper_list);
+				/* Drop a reference taken by wake_oom_reaper */
+				put_task_struct(tsk);
+			}
+			if (list_empty(&list))
+				break;
+			schedule_timeout_idle(HZ/10);
+		}
+		if (list_empty(&list))
+			continue;
+		list_for_each_entry(tsk, &list, oom_reaper_list) {
+			pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+				task_pid_nr(tsk), tsk->comm);
+			put_task_struct(tsk);
+		}
+		debug_show_all_locks();
 	}
 
 	return 0;
----------------------------------------

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing
@ 2016-02-04 10:49     ` Tetsuo Handa
  0 siblings, 0 replies; 112+ messages in thread
From: Tetsuo Handa @ 2016-02-04 10:49 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: rientjes, mgorman, oleg, torvalds, hughd, andrea, riel, linux-mm,
	linux-kernel, mhocko

Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> wake_oom_reaper has allowed only 1 oom victim to be queued. The main
> reason for that was the simplicity as other solutions would require
> some way of queuing. The current approach is racy and that was deemed
> sufficient as the oom_reaper is considered a best effort approach
> to help with oom handling when the OOM victim cannot terminate in a
> reasonable time. The race could lead to missing an oom victim which can
> get stuck
> 
> out_of_memory
>   wake_oom_reaper
>     cmpxchg // OK
>     			oom_reaper
> 			  oom_reap_task
> 			    __oom_reap_task
> oom_victim terminates
> 			      atomic_inc_not_zero // fail
> out_of_memory
>   wake_oom_reaper
>     cmpxchg // fails
> 			  task_to_reap = NULL
> 
> This race requires 2 OOM invocations in a short time period which is not
> very likely but certainly not impossible. E.g. the original victim might
> have not released a lot of memory for some reason.
> 
> The situation would improve considerably if wake_oom_reaper used a more
> robust queuing. This is what this patch implements. This means adding
> oom_reaper_list list_head into task_struct (eat a hole before embeded
> thread_struct for that purpose) and a oom_reaper_lock spinlock for
> queuing synchronization. wake_oom_reaper will then add the task on the
> queue and oom_reaper will dequeue it.
> 

I think we want to rewrite this patch's description from a different point
of view.

As of "[PATCH 1/5] mm, oom: introduce oom reaper", we assumed that we try to
manage OOM livelock caused by system-wide OOM events using the OOM reaper.
Therefore, the OOM reaper had high scheduling priority and we considered side
effect of the OOM reaper as a reasonable constraint.

But as the discussion went by, we started to try to manage OOM livelock
caused by non system-wide OOM events (e.g. memcg OOM) using the OOM reaper.
Therefore, the OOM reaper now has normal scheduling priority. For non
system-wide OOM events, side effect of the OOM reaper might not be a
reasonable constraint. Some administrator might expect that the OOM reaper
does not break coredumping unless the system is under system-wide OOM events.

The race described in this patch's description sounds as if 2 OOM invocations
are by system-wide OOM events. If we consider only system-wide OOM events,
there is no need to keep task_to_reap != NULL after the OOM reaper found
a task to reap (shown below) because existing victim will prevent the OOM
killer from calling wake_oom_reaper().

----------------------------------------
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 012dd6f..c919ddb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1835,9 +1835,6 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
-#ifdef CONFIG_MMU
-	struct list_head oom_reaper_list;
-#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b42c6bc..fa6a302 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -422,10 +422,8 @@ bool oom_killer_disabled __read_mostly;
  * victim (if that is possible) to help the OOM killer to move on.
  */
 static struct task_struct *oom_reaper_th;
+static struct task_struct *task_to_reap;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static LIST_HEAD(oom_reaper_list);
-static DEFINE_SPINLOCK(oom_reaper_lock);
-
 
 static bool __oom_reap_task(struct task_struct *tsk)
 {
@@ -526,20 +524,11 @@ static void oom_reap_task(struct task_struct *tsk)
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct task_struct *tsk = NULL;
+		struct task_struct *tsk;
 
 		wait_event_freezable(oom_reaper_wait,
-				     (!list_empty(&oom_reaper_list)));
-		spin_lock(&oom_reaper_lock);
-		if (!list_empty(&oom_reaper_list)) {
-			tsk = list_first_entry(&oom_reaper_list,
-					struct task_struct, oom_reaper_list);
-			list_del(&tsk->oom_reaper_list);
-		}
-		spin_unlock(&oom_reaper_lock);
-
-		if (tsk)
-			oom_reap_task(tsk);
+				     (tsk = xchg(&task_to_reap, NULL)));
+		oom_reap_task(tsk);
 	}
 
 	return 0;
@@ -551,11 +540,11 @@ static void wake_oom_reaper(struct task_struct *tsk)
 		return;
 
 	get_task_struct(tsk);
-
-	spin_lock(&oom_reaper_lock);
-	list_add(&tsk->oom_reaper_list, &oom_reaper_list);
-	spin_unlock(&oom_reaper_lock);
-	wake_up(&oom_reaper_wait);
+	tsk = xchg(&task_to_reap, tsk);
+	if (!tsk)
+		wake_up(&oom_reaper_wait);
+	else
+		put_task_struct(tsk);
 }
 
 static int __init oom_init(void)
----------------------------------------

But if we consider non system-wide OOM events, it is not very unlikely to hit
this race. This queue is useful for situations where memcg1 and memcg2 hit
memcg OOM at the same time and victim1 in memcg1 cannot terminate immediately.

I expect parallel reaping (shown below) because there is no need to serialize
victim tasks (e.g. wait for reaping victim1 in memcg1 which can take up to
1 second to complete before start reaping victim2 in memcg2) if we implement
this queue.

----------------------------------------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b42c6bc..c2d6472 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -427,7 +427,7 @@ static LIST_HEAD(oom_reaper_list);
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
 
-static bool __oom_reap_task(struct task_struct *tsk)
+static bool oom_reap_task(struct task_struct *tsk)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
@@ -504,42 +504,42 @@ out:
 	return ret;
 }
 
-#define MAX_OOM_REAP_RETRIES 10
-static void oom_reap_task(struct task_struct *tsk)
-{
-	int attempts = 0;
-
-	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
-		schedule_timeout_idle(HZ/10);
-
-	if (attempts > MAX_OOM_REAP_RETRIES) {
-		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-				task_pid_nr(tsk), tsk->comm);
-		debug_show_all_locks();
-	}
-
-	/* Drop a reference taken by wake_oom_reaper */
-	put_task_struct(tsk);
-}
-
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct task_struct *tsk = NULL;
-
+		struct task_struct *tsk;
+		struct task_struct *t;
+		LIST_HEAD(list);
+		int i;
+
 		wait_event_freezable(oom_reaper_wait,
 				     (!list_empty(&oom_reaper_list)));
 		spin_lock(&oom_reaper_lock);
-		if (!list_empty(&oom_reaper_list)) {
-			tsk = list_first_entry(&oom_reaper_list,
-					struct task_struct, oom_reaper_list);
-			list_del(&tsk->oom_reaper_list);
-		}
+		list_splice(&oom_reaper_list, &list);
+		INIT_LIST_HEAD(&oom_reaper_list);
 		spin_unlock(&oom_reaper_lock);
-
-		if (tsk)
-			oom_reap_task(tsk);
+		/* Retry the down_read_trylock(mmap_sem) a few times */
+		for (i = 0; i < 10; i++) {
+			list_for_each_entry_safe(tsk, t, &list,
+						 oom_reaper_list) {
+				if (!oom_reap_task(tsk))
+					continue;
+				list_del(&tsk->oom_reaper_list);
+				/* Drop a reference taken by wake_oom_reaper */
+				put_task_struct(tsk);
+			}
+			if (list_empty(&list))
+				break;
+			schedule_timeout_idle(HZ/10);
+		}
+		if (list_empty(&list))
+			continue;
+		list_for_each_entry(tsk, &list, oom_reaper_list) {
+			pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+				task_pid_nr(tsk), tsk->comm);
+			put_task_struct(tsk);
+		}
+		debug_show_all_locks();
 	}
 
 	return 0;
----------------------------------------


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-03 13:13   ` Michal Hocko
@ 2016-02-04 14:22     ` Tetsuo Handa
  -1 siblings, 0 replies; 112+ messages in thread
From: Tetsuo Handa @ 2016-02-04 14:22 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: rientjes, mgorman, oleg, torvalds, hughd, andrea, riel, linux-mm,
	linux-kernel, mhocko

Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> When oom_reaper manages to unmap all the eligible vmas there shouldn't
> be much of the freeable memory held by the oom victim left anymore, so it
> makes sense to clear the TIF_MEMDIE flag for the victim and allow the
> OOM killer to select another task.

Just a confirmation. Is it safe to clear TIF_MEMDIE without reaching do_exit()
with regard to freezing_slow_path()? Since clearing TIF_MEMDIE from the OOM
reaper confuses

    wait_event(oom_victims_wait, !atomic_read(&oom_victims));

in oom_killer_disable(), I'm worried that the freezing operation continues
before the OOM victim which escaped the __refrigerator() actually releases
memory. Does this cause a consistency problem?
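
For reference, the disable path in mm/oom_kill.c currently looks roughly like
this (paraphrased, so details may differ):

----------
bool oom_killer_disable(void)
{
	/*
	 * Make sure not to race with an ongoing OOM killer and that the
	 * current task is not a victim itself.
	 */
	mutex_lock(&oom_lock);
	if (test_thread_flag(TIF_MEMDIE)) {
		mutex_unlock(&oom_lock);
		return false;
	}

	oom_killer_disabled = true;
	mutex_unlock(&oom_lock);

	/* Blocks until the last victim calls exit_oom_victim(). */
	wait_event(oom_victims_wait, !atomic_read(&oom_victims));

	return true;
}
----------

If the OOM reaper decrements oom_victims before the victim actually releases
its memory, this wait can return early.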

> +	/*
> +	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
> +	 * reasonably reclaimable memory anymore. OOM killer can continue
> +	 * by selecting other victim if unmapping hasn't led to any
> +	 * improvements. This also means that selecting this task doesn't
> +	 * make any sense.
> +	 */
> +	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
> +	exit_oom_victim(tsk);

I noticed that updating only one thread group's oom_score_adj disables
further wake_oom_reaper() calls due to the coarse-grained can_oom_reap check at

  p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN

in oom_kill_process(). I think we need to either update all thread groups'
oom_score_adj using the reaped mm equally or use a more fine-grained
can_oom_reap check which ignores OOM_SCORE_ADJ_MIN if all threads in that
thread group are dying or exiting.
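
A minimal sketch of the latter option (hypothetical and untested) could relax
the check in oom_kill_process() like this; the reproducer below then
demonstrates the current problem:

----------
		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN &&
		    !(fatal_signal_pending(p) || (p->flags & PF_EXITING))) {
			/*
			 * Only a live OOM_SCORE_ADJ_MIN user of this mm that
			 * is neither killed nor exiting makes the mm
			 * unreapable.
			 */
			can_oom_reap = false;
			continue;
		}
----------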

----------
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>

static int writer(void *unused)
{
	static char buffer[4096];
	int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);
	while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer));
	return 0;
}

int main(int argc, char *argv[])
{
	unsigned long size;
	char *buf = NULL;
	unsigned long i;
	if (fork() == 0) {
		/* Child: make itself the preferred OOM victim. */
		int fd = open("/proc/self/oom_score_adj", O_WRONLY);
		write(fd, "1000", 4);
		close(fd);
		/* Spawn tasks sharing the mm but in separate thread groups. */
		for (i = 0; i < 2; i++) {
			char *stack = malloc(4096);
			if (stack)
				clone(writer, stack + 4096, CLONE_VM, NULL);
		}
		writer(NULL);
		while (1)
			pause();
	}
	sleep(1);
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sleep(5);
	/* Will cause OOM due to overcommit */
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	pause();
	return 0;
}
----------

----------
[  177.722853] a.out invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[  177.724956] a.out cpuset=/ mems_allowed=0
[  177.725735] CPU: 3 PID: 3962 Comm: a.out Not tainted 4.5.0-rc2-next-20160204 #291
(...snipped...)
[  177.802889] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
(...snipped...)
[  177.872248] [ 3941]  1000  3941    28880      124      14       3        0             0 bash
[  177.874279] [ 3962]  1000  3962   541717   395780     784       6        0             0 a.out
[  177.876274] [ 3963]  1000  3963     1078       21       7       3        0          1000 a.out
[  177.878261] [ 3964]  1000  3964     1078       21       7       3        0          1000 a.out
[  177.880194] [ 3965]  1000  3965     1078       21       7       3        0          1000 a.out
[  177.882262] Out of memory: Kill process 3963 (a.out) score 998 or sacrifice child
[  177.884129] Killed process 3963 (a.out) total-vm:4312kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  177.887100] oom_reaper: reaped process :3963 (a.out) anon-rss:0kB, file-rss:0kB, shmem-rss:0lB
[  179.638399] crond invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
[  179.647708] crond cpuset=/ mems_allowed=0
[  179.652996] CPU: 3 PID: 742 Comm: crond Not tainted 4.5.0-rc2-next-20160204 #291
(...snipped...)
[  179.771311] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
(...snipped...)
[  179.836221] [ 3941]  1000  3941    28880      124      14       3        0             0 bash
[  179.838278] [ 3962]  1000  3962   541717   396308     785       6        0             0 a.out
[  179.840328] [ 3963]  1000  3963     1078        0       7       3        0         -1000 a.out
[  179.842443] [ 3965]  1000  3965     1078        0       7       3        0          1000 a.out
[  179.844557] Out of memory: Kill process 3965 (a.out) score 998 or sacrifice child
[  179.846404] Killed process 3965 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
----------

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-04 14:22     ` Tetsuo Handa
@ 2016-02-04 14:43       ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-04 14:43 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Thu 04-02-16 23:22:18, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > When oom_reaper manages to unmap all the eligible vmas there shouldn't
> > be much of the freeable memory held by the oom victim left anymore, so it
> > makes sense to clear the TIF_MEMDIE flag for the victim and allow the
> > OOM killer to select another task.
> 
> Just a confirmation. Is it safe to clear TIF_MEMDIE without reaching do_exit()
> with regard to freezing_slow_path()? Since clearing TIF_MEMDIE from the OOM
> reaper confuses
> 
>     wait_event(oom_victims_wait, !atomic_read(&oom_victims));
> 
> in oom_killer_disable(), I'm worried that the freezing operation continues
> before the OOM victim which escaped the __refrigerator() actually releases
> memory. Does this cause a consistency problem?

This is a good question! At first sight it seems this is not safe and we
might need to make the oom_reaper freezable so that it doesn't wake up
during suspend and interfere. Let me think about that.
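
A minimal sketch of that direction (untested; it assumes that simply joining
the freezer is enough) would be:

----------
static int oom_reaper(void *unused)
{
	/*
	 * Kernel threads are not freezable by default; opt in so that
	 * wait_event_freezable() actually parks this thread during
	 * suspend instead of letting it race with oom_killer_disable().
	 */
	set_freezable();

	while (true) {
		/* ... existing reaping loop ... */
	}
	return 0;
}
----------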

> > +	/*
> > +	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
> > +	 * reasonably reclaimable memory anymore. OOM killer can continue
> > +	 * by selecting other victim if unmapping hasn't led to any
> > +	 * improvements. This also means that selecting this task doesn't
> > +	 * make any sense.
> > +	 */
> > +	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
> > +	exit_oom_victim(tsk);
> 
> I noticed that updating only one thread group's oom_score_adj disables
> further wake_oom_reaper() calls due to the coarse-grained can_oom_reap check at
> 
>   p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
> 
> in oom_kill_process(). I think we need to either update all thread groups'
> oom_score_adj using the reaped mm equally or use a more fine-grained
> can_oom_reap check which ignores OOM_SCORE_ADJ_MIN if all threads in that
> thread group are dying or exiting.

I do not understand. Why would you want to reap the mm again when
this has been done already? The mm is shared, right?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing
  2016-02-04 10:49     ` Tetsuo Handa
@ 2016-02-04 14:53       ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-04 14:53 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Thu 04-02-16 19:49:29, Tetsuo Handa wrote:
[...]
> I think we want to rewrite this patch's description from a different point
> of view.
> 
> As of "[PATCH 1/5] mm, oom: introduce oom reaper", we assumed that we would
> manage OOM livelock caused by system-wide OOM events using the OOM reaper.
> Therefore, the OOM reaper had high scheduling priority and we considered the
> side effect of the OOM reaper a reasonable constraint.
> 
> But as the discussion went on, we started trying to manage OOM livelock
> caused by non system-wide OOM events (e.g. memcg OOM) using the OOM reaper.
> Therefore, the OOM reaper now has normal scheduling priority. For non
> system-wide OOM events, the side effect of the OOM reaper might not be a
> reasonable constraint. Some administrators might expect that the OOM reaper
> does not break coredumping unless the system is under a system-wide OOM event.

I am willing to discuss this as an option after we actually hear about a
_real_ usecase.

[...]

> But if we consider non system-wide OOM events, it is not unlikely that we
> will hit this race. This queue is useful for situations where memcg1 and
> memcg2 hit memcg OOM at the same time and victim1 in memcg1 cannot terminate
> immediately.

This can happen of course but the likelihood is _much_ smaller without
the global OOM because the memcg OOM killer is invoked from a lockless
context, so the oom context cannot block the victim from proceeding.

> I expect parallel reaping (shown below) because there is no need to serialize
> victim tasks (e.g. wait for reaping of victim1 in memcg1, which can take up to
> 1 second to complete, before starting to reap victim2 in memcg2) if we
> implement this queue.

I would really prefer to go a simpler way first and extend the code when
we see the current approach is insufficient for real life loads. Please do
not get me wrong: of course the code can be enhanced in many different
ways and optimized for lots of pathological cases, but I really believe
that we should start with correctness first and only later care about
optimizing corner cases. Realistically, who really cares about
oom_reaper acting on an Nth completely stuck task after N seconds?
For now, my target is to guarantee that oom_reaper will _eventually_
process a queued task and reap its memory if the target hasn't exited
yet. I do not think this is an unreasonable goal...

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-04 14:43       ` Michal Hocko
@ 2016-02-04 15:08         ` Tetsuo Handa
  -1 siblings, 0 replies; 112+ messages in thread
From: Tetsuo Handa @ 2016-02-04 15:08 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> > > +	/*
> > > +	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
> > > +	 * reasonably reclaimable memory anymore. OOM killer can continue
> > > +	 * by selecting other victim if unmapping hasn't led to any
> > > +	 * improvements. This also means that selecting this task doesn't
> > > +	 * make any sense.
> > > +	 */
> > > +	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
> > > +	exit_oom_victim(tsk);
> > 
> > I noticed that updating only one thread group's oom_score_adj disables
> > further wake_oom_reaper() calls due to rough-grained can_oom_reap check at
> > 
> >   p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
> > 
> > in oom_kill_process(). I think we need to either update all thread groups'
> > oom_score_adj using the reaped mm equally or use more fine-grained can_oom_reap
> > check which ignores OOM_SCORE_ADJ_MIN if all threads in that thread group are
> > dying or exiting.
> 
> I do not understand. Why would you want to reap the mm again when
> this has been done already? The mm is shared, right?

The mm is shared between the previous victim and the next victim, but these
victims are in different thread groups. The OOM killer selects a next victim
whose mm was already reaped due to sharing the previous victim's memory. We
don't want the OOM killer to select such a next victim. Maybe set
MMF_OOM_REAP_DONE on the previous victim's mm and check it instead of
TIF_MEMDIE when selecting a victim? That would also avoid the problems caused
by clearing TIF_MEMDIE.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-04 15:08         ` Tetsuo Handa
@ 2016-02-04 16:31           ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-04 16:31 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Fri 05-02-16 00:08:25, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > > +	/*
> > > > +	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
> > > > +	 * reasonably reclaimable memory anymore. OOM killer can continue
> > > > +	 * by selecting other victim if unmapping hasn't led to any
> > > > +	 * improvements. This also means that selecting this task doesn't
> > > > +	 * make any sense.
> > > > +	 */
> > > > +	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
> > > > +	exit_oom_victim(tsk);
> > > 
> > > I noticed that updating only one thread group's oom_score_adj disables
> > > further wake_oom_reaper() calls due to rough-grained can_oom_reap check at
> > > 
> > >   p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
> > > 
> > > in oom_kill_process(). I think we need to either update all thread groups'
> > > oom_score_adj using the reaped mm equally or use more fine-grained can_oom_reap
> > > check which ignores OOM_SCORE_ADJ_MIN if all threads in that thread group are
> > > dying or exiting.
> > 
> > I do not understand. Why would you want to reap the mm again when
> > this has been done already? The mm is shared, right?
> 
> The mm is shared between the previous victim and the next victim, but these
> victims are in different thread groups. The OOM killer selects a next victim
> whose mm was already reaped due to sharing the previous victim's memory.

OK, now I got your point. From your previous email it sounded like you
were talking about oom_reaper and its invocation, which was confusing.

> We don't want the OOM killer to select such a next victim.

Yes, selecting such a task doesn't make much sense. It has been killed
so it has fatal_signal_pending. If it wanted to allocate it would get
TIF_MEMDIE already, and its address space has been reaped so there is
nothing left to free. This CLONE_VM without CLONE_SIGHAND is a really
crazy combo; it is just causing troubles all over and I am not convinced
it is actually that helpful </rant>.


> Maybe set MMF_OOM_REAP_DONE on the previous victim's mm and check it
> instead of TIF_MEMDIE when selecting a victim? That would also avoid the
> problems caused by clearing TIF_MEMDIE.

Hmm, it doesn't seem we are under MMF_ available-bit pressure right now,
so using the flag sounds like the easiest way to go. Then we do not even
have to play with OOM_SCORE_ADJ_MIN, which might be updated from userspace
after the oom reaper has done that. Care to send a patch?

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 4/5] mm, oom_reaper: report success/failure
  2016-02-04  6:46       ` Michal Hocko
@ 2016-02-04 22:31         ` David Rientjes
  -1 siblings, 0 replies; 112+ messages in thread
From: David Rientjes @ 2016-02-04 22:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Thu, 4 Feb 2016, Michal Hocko wrote:

> > I think it would be helpful to show anon-rss after reaping, however, so we 
> > can compare to the previous anon-rss that was reported.  And, I agree that 
> > leaving behind a message in the kernel log that reaping has been 
> > successful is worthwhile.  So this line should just show what anon-rss is 
> > after reaping and make it clear that this is not the memory reaped.
> 
> Does
> "oom_reaper: reaped process %d (%s) current memory anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB "
> 
> sound any better?

oom_reaper: reaped process %d (%s), now anon-rss:%lukB

would probably be better until additional support is added for kinds of 
reaping other than just primarily heap.  This should help to quantify the 
exact amount of memory that could be reaped (or otherwise unmapped) iff 
oom_reaper has to get involved, rather than fluctuations that have nothing 
to do with it.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 4/5] mm, oom_reaper: report success/failure
  2016-02-04 22:31         ` David Rientjes
@ 2016-02-05  9:26           ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-05  9:26 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Thu 04-02-16 14:31:26, David Rientjes wrote:
> On Thu, 4 Feb 2016, Michal Hocko wrote:
> 
> > > I think it would be helpful to show anon-rss after reaping, however, so we 
> > > can compare to the previous anon-rss that was reported.  And, I agree that 
> > > leaving behind a message in the kernel log that reaping has been 
> > > successful is worthwhile.  So this line should just show what anon-rss is 
> > > after reaping and make it clear that this is not the memory reaped.
> > 
> > Does
> > "oom_reaper: reaped process %d (%s) current memory anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB "
> > 
> > sound any better?
> 
> oom_reaper: reaped process %d (%s), now anon-rss:%lukB
> 
> would probably be better until additional support is added to do other 
> kinds of reaping other than just primarily heap.  This should help to 
> quantify the exact amount of memory that could be reaped (or otherwise 
> unmapped) iff oom_reaper has to get involved rather than fluctations that 
> have nothing to do with it.

---
From 402090df64de7f80d7d045b0b17e860220837fa6 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 5 Feb 2016 10:24:23 +0100
Subject: [PATCH] mm-oom_reaper-report-success-failure-fix

update the log message to be more specific

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 87d644c97ac9..ca61e6cfae52 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -479,7 +479,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 		}
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
-	pr_info("oom_reaper: reaped process :%d (%s) anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lulB\n",
+	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lulB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
 			K(get_mm_counter(mm, MM_FILEPAGES)),
-- 
2.7.0


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-04 16:31           ` Michal Hocko
@ 2016-02-05 11:14             ` Tetsuo Handa
  -1 siblings, 0 replies; 112+ messages in thread
From: Tetsuo Handa @ 2016-02-05 11:14 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> On Fri 05-02-16 00:08:25, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > > +	/*
> > > > > +	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
> > > > > +	 * reasonably reclaimable memory anymore. OOM killer can continue
> > > > > +	 * by selecting other victim if unmapping hasn't led to any
> > > > > +	 * improvements. This also means that selecting this task doesn't
> > > > > +	 * make any sense.
> > > > > +	 */
> > > > > +	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
> > > > > +	exit_oom_victim(tsk);
> > > >
> > > > I noticed that updating only one thread group's oom_score_adj disables
> > > > further wake_oom_reaper() calls due to the coarse-grained can_oom_reap check at
> > > >
> > > >   p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
> > > >
> > > > in oom_kill_process(). I think we need to either update all thread groups'
> > > > oom_score_adj using the reaped mm equally or use a more fine-grained
> > > > can_oom_reap check which ignores OOM_SCORE_ADJ_MIN if all threads in that
> > > > thread group are dying or exiting.
> > >
> > > I do not understand. Why would you want to reap the mm again when
> > > this has been done already? The mm is shared, right?
> >
> > The mm is shared between the previous victim and the next victim, but these
> > victims are in different thread groups. The OOM killer selects a next victim
> > whose mm was already reaped due to sharing the previous victim's memory.
>
> OK, now I got your point. From your previous email it sounded like you
> were talking about oom_reaper and its invocation, which was confusing.
>
> > We don't want the OOM killer to select such a next victim.
>
> Yes, selecting such a task doesn't make much sense. It has been killed
> so it has fatal_signal_pending. If it wanted to allocate it would get
> TIF_MEMDIE already, and its address space has been reaped so there is
> nothing left to free. This CLONE_VM without CLONE_SIGHAND is a really
> crazy combo; it is just causing troubles all over and I am not convinced
> it is actually that helpful </rant>.
>

I think moving the "whether a mm is reapable or not" check to the OOM reaper
is preferable (shown below). In most cases, mm_is_reapable() will return
true.

----------------------------------------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b42c6bc..fc114b3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -426,6 +426,39 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static LIST_HEAD(oom_reaper_list);
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
+static bool mm_is_reapable(struct mm_struct *mm)
+{
+	struct task_struct *g;
+	struct task_struct *p;
+
+	/*
+	 * Since it is possible that p voluntarily called do_exit() or
+	 * somebody other than the OOM killer sent SIGKILL on p, this mm used
+	 * by p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN is reapable if p
+	 * has pending SIGKILL or already reached do_exit().
+	 *
+	 * On the other hand, it is possible that mark_oom_victim(p) is called
+	 * without sending SIGKILL to all tasks using this mm. In this case,
+	 * the OOM reaper cannot reap this mm unless p is the only task using
+	 * this mm.
+	 *
+	 * Therefore, determine whether this mm is reapable by testing whether
+	 * all tasks using this mm are dying or already exiting rather than
+	 * depending on p->signal->oom_score_adj value which is updated by the
+	 * OOM reaper.
+	 */
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		if (mm != READ_ONCE(p->mm) ||
+		    fatal_signal_pending(p) || (p->flags & PF_EXITING))
+			continue;
+		mm = NULL;
+		goto out;
+	}
+ out:
+	rcu_read_unlock();
+	return mm != NULL;
+}
 
 static bool __oom_reap_task(struct task_struct *tsk)
 {
@@ -455,7 +488,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 
 	task_unlock(p);
 
-	if (!down_read_trylock(&mm->mmap_sem)) {
+	if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
 		goto out;
 	}
@@ -596,6 +629,7 @@ void mark_oom_victim(struct task_struct *tsk)
 	 */
 	__thaw_task(tsk);
 	atomic_inc(&oom_victims);
+	wake_oom_reaper(tsk);
 }
 
 /**
@@ -680,7 +714,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -771,23 +804,17 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-			/*
-			 * We cannot use oom_reaper for the mm shared by this
-			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
-			 */
-			can_oom_reap = false;
+		if (unlikely(p->flags & PF_KTHREAD))
 			continue;
-		}
+		if (is_global_init(p))
+			continue;
+		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+			continue;
+
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
-
 	mmdrop(mm);
 	put_task_struct(victim);
 }
----------------------------------------

Then, I think we need to kill two lies in the allocation retry loop.

The first lie is that we pretend as if we are making forward progress
without hitting the OOM killer. This lie prevents any OOM victim (with
SIGKILL and without TIF_MEMDIE) doing !__GFP_FS allocations from hitting
the OOM killer, which can prevent the OOM victim tasks from reaching
do_exit(), and which conflicts with the assumption

  * That thread will now get access to memory reserves since it has a
  * pending fatal signal.

in oom_kill_process(), and with similar assertions such as the one in this
patch's description.

  The lack of TIF_MEMDIE also means that the victim cannot access memory
  reserves anymore but that shouldn't be a problem because it would get
  the access again if it needs to allocate and hits the OOM killer again
  due to the fatal_signal_pending resp. PF_EXITING check.

If the callers of a !__GFP_FS allocation do not need to loop until somebody
else reclaims memory on their behalf, they can add __GFP_NORETRY (see the
hypothetical example after the diff below). Otherwise, the callers of a
!__GFP_FS allocation can and should call out_of_memory(). That's all we
need to kill this lie.

----------------------------------------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1668159..67591a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2772,20 +2772,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		/* The OOM killer does not needlessly kill tasks for lowmem */
 		if (ac->high_zoneidx < ZONE_NORMAL)
 			goto out;
-		/* The OOM killer does not compensate for IO-less reclaim */
-		if (!(gfp_mask & __GFP_FS)) {
-			/*
-			 * XXX: Page reclaim didn't yield anything,
-			 * and the OOM killer can't be invoked, but
-			 * keep looping as per tradition.
-			 *
-			 * But do not keep looping if oom_killer_disable()
-			 * was already called, for the system is trying to
-			 * enter a quiescent state during suspend.
-			 */
-			*did_some_progress = !oom_killer_disabled;
-			goto out;
-		}
 		if (pm_suspended_storage())
 			goto out;
 		/* The OOM killer may not free memory on a specific node */
----------------------------------------
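
For illustration, a !__GFP_FS caller that does not need such looping could
opt out explicitly (hypothetical example; alloc_page() and GFP_NOFS are just
stand-ins for whatever the caller actually uses):

----------
	/*
	 * Prefer immediate failure over looping in the allocator waiting
	 * for somebody else to make progress on our behalf.
	 */
	struct page *page = alloc_page(GFP_NOFS | __GFP_NORETRY);

	if (!page)
		return -ENOMEM;
----------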

The second lie is that we pretend as if we are making forward progress
without taking any action when the OOM killer finds a TIF_MEMDIE task.
This lie prevents any task (without SIGKILL and without TIF_MEMDIE)
which is blocking the OOM victim (which might be looping without getting
TIF_MEMDIE because it is doing !__GFP_FS allocations) from making forward
progress if the OOM reaper does not clear TIF_MEMDIE.

----------------------------------------
	/* Retry as long as the OOM killer is making progress */
	if (did_some_progress) {
		no_progress_loops = 0;
		goto retry;
	}
----------------------------------------
        v.s.
----------------------------------------
	/*
	 * This task already has access to memory reserves and is being killed.
	 * Don't allow any other task to have access to the reserves.
	 */
	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
		if (!is_sysrq_oom(oc))
			return OOM_SCAN_ABORT;
	}
----------------------------------------

By moving the "whether a mm is reapable or not" check to the OOM reaper, we
can delegate the duty of clearing TIF_MEMDIE to the OOM reaper because the
OOM reaper is tracking all TIF_MEMDIE tasks. Since mm_is_reapable() will
return true in most situations, having to clear TIF_MEMDIE and prevent the
OOM killer from setting TIF_MEMDIE on the same task again after the OOM
reaper gave up becomes an unlikely corner case. Like you commented in
[PATCH 5/5], falling back to a simple timer would be sufficient for handling
such corner cases; a sketch follows the quote below.

| I would really prefer to go a simpler way first and extend the code when
| we see the current approach is insufficient for real life loads. Please do
| not get me wrong: of course the code can be enhanced in many different
| ways and optimized for lots of pathological cases, but I really believe
| that we should start with correctness first and only later care about
| optimizing corner cases.
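
A minimal sketch of such a timer fallback (hypothetical and untested) could
extend oom_reap_task() like this:

----------
	if (attempts > MAX_OOM_REAP_RETRIES) {
		/*
		 * Give the stuck victim a grace period, then clear
		 * TIF_MEMDIE anyway so that the OOM killer can pick
		 * another victim instead of livelocking on this one.
		 */
		schedule_timeout_idle(10 * HZ);
		exit_oom_victim(tsk);
	}
----------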

>
> > Maybe set MMF_OOM_REAP_DONE on
> > the previous victim's mm and check it instead of TIF_MEMDIE when selecting
> > a victim? That will also avoid problems caused by clearing TIF_MEMDIE?
>
> Hmm, it doesn't seem we are under MMF_ availabel bits pressure right now
> so using the flag sounds like the easiest way to go. Then we even do not
> have to play with OOM_SCORE_ADJ_MIN which might be updated from the
> userspace after the oom reaper has done that. Care to send a patch?

Not only do we not need to worry about ->oom_score_adj being modified from
outside the SIGKILL-pending tasks, I think we also don't need to clear a
remote task's TIF_MEMDIE if we use MMF_OOM_REAP_DONE. Something like the
untested patch below?

----------------------------------------
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 45993b8..03e6257 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -91,7 +91,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 
 extern bool out_of_memory(struct oom_control *oc);
 
-extern void exit_oom_victim(struct task_struct *tsk);
+extern void exit_oom_victim(void);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 012dd6f..442ba46 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -515,6 +515,8 @@ static inline int get_dumpable(struct mm_struct *mm)
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
+#define MMF_OOM_REAP_DONE       21      /* set when OOM reap completed */
+
 struct sighand_struct {
 	atomic_t		count;
 	struct k_sigaction	action[_NSIG];
diff --git a/kernel/exit.c b/kernel/exit.c
index ba3bd29..10e0882 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -434,7 +434,7 @@ static void exit_mm(struct task_struct *tsk)
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim(tsk);
+		exit_oom_victim();
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b42c6bc..b67d8bf 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -271,19 +271,24 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc,
 enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 			struct task_struct *task, unsigned long totalpages)
 {
+	struct mm_struct *mm;
+
 	if (oom_unkillable_task(task, NULL, oc->nodemask))
 		return OOM_SCAN_CONTINUE;
 
+	mm = task->mm;
+	if (!mm)
+		return OOM_SCAN_CONTINUE;
 	/*
 	 * This task already has access to memory reserves and is being killed.
-	 * Don't allow any other task to have access to the reserves.
+	 * Don't allow any other task to have access to the reserves unless
+	 * this task's memory was OOM reaped.
 	 */
 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-		if (!is_sysrq_oom(oc))
+		if (!is_sysrq_oom(oc) &&
+		    !test_bit(MMF_OOM_REAP_DONE, &mm->flags))
 			return OOM_SCAN_ABORT;
 	}
-	if (!task->mm)
-		return OOM_SCAN_CONTINUE;
 
 	/*
 	 * If task is allocating a lot of memory and has been marked to be
@@ -448,7 +453,8 @@ static bool __oom_reap_task(struct task_struct *tsk)
 		return true;
 
 	mm = p->mm;
-	if (!atomic_inc_not_zero(&mm->mm_users)) {
+	if (test_bit(MMF_OOM_REAP_DONE, &mm->flags) ||
+	    !atomic_inc_not_zero(&mm->mm_users)) {
 		task_unlock(p);
 		return true;
 	}
@@ -491,14 +497,11 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	up_read(&mm->mmap_sem);
 
 	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore. OOM killer can continue
-	 * by selecting other victim if unmapping hasn't led to any
-	 * improvements. This also means that selecting this task doesn't
-	 * make any sense.
+	 * Set MMF_OOM_REAP_DONE on this mm so that OOM killer can continue
+	 * by selecting other victim which does not use this mm if unmapping
+	 * this mm hasn't led to any improvements.
 	 */
-	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
-	exit_oom_victim(tsk);
+	set_bit(MMF_OOM_REAP_DONE, &mm->flags);
 out:
 	mmput(mm);
 	return ret;
@@ -601,10 +604,9 @@ void mark_oom_victim(struct task_struct *tsk)
 /**
  * exit_oom_victim - note the exit of an OOM victim
  */
-void exit_oom_victim(struct task_struct *tsk)
+void exit_oom_victim(void)
 {
-	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
-		return;
+	clear_thread_flag(TIF_MEMDIE);
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
----------------------------------------

I suggested many changes in this post because [PATCH 3/5] and [PATCH 5/5]
made it possible to simplify [PATCH 1/5] like in the older versions. I think
you will want to rebuild this series with these changes merged as appropriate.

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
@ 2016-02-05 11:14             ` Tetsuo Handa
  0 siblings, 0 replies; 112+ messages in thread
From: Tetsuo Handa @ 2016-02-05 11:14 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> On Fri 05-02-16 00:08:25, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > > +	/*
> > > > > +	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
> > > > > +	 * reasonably reclaimable memory anymore. OOM killer can continue
> > > > > +	 * by selecting other victim if unmapping hasn't led to any
> > > > > +	 * improvements. This also means that selecting this task doesn't
> > > > > +	 * make any sense.
> > > > > +	 */
> > > > > +	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
> > > > > +	exit_oom_victim(tsk);
> > > >
> > > > I noticed that updating only one thread group's oom_score_adj disables
> > > > further wake_oom_reaper() calls due to rough-grained can_oom_reap check at
> > > >
> > > >   p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
> > > >
> > > > in oom_kill_process(). I think we need to either update all thread groups'
> > > > oom_score_adj using the reaped mm equally or use more fine-grained can_oom_reap
> > > > check which ignores OOM_SCORE_ADJ_MIN if all threads in that thread group are
> > > > dying or exiting.
> > >
> > > I do not understand. Why would you want to reap the mm again when
> > > this has been done already? The mm is shared, right?
> >
> > The mm is shared between previous victim and next victim, but these victims
> > are in different thread groups. The OOM killer selects next victim whose mm
> > was already reaped due to sharing previous victim's memory.
>
> OK, now I got your point. From your previous email it sounded like you
> were talking about oom_reaper and its invocation which is was confusing.
>
> > We don't want the OOM killer to select such next victim.
>
> Yes, selecting such a task doesn't make much sense. It has been killed
> so it has fatal_signal_pending. If it wanted to allocate it would get
> TIF_MEMDIE already and its address space has been reaped so there is
> nothing to free left. This CLONE_VM without CLONE_SIGHAND is a really
> crazy combo, it is just causing troubles all over and I am not convinced
> it is actually that helpful </rant>.
>
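(For reference, a minimal userspace sketch of that CLONE_VM without
CLONE_SIGHAND combo; this is only an illustration, not part of any patch in
this thread, and all names in it are made up.)

----------------------------------------
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

/* Child entry point: a separate thread group sharing the parent's mm. */
static int child_fn(void *arg)
{
	pause();
	return 0;
}

int main(void)
{
	const size_t stack_size = 1024 * 1024;
	char *stack = malloc(stack_size);
	pid_t pid;

	if (!stack)
		return 1;
	/* CLONE_VM without CLONE_SIGHAND/CLONE_THREAD: the child shares
	 * the parent's mm but has its own signal handling and thread
	 * group, so killing one process does not kill the other user of
	 * the shared address space. */
	pid = clone(child_fn, stack + stack_size, CLONE_VM | SIGCHLD, NULL);
	if (pid < 0)
		return 1;
	kill(pid, SIGKILL);		/* the parent keeps running */
	waitpid(pid, NULL, 0);
	free(stack);
	return 0;
}
----------------------------------------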

I think moving the "whether a mm is reapable or not" check to the OOM
reaper is preferable (shown below). In most cases, mm_is_reapable() will
return true.

----------------------------------------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b42c6bc..fc114b3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -426,6 +426,39 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static LIST_HEAD(oom_reaper_list);
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
+static bool mm_is_reapable(struct mm_struct *mm)
+{
+	struct task_struct *g;
+	struct task_struct *p;
+
+	/*
+	 * Since it is possible that p voluntarily called do_exit() or
+	 * somebody other than the OOM killer sent SIGKILL to p, this mm, used
+	 * by a task whose p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN, is
+	 * reapable if p has a pending SIGKILL or has already reached do_exit().
+	 *
+	 * On the other hand, it is possible that mark_oom_victim(p) is called
+	 * without sending SIGKILL to all tasks using this mm. In this case,
+	 * the OOM reaper cannot reap this mm unless p is the only task using
+	 * this mm.
+	 *
+	 * Therefore, determine whether this mm is reapable by testing whether
+	 * all tasks using this mm are dying or already exiting rather than
+	 * depending on p->signal->oom_score_adj value which is updated by the
+	 * OOM reaper.
+	 */
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		if (mm != READ_ONCE(p->mm) ||
+		    fatal_signal_pending(p) || (p->flags & PF_EXITING))
+			continue;
+		mm = NULL;
+		goto out;
+	}
+ out:
+	rcu_read_unlock();
+	return mm != NULL;
+}
 
 static bool __oom_reap_task(struct task_struct *tsk)
 {
@@ -455,7 +488,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 
 	task_unlock(p);
 
-	if (!down_read_trylock(&mm->mmap_sem)) {
+	if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
 		goto out;
 	}
@@ -596,6 +629,7 @@ void mark_oom_victim(struct task_struct *tsk)
 	 */
 	__thaw_task(tsk);
 	atomic_inc(&oom_victims);
+	wake_oom_reaper(tsk);
 }
 
 /**
@@ -680,7 +714,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -771,23 +804,17 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-			/*
-			 * We cannot use oom_reaper for the mm shared by this
-			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
-			 */
-			can_oom_reap = false;
+		if (unlikely(p->flags & PF_KTHREAD))
 			continue;
-		}
+		if (is_global_init(p))
+			continue;
+		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+			continue;
+
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
-
 	mmdrop(mm);
 	put_task_struct(victim);
 }
----------------------------------------

Then, I think we need to kill two lies in the allocation retry loop.

The first lie is that we pretend as if we were making forward progress
without hitting the OOM killer. This lie prevents any OOM victim
(with SIGKILL and without TIF_MEMDIE) doing !__GFP_FS allocations from
hitting the OOM killer, which can prevent the OOM victim tasks from
reaching do_exit(), and which conflicts with the assumption

  * That thread will now get access to memory reserves since it has a
  * pending fatal signal.

in oom_kill_process() and with similar assertions such as this patch's
description.

  The lack of TIF_MEMDIE also means that the victim cannot access memory
  reserves anymore but that shouldn't be a problem because it would get
  the access again if it needs to allocate and hits the OOM killer again
  due to the fatal_signal_pending resp. PF_EXITING check.

If the callers of a !__GFP_FS allocation do not need to loop until somebody
else reclaims memory on their behalf, they can add __GFP_NORETRY. Otherwise,
the callers of a !__GFP_FS allocation can and should call out_of_memory().
That's all we need for killing this lie.
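(A minimal sketch of the caller-side __GFP_NORETRY alternative described
above; my_alloc_buffer() is a hypothetical caller, not a function from the
kernel tree.)

----------------------------------------
#include <linux/gfp.h>
#include <linux/slab.h>

static void *my_alloc_buffer(void)
{
	/*
	 * GFP_NOFS lacks __GFP_FS, so the allocator cannot invoke the
	 * OOM killer here; __GFP_NORETRY makes it fail fast instead of
	 * looping until somebody else reclaims memory on this caller's
	 * behalf.
	 */
	return kmalloc(PAGE_SIZE, GFP_NOFS | __GFP_NORETRY);
}
----------------------------------------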

----------------------------------------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1668159..67591a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2772,20 +2772,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		/* The OOM killer does not needlessly kill tasks for lowmem */
 		if (ac->high_zoneidx < ZONE_NORMAL)
 			goto out;
-		/* The OOM killer does not compensate for IO-less reclaim */
-		if (!(gfp_mask & __GFP_FS)) {
-			/*
-			 * XXX: Page reclaim didn't yield anything,
-			 * and the OOM killer can't be invoked, but
-			 * keep looping as per tradition.
-			 *
-			 * But do not keep looping if oom_killer_disable()
-			 * was already called, for the system is trying to
-			 * enter a quiescent state during suspend.
-			 */
-			*did_some_progress = !oom_killer_disabled;
-			goto out;
-		}
 		if (pm_suspended_storage())
 			goto out;
 		/* The OOM killer may not free memory on a specific node */
----------------------------------------

The second lie is that we pretend as if we were making forward progress
without taking any action when the OOM killer finds a TIF_MEMDIE task.
This lie prevents any task (without SIGKILL and without TIF_MEMDIE)
which is blocking the OOM victim (which might be looping without getting
TIF_MEMDIE because it is doing !__GFP_FS allocations) from making forward
progress if the OOM reaper does not clear TIF_MEMDIE.

----------------------------------------
	/* Retry as long as the OOM killer is making progress */
	if (did_some_progress) {
		no_progress_loops = 0;
		goto retry;
	}
----------------------------------------
        v.s.
----------------------------------------
	/*
	 * This task already has access to memory reserves and is being killed.
	 * Don't allow any other task to have access to the reserves.
	 */
	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
		if (!is_sysrq_oom(oc))
			return OOM_SCAN_ABORT;
	}
----------------------------------------

By moving the "whether a mm is reapable or not" check to the OOM reaper, we
can delegate the duty of clearing TIF_MEMDIE to the OOM reaper because the
OOM reaper is tracking all TIF_MEMDIE tasks. Since mm_is_reapable() returns
true in most situations, it becomes an unlikely corner case that we need to
clear TIF_MEMDIE and prevent the OOM killer from setting TIF_MEMDIE on the
same task again when the OOM reaper gives up. Like you commented in
[PATCH 5/5], falling back to a simple timer would be sufficient for handling
such corner cases.

| I would really prefer to go a simpler way first and extend the code when
| we see the current approach insufficient for real life loads. Please do
| not get me wrong, of course the code can be enhanced in many different
| ways and optimize for lots of pathological cases but I really believe
| that we should start with correctness first and only later care about
| optimizing corner cases.

>
> > Maybe set MMF_OOM_REAP_DONE on
> > the previous victim's mm and check it instead of TIF_MEMDIE when selecting
> > a victim? That will also avoid problems caused by clearing TIF_MEMDIE?
>
> Hmm, it doesn't seem we are under MMF_ available bits pressure right now
> so using the flag sounds like the easiest way to go. Then we don't even
> have to play with OOM_SCORE_ADJ_MIN, which might be updated from
> userspace after the oom reaper has done that. Care to send a patch?

Not only do we not need to worry about ->oom_score_adj being modified from
outside the SIGKILL-pending tasks, I think we also don't need to clear remote
TIF_MEMDIE if we use MMF_OOM_REAP_DONE. Something like the untested patch below?

----------------------------------------
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 45993b8..03e6257 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -91,7 +91,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 
 extern bool out_of_memory(struct oom_control *oc);
 
-extern void exit_oom_victim(struct task_struct *tsk);
+extern void exit_oom_victim(void);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 012dd6f..442ba46 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -515,6 +515,8 @@ static inline int get_dumpable(struct mm_struct *mm)
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
+#define MMF_OOM_REAP_DONE       21      /* set when OOM reap completed */
+
 struct sighand_struct {
 	atomic_t		count;
 	struct k_sigaction	action[_NSIG];
diff --git a/kernel/exit.c b/kernel/exit.c
index ba3bd29..10e0882 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -434,7 +434,7 @@ static void exit_mm(struct task_struct *tsk)
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim(tsk);
+		exit_oom_victim();
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b42c6bc..b67d8bf 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -271,19 +271,24 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc,
 enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 			struct task_struct *task, unsigned long totalpages)
 {
+	struct mm_struct *mm;
+
 	if (oom_unkillable_task(task, NULL, oc->nodemask))
 		return OOM_SCAN_CONTINUE;
 
+	mm = task->mm;
+	if (!mm)
+		return OOM_SCAN_CONTINUE;
 	/*
 	 * This task already has access to memory reserves and is being killed.
-	 * Don't allow any other task to have access to the reserves.
+	 * Don't allow any other task to have access to the reserves unless
+	 * this task's memory was OOM reaped.
 	 */
 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-		if (!is_sysrq_oom(oc))
+		if (!is_sysrq_oom(oc) &&
+		    !test_bit(MMF_OOM_REAP_DONE, &mm->flags))
 			return OOM_SCAN_ABORT;
 	}
-	if (!task->mm)
-		return OOM_SCAN_CONTINUE;
 
 	/*
 	 * If task is allocating a lot of memory and has been marked to be
@@ -448,7 +453,8 @@ static bool __oom_reap_task(struct task_struct *tsk)
 		return true;
 
 	mm = p->mm;
-	if (!atomic_inc_not_zero(&mm->mm_users)) {
+	if (test_bit(MMF_OOM_REAP_DONE, &mm->flags) ||
+	    !atomic_inc_not_zero(&mm->mm_users)) {
 		task_unlock(p);
 		return true;
 	}
@@ -491,14 +497,11 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	up_read(&mm->mmap_sem);
 
 	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore. OOM killer can continue
-	 * by selecting other victim if unmapping hasn't led to any
-	 * improvements. This also means that selecting this task doesn't
-	 * make any sense.
+	 * Set MMF_OOM_REAP_DONE on this mm so that OOM killer can continue
+	 * by selecting other victim which does not use this mm if unmapping
+	 * this mm hasn't led to any improvements.
 	 */
-	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
-	exit_oom_victim(tsk);
+	set_bit(MMF_OOM_REAP_DONE, &mm->flags);
 out:
 	mmput(mm);
 	return ret;
@@ -601,10 +604,9 @@ void mark_oom_victim(struct task_struct *tsk)
 /**
  * exit_oom_victim - note the exit of an OOM victim
  */
-void exit_oom_victim(struct task_struct *tsk)
+void exit_oom_victim(void)
 {
-	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
-		return;
+	clear_thread_flag(TIF_MEMDIE);
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
----------------------------------------

I suggested many changes in this post because [PATCH 3/5] and [PATCH 5/5]
made it possible for us to simplify [PATCH 1/5] back to something like the
old versions. I think you will want to rebuild this series with these
changes merged as appropriate.

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing
  2016-02-04 14:53       ` Michal Hocko
@ 2016-02-06  5:54         ` Tetsuo Handa
  -1 siblings, 0 replies; 112+ messages in thread
From: Tetsuo Handa @ 2016-02-06  5:54 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> > But if we consider non system-wide OOM events, it is not very unlikely to hit
> > this race. This queue is useful for situations where memcg1 and memcg2 hit
> > memcg OOM at the same time and victim1 in memcg1 cannot terminate immediately.
> 
> This can happen of course but the likelihood is _much_ smaller without
> the global OOM because the memcg OOM killer is invoked from a lockless
> context so the oom context cannot block the victim to proceed.

Suppose mem_cgroup_out_of_memory() is called from a lockless context via
mem_cgroup_oom_synchronize() called from pagefault_out_of_memory(); that
"lockless" refers only to the current thread, doesn't it?

Since oom_kill_process() sets TIF_MEMDIE on the first mm!=NULL thread of a
victim process, it is possible that a non-first mm!=NULL thread triggers
pagefault_out_of_memory() while the first mm!=NULL thread gets TIF_MEMDIE,
isn't it?

Then, where is the guarantee that victim1 (the first mm!=NULL thread in
memcg1 which got TIF_MEMDIE) is not waiting at down_read(&victim2->mm->mmap_sem)
while victim2 (the first mm!=NULL thread in memcg2 which got TIF_MEMDIE) is
waiting at down_write(&victim2->mm->mmap_sem), or that victim1 and victim2
are not both waiting on a lock somewhere in the memory reclaim path (e.g.
mutex_lock(&inode->i_mutex))?

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 4/5] mm, oom_reaper: report success/failure
  2016-02-05  9:26           ` Michal Hocko
@ 2016-02-06  6:34             ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-06  6:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Fri 05-02-16 10:26:40, Michal Hocko wrote:
[...]
> From 402090df64de7f80d7d045b0b17e860220837fa6 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 5 Feb 2016 10:24:23 +0100
> Subject: [PATCH] mm-oom_reaper-report-success-failure-fix
> 
> update the log message to be more specific
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/oom_kill.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 87d644c97ac9..ca61e6cfae52 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -479,7 +479,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
>  		}
>  	}
>  	tlb_finish_mmu(&tlb, 0, -1);
> -	pr_info("oom_reaper: reaped process :%d (%s) anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lulB\n",
> +	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lulB\n",

Dohh, s@lulB@lukB@

>  			task_pid_nr(tsk), tsk->comm,
>  			K(get_mm_counter(mm, MM_ANONPAGES)),
>  			K(get_mm_counter(mm, MM_FILEPAGES)),
> -- 
> 2.7.0
> 
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-04 14:43       ` Michal Hocko
@ 2016-02-06  6:45         ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-06  6:45 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Thu 04-02-16 15:43:19, Michal Hocko wrote:
> On Thu 04-02-16 23:22:18, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > When oom_reaper manages to unmap all the eligible vmas there shouldn't
> > > be much of the freable memory held by the oom victim left anymore so it
> > > makes sense to clear the TIF_MEMDIE flag for the victim and allow the
> > > OOM killer to select another task.
> > 
> > Just a confirmation. Is it safe to clear TIF_MEMDIE without reaching do_exit()
> > with regard to freezing_slow_path()? Since clearing TIF_MEMDIE from the OOM
> > reaper confuses
> > 
> >     wait_event(oom_victims_wait, !atomic_read(&oom_victims));
> > 
> > in oom_killer_disable(), I'm worrying that the freezing operation continues
> > before the OOM victim which escaped the __refrigerator() actually releases
> > memory. Does this cause consistency problem?
> 
> This is a good question! At first sight it seems this is not safe and we
> might need to make the oom_reaper freezable so that it doesn't wake up
> during suspend and interfere. Let me think about that.

OK, I was thinking about it some more and it seems you are right here.
oom_reaper as a kernel thread is not freezable automatically and so it
might interfere after all the processes/kernel threads are considered
frozen. Then it really might clear TIF_MEMDIE too early and wake up
oom_killer_disable. wait_event_freezable() is not sufficient because the
oom_reaper might be running while the PM freezer is freezing tasks, and
the freezer will miss it because it doesn't see it.

So I think we might need this. I am heading to vacation today and will
be offline for the next week so I will prepare the full patch with the
proper changelog after I get back:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ca61e6cfae52..7e9953a64489 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -521,6 +521,8 @@ static void oom_reap_task(struct task_struct *tsk)
 
 static int oom_reaper(void *unused)
 {
+	set_freezable();
+
 	while (true) {
 		struct task_struct *tsk = NULL;
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-05 11:14             ` Tetsuo Handa
@ 2016-02-06  8:30               ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-06  8:30 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Fri 05-02-16 20:14:40, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 05-02-16 00:08:25, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > > > +	/*
> > > > > > +	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
> > > > > > +	 * reasonably reclaimable memory anymore. OOM killer can continue
> > > > > > +	 * by selecting other victim if unmapping hasn't led to any
> > > > > > +	 * improvements. This also means that selecting this task doesn't
> > > > > > +	 * make any sense.
> > > > > > +	 */
> > > > > > +	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
> > > > > > +	exit_oom_victim(tsk);
> > > > >
> > > > > I noticed that updating only one thread group's oom_score_adj disables
> > > > > further wake_oom_reaper() calls due to a coarse-grained can_oom_reap check at
> > > > >
> > > > >   p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
> > > > >
> > > > > in oom_kill_process(). I think we need to either update the oom_score_adj
> > > > > of all thread groups using the reaped mm equally or use a more fine-grained
> > > > > can_oom_reap check which ignores OOM_SCORE_ADJ_MIN if all threads in that
> > > > > thread group are dying or exiting.
> > > >
> > > > I do not understand. Why would you want to reap the mm again when
> > > > this has been done already? The mm is shared, right?
> > >
> > > The mm is shared between previous victim and next victim, but these victims
> > > are in different thread groups. The OOM killer selects next victim whose mm
> > > was already reaped due to sharing previous victim's memory.
> >
> > OK, now I got your point. From your previous email it sounded like you
> > were talking about oom_reaper and its invocation, which was confusing.
> >
> > > We don't want the OOM killer to select such next victim.
> >
> > Yes, selecting such a task doesn't make much sense. It has been killed
> > so it has fatal_signal_pending. If it wanted to allocate it would get
> > TIF_MEMDIE already and its address space has been reaped so there is
> > nothing to free left. This CLONE_VM without CLONE_SIGHAND is a really
> > crazy combo, it is just causing troubles all over and I am not convinced
> > it is actually that helpful </rant>.
> >
> 
> I think moving the "whether a mm is reapable or not" check to the OOM
> reaper is preferable (shown below). In most cases, mm_is_reapable() will
> return true.

Why should we select such a task in the first place when it has already
been oom reaped? I think this check belongs in oom_scan_process_thread.
 
> ----------------------------------------
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index b42c6bc..fc114b3 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -426,6 +426,39 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>  static LIST_HEAD(oom_reaper_list);
>  static DEFINE_SPINLOCK(oom_reaper_lock);
>  
> +static bool mm_is_reapable(struct mm_struct *mm)
> +{
> +	struct task_struct *g;
> +	struct task_struct *p;
> +
> +	/*
> +	 * Since it is possible that p voluntarily called do_exit() or
> +	 * somebody other than the OOM killer sent SIGKILL to p, this mm, used
> +	 * by a task whose p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN, is
> +	 * reapable if p has a pending SIGKILL or has already reached do_exit().
> +	 *
> +	 * On the other hand, it is possible that mark_oom_victim(p) is called
> +	 * without sending SIGKILL to all tasks using this mm. In this case,
> +	 * the OOM reaper cannot reap this mm unless p is the only task using
> +	 * this mm.
> +	 *
> +	 * Therefore, determine whether this mm is reapable by testing whether
> +	 * all tasks using this mm are dying or already exiting rather than
> +	 * depending on p->signal->oom_score_adj value which is updated by the
> +	 * OOM reaper.
> +	 */
> +	rcu_read_lock();
> +	for_each_process_thread(g, p) {
> +		if (mm != READ_ONCE(p->mm) ||
> +		    fatal_signal_pending(p) || (p->flags & PF_EXITING))
> +			continue;
> +		mm = NULL;
> +		goto out;
> +	}
> + out:
> +	rcu_read_unlock();
> +	return mm != NULL;
> +}
>  
>  static bool __oom_reap_task(struct task_struct *tsk)
>  {
> @@ -455,7 +488,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
>  
>  	task_unlock(p);
>  
> -	if (!down_read_trylock(&mm->mmap_sem)) {
> +	if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) {
>  		ret = false;
>  		goto out;
>  	}
> @@ -596,6 +629,7 @@ void mark_oom_victim(struct task_struct *tsk)
>  	 */
>  	__thaw_task(tsk);
>  	atomic_inc(&oom_victims);
> +	wake_oom_reaper(tsk);
>  }
>  
>  /**
> @@ -680,7 +714,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	unsigned int victim_points = 0;
>  	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>  					      DEFAULT_RATELIMIT_BURST);
> -	bool can_oom_reap = true;
>  
>  	/*
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
> @@ -771,23 +804,17 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			continue;
>  		if (same_thread_group(p, victim))
>  			continue;
> -		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
> -		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> -			/*
> -			 * We cannot use oom_reaper for the mm shared by this
> -			 * process because it wouldn't get killed and so the
> -			 * memory might be still used.
> -			 */
> -			can_oom_reap = false;
> +		if (unlikely(p->flags & PF_KTHREAD))
>  			continue;
> -		}
> +		if (is_global_init(p))
> +			continue;
> +		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> +			continue;
> +
>  		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
>  	}
>  	rcu_read_unlock();
>  
> -	if (can_oom_reap)
> -		wake_oom_reaper(victim);
> -
>  	mmdrop(mm);
>  	put_task_struct(victim);
>  }
> ----------------------------------------

this is unnecessarily complex IMO

> Then, I think we need to kill two lies in the allocation retry loop.

This is completely unrelated to the topic discussed here.
[...]

> By moving the "whether a mm is reapable or not" check to the OOM reaper, we
> can delegate the duty of clearing TIF_MEMDIE to the OOM reaper because the
> OOM reaper is tracking all TIF_MEMDIE tasks. Since mm_is_reapable() returns
> true in most situations, it becomes an unlikely corner case that we need to
> clear TIF_MEMDIE and prevent the OOM killer from setting TIF_MEMDIE on the
> same task again when the OOM reaper gives up. Like you commented in
> [PATCH 5/5], falling back to a simple timer would be sufficient for handling
> such corner cases.

I am not really sure I understand what you are trying to say here, to be
honest, but no, I am not going to add any timers at this stage.

> | I would really prefer to go a simpler way first and extend the code when
> | we see the current approach insufficient for real life loads. Please do
> | not get me wrong, of course the code can be enhanced in many different
> | ways and optimize for lots of pathological cases but I really believe
> | that we should start with correctness first and only later care about
> | optimizing corner cases.
> 
> >
> > > Maybe set MMF_OOM_REAP_DONE on
> > > the previous victim's mm and check it instead of TIF_MEMDIE when selecting
> > > a victim? That will also avoid problems caused by clearing TIF_MEMDIE?
> >
> > Hmm, it doesn't seem we are under MMF_ available bits pressure right now
> > so using the flag sounds like the easiest way to go. Then we don't even
> > have to play with OOM_SCORE_ADJ_MIN, which might be updated from
> > userspace after the oom reaper has done that. Care to send a patch?
> 
> Not only do we not need to worry about ->oom_score_adj being modified from
> outside the SIGKILL-pending tasks, I think we also don't need to clear remote
> TIF_MEMDIE if we use MMF_OOM_REAP_DONE. Something like the untested patch below?

Dropping TIF_MEMDIE will help to unlock the OOM killer as soon as we know
the current victim is no longer interesting for the OOM killer, to allow
further victim selection. If we add MMF_OOM_REAP_DONE after reaping and
oom_scan_process_thread is taught to ignore those, you will get all cases
of shared memory handled properly AFAICS. Such a patch should be a really
trivial enhancement on top of the current code.

[...]

> I suggested many changes in this post because [PATCH 3/5] and [PATCH 5/5]
> made it possible for us to simplify [PATCH 1/5] back to something like the
> old versions. I think you will want to rebuild this series with these
> changes merged as appropriate.

I would really _appreciate_ incremental changes as mentioned several
times already. And it would be really helpful if you could stick to
that. If you want to propose additional enhancements on top of the
current code you are free to do so, but I am really reluctant to respin
everything to add more stuff into each patch and risk the introduction of
new bugs. As things stand now, the patches aim to be as simple as
possible, they do not introduce new bugs (except for the PM freezer issue,
which should be fixable by a trivial patch that I will post after I get
back from vacation), and they improve things considerably in their current
form already.

I would like to target the next merge window rather than have this out
of tree for another release cycle, which means that we should really
focus on the current functionality and make sure we haven't missed
anything. As there is no fundamental disagreement with the approach, all
the rest are just technicalities.

Thanks

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing
  2016-02-06  5:54         ` Tetsuo Handa
@ 2016-02-06  8:37           ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-06  8:37 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Sat 06-02-16 14:54:24, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > But if we consider non system-wide OOM events, it is not very unlikely to hit
> > > this race. This queue is useful for situations where memcg1 and memcg2 hit
> > > memcg OOM at the same time and victim1 in memcg1 cannot terminate immediately.
> > 
> > This can happen of course but the likelihood is _much_ smaller without
> > the global OOM because the memcg OOM killer is invoked from a lockless
> > context so the oom context cannot block the victim to proceed.
> 
> Suppose mem_cgroup_out_of_memory() is called from a lockless context via
> mem_cgroup_oom_synchronize() called from pagefault_out_of_memory(); that
> "lockless" refers only to the current thread, doesn't it?

Yes, and you need the OOM context to sit on the same lock as the victim
to form a deadlock. So while the victim might be blocked somewhere, it is
much less likely that it would be deadlocked.

> Since oom_kill_process() sets TIF_MEMDIE on the first mm!=NULL thread of a
> victim process, it is possible that a non-first mm!=NULL thread triggers
> pagefault_out_of_memory() while the first mm!=NULL thread gets TIF_MEMDIE,
> isn't it?

I got lost here completely. Maybe it is your usage of thread terminology
again.
 
> Then, where is the guarantee that victim1 (the first mm!=NULL thread in
> memcg1 which got TIF_MEMDIE) is not waiting at down_read(&victim2->mm->mmap_sem)
> while victim2 (the first mm!=NULL thread in memcg2 which got TIF_MEMDIE) is
> waiting at down_write(&victim2->mm->mmap_sem)

All threads/processes sharing the same mm are in fact in the same memory
cgroup. That is the reason we have owner in the task_struct.

> or that victim1 and victim2
> are not both waiting on a lock somewhere in the memory reclaim path (e.g.
> mutex_lock(&inode->i_mutex))?

Such waiting has to make forward progress at some point in time
because the lock itself cannot be deadlocked by the memcg OOM context.


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-06  8:30               ` Michal Hocko
@ 2016-02-06 11:23                 ` Tetsuo Handa
  -1 siblings, 0 replies; 112+ messages in thread
From: Tetsuo Handa @ 2016-02-06 11:23 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> I am not really sure I understand what you are trying to say here, to be
> honest, but no, I am not going to add any timers at this stage.

> Dropping TIF_MEMDIE will help to unlock the OOM killer as soon as we know
> the current victim is no longer interesting for the OOM killer, to allow
> further victim selection. If we add MMF_OOM_REAP_DONE after reaping and
> oom_scan_process_thread is taught to ignore those, you will get all cases
> of shared memory handled properly AFAICS. Such a patch should be a really
> trivial enhancement on top of the current code.

What I'm trying to say is that we should prepare for corner cases where
dropping TIF_MEMDIE (or reaping the victim's memory) does not take place.

What happens if TIF_MEMDIE was set at

	if (current->mm &&
	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
		mark_oom_victim(current);
		return true;
	}

in out_of_memory() because the current thread received SIGKILL while doing
the GFP_KERNEL allocation of a

  Do a GFP_KERNEL allocation.
  Bail out if GFP_KERNEL allocation failed.
  Hold a lock.
  Do a GFP_NOFS allocation.
  Release a lock.

sequence? If the current thread is blocked waiting for that lock, held by
somebody else doing memory allocation, nothing will unlock the OOM killer
(because the OOM reaper is not woken up and a timer for unlocking the OOM
killer does not exist).
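(A sketch of that allocation sequence, with hypothetical names throughout;
if the GFP_KERNEL allocation makes the current thread an OOM victim and
my_lock is held by another allocating task, the victim can block at
mutex_lock() as described above.)

----------------------------------------
#include <linux/gfp.h>
#include <linux/mutex.h>
#include <linux/slab.h>

/* Hypothetical lock shared with other allocating tasks. */
static DEFINE_MUTEX(my_lock);

static int my_prepare(void)
{
	void *a, *b;

	a = kmalloc(PAGE_SIZE, GFP_KERNEL);	/* may invoke the OOM killer */
	if (!a)
		return -ENOMEM;			/* bail out on failure */

	mutex_lock(&my_lock);			/* may block behind another allocator */
	b = kmalloc(PAGE_SIZE, GFP_NOFS);	/* !__GFP_FS allocation */
	mutex_unlock(&my_lock);

	kfree(a);
	if (!b)
		return -ENOMEM;
	kfree(b);
	return 0;
}
----------------------------------------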

What happens if TIF_MEMDIE was set at

	task_lock(p);
	if (p->mm && task_will_free_mem(p)) {
		mark_oom_victim(p);
		task_unlock(p);
		put_task_struct(p);
		return;
	}
	task_unlock(p);

in oom_kill_process() when p is waiting for a lock held by somebody else
doing memory allocation? Since the OOM reaper will not be woken up,
nothing will unlock the OOM killer.

If TIF_MEMDIE was set at

	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
	mark_oom_victim(victim);
	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",

in oom_kill_process() but the OOM reaper was not woken up because of the
p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN case, nothing will unlock
the OOM killer.

By always waking the OOM reaper up, we can delegate the duty of unlocking
the OOM killer (by clearing TIF_MEMDIE or some other means) to the OOM
reaper because the OOM reaper is tracking all TIF_MEMDIE tasks.

Of course, it is possible that we handle such corner cases by adding
a timer for unlocking the OOM killer (regardless of the availability of the
OOM reaper). But if we do the "whether a mm is reapable or not" check on the
OOM reaper side by waking the OOM reaper up whenever TIF_MEMDIE is set
on some task, we can increase the likelihood that dropping TIF_MEMDIE (after
reaping the victim's memory) actually takes place, and reduce the frequency
of killing more OOM victims via a timer for unlocking the OOM killer.
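(Purely as an illustration of that idea, a hypothetical sketch of a timer
for unlocking the OOM killer, based on the exit_oom_victim(tsk) signature
used in this series; reference counting and races are ignored for brevity,
and none of this is in the posted patches.)

----------------------------------------
#include <linux/oom.h>
#include <linux/sched.h>
#include <linux/workqueue.h>

#define OOM_UNLOCK_TIMEOUT	(10 * HZ)

static struct task_struct *stalled_victim;	/* hypothetical bookkeeping */

static void oom_unlock_timeout_fn(struct work_struct *work)
{
	struct task_struct *tsk = stalled_victim;

	/* Drop TIF_MEMDIE after a grace period so that the OOM killer
	 * can select another victim even if reaping never succeeded. */
	if (tsk && test_tsk_thread_flag(tsk, TIF_MEMDIE))
		exit_oom_victim(tsk);
}
static DECLARE_DELAYED_WORK(oom_unlock_work, oom_unlock_timeout_fn);

/* Hypothetical hook, called when the OOM reaper gives up on a victim. */
static void arm_oom_unlock_timer(struct task_struct *tsk)
{
	stalled_victim = tsk;
	schedule_delayed_work(&oom_unlock_work, OOM_UNLOCK_TIMEOUT);
}
----------------------------------------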

> I would like to target the next merge window rather than have this out
> of tree for another release cycle, which means that we should really
> focus on the current functionality and make sure we haven't missed
> anything. As there is no fundamental disagreement with the approach, all
> the rest are just technicalities.

Of course, we can target the OOM reaper for the next merge window. I'm
suggesting that my changes would help handle corner cases (bugs) that
you are not paying attention to.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 1/5] mm, oom: introduce oom reaper
  2016-02-03 13:13   ` Michal Hocko
@ 2016-02-06 13:22     ` Tetsuo Handa
  -1 siblings, 0 replies; 112+ messages in thread
From: Tetsuo Handa @ 2016-02-06 13:22 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: rientjes, mgorman, oleg, torvalds, hughd, andrea, riel, linux-mm,
	linux-kernel, mhocko

Michal Hocko wrote:
> There is one notable exception to this, though: if the OOM victim was
> in the process of coredumping, the result would be incomplete. This is
> considered a reasonable constraint because the overall system health is
> more important than debuggability of a particular application.

Is it possible to clarify what "the result would be incomplete" means?

  (1) The size of the coredump file becomes smaller than it should be, and
      data in reaped pages is not included in the file.

  (2) The size of the coredump file does not change, and data in reaped pages
      is included in the file as NUL bytes.

  (3) The size of the coredump file does not change, and data in reaped pages
      is included in the file as-is (i.e. an information leak security risk).

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-06  6:45         ` Michal Hocko
@ 2016-02-06 14:33           ` Tetsuo Handa
  -1 siblings, 0 replies; 112+ messages in thread
From: Tetsuo Handa @ 2016-02-06 14:33 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> On Thu 04-02-16 15:43:19, Michal Hocko wrote:
> > On Thu 04-02-16 23:22:18, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > From: Michal Hocko <mhocko@suse.com>
> > > > 
> > > > When oom_reaper manages to unmap all the eligible vmas there shouldn't
> > > > be much of the freeable memory held by the oom victim left anymore so it
> > > > makes sense to clear the TIF_MEMDIE flag for the victim and allow the
> > > > OOM killer to select another task.
> > > 
> > > Just a confirmation. Is it safe to clear TIF_MEMDIE without reaching do_exit()
> > > with regard to freezing_slow_path()? Since clearing TIF_MEMDIE from the OOM
> > > reaper confuses
> > > 
> > >     wait_event(oom_victims_wait, !atomic_read(&oom_victims));
> > > 
> > > in oom_killer_disable(), I'm worried that the freezing operation continues
> > > before the OOM victim which escaped the __refrigerator() actually releases
> > > memory. Does this cause a consistency problem?
> > 
> > This is a good question! At first sight it seems this is not safe and we
> > might need to make the oom_reaper freezable so that it doesn't wake up
> > during suspend and interfere. Let me think about that.
> 
> OK, I was thinking about it some more and it seems you are right here.
> oom_reaper as a kernel thread is not freezable automatically and so it
> might interfere after all the processes/kernel threads are considered
> frozen. Then it really might shut down TIF_MEMDIE too early and wake up
> oom_killer_disable. wait_event_freezable is not sufficient because the
> oom_reaper might be running while the PM freezer is freezing tasks, and
> the freezer will miss it because it doesn't see it.

I'm not using the PM freezer, but your answer is the opposite of my guess.
I thought try_to_freeze_tasks(false) is called by freeze_kernel_threads()
after oom_killer_disable() succeeded, and try_to_freeze_tasks(false) will
freeze both userspace tasks (including OOM victims which got TIF_MEMDIE
cleared by the OOM reaper) and kernel threads (including the OOM reaper).
Thus, I was guessing that clearing TIF_MEMDIE without reaching do_exit() is
safe.

> 
> So I think we might need this. I am heading to vacation today and will
> be offline for the next week so I will prepare the full patch with the
> proper changelog after I get back:
> 
I can't judge whether we need this set_freezable().

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ca61e6cfae52..7e9953a64489 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -521,6 +521,8 @@ static void oom_reap_task(struct task_struct *tsk)
>  
>  static int oom_reaper(void *unused)
>  {
> +	set_freezable();
> +
>  	while (true) {
>  		struct task_struct *tsk = NULL;
>  
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing
  2016-02-06  8:37           ` Michal Hocko
@ 2016-02-06 15:33             ` Tetsuo Handa
  -1 siblings, 0 replies; 112+ messages in thread
From: Tetsuo Handa @ 2016-02-06 15:33 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> On Sat 06-02-16 14:54:24, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > But if we consider non system-wide OOM events, it is not very unlikely to hit
> > > > this race. This queue is useful for situations where memcg1 and memcg2 hit
> > > > memcg OOM at the same time and victim1 in memcg1 cannot terminate immediately.
> > > 
> > > This can happen of course but the likelihood is _much_ smaller without
> > > the global OOM because the memcg OOM killer is invoked from a lockless
> > > context, so the oom context cannot block the victim from proceeding.
> > 
> > Suppose mem_cgroup_out_of_memory() is called from a lockless context via
> > mem_cgroup_oom_synchronize() called from pagefault_out_of_memory(), that
> > "lockless" is talking about only current thread, doesn't it?
> 
> Yes and you need the OOM context to sit on the same lock as the victim
> to form a deadlock. So while the victim might be blocked somewhere it is
> much less likely it would be deadlocked.
> 
> > Since oom_kill_process() sets TIF_MEMDIE on the first mm!=NULL thread of a
> > victim process, it is possible that a non-first mm!=NULL thread triggers
> > pagefault_out_of_memory() and the first mm!=NULL thread gets TIF_MEMDIE,
> > isn't it?
> 
> I got lost here completely. Maybe it is your usage of thread terminology
> again.

I'm using "process" == "thread group" which contains at least one "thread",
and "thread" == "struct task_struct".
My assumption is

   (1) app1 process has two threads named app1t1 and app1t2
   (2) app2 process has two threads named app2t1 and app2t2
   (3) app1t1->mm == app1t2->mm != NULL and app2t1->mm == app2t2->mm != NULL
   (4) app1 is in memcg1 and app2 is in memcg2

and sequence is

   (1) app1t2 triggers pagefault_out_of_memory()
   (2) app1t2 calls mem_cgroup_out_of_memory() via mem_cgroup_oom_synchronize()
   (3) oom_scan_process_thread() selects app1 as an OOM victim process
   (4) find_lock_task_mm() selects app1t1 as an OOM victim thread
   (5) app1t1 gets TIF_MEMDIE
   (6) app2t2 triggers pagefault_out_of_memory()
   (7) app2t2 calls mem_cgroup_out_of_memory() via mem_cgroup_oom_synchronize()
   (8) oom_scan_process_thread() selects app2 as an OOM victim process
   (9) find_lock_task_mm() selects app2t1 as an OOM victim thread
   (10) app2t1 gets TIF_MEMDIE

.

I'm talking about a situation where app1t1 is blocked at down_write(&app1t1->mm->mmap_sem)
because somebody else is already waiting at down_read(&app1t1->mm->mmap_sem) or is
doing memory allocation between down_read(&app1t1->mm->mmap_sem) and
up_read(&app1t1->mm->mmap_sem). In this case, this [PATCH 5/5] helps the OOM reaper to
reap app2t1->mm after giving up waiting for down_read(&app1t1->mm->mmap_sem) to succeed.
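
To make the blocking pattern concrete, here is a small userspace sketch
(a pthread rwlock standing in for mmap_sem; all names are made up):

----------
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t fake_mmap_sem = PTHREAD_RWLOCK_INITIALIZER;

/* Stand-in for a task that holds mmap_sem for read and then blocks
 * inside a memory allocation which cannot make progress. */
static void *stuck_reader(void *unused)
{
	pthread_rwlock_rdlock(&fake_mmap_sem);
	pause();			/* "allocating" forever */
	pthread_rwlock_unlock(&fake_mmap_sem);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, stuck_reader, NULL);
	sleep(1);
	/* app1t1's exit path now needs the lock for write and blocks
	 * indefinitely. The OOM reaper avoids the same fate by merely
	 * trying the lock and, on failure, moving on to the next
	 * queued victim (app2t1->mm above). */
	pthread_rwlock_wrlock(&fake_mmap_sem);
	puts("never reached while the reader is stuck");
	return 0;
}
----------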

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing
  2016-02-06 15:33             ` Tetsuo Handa
@ 2016-02-15 20:15               ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-15 20:15 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Sun 07-02-16 00:33:38, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sat 06-02-16 14:54:24, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > > But if we consider non system-wide OOM events, it is not very unlikely to hit
> > > > > this race. This queue is useful for situations where memcg1 and memcg2 hit
> > > > > memcg OOM at the same time and victim1 in memcg1 cannot terminate immediately.
> > > > 
> > > > This can happen of course but the likelihood is _much_ smaller without
> > > > the global OOM because the memcg OOM killer is invoked from a lockless
> > > > context, so the oom context cannot block the victim from proceeding.
> > > 
> > > Suppose mem_cgroup_out_of_memory() is called from a lockless context via
> > > mem_cgroup_oom_synchronize() called from pagefault_out_of_memory(), that
> > > "lockless" is talking about only current thread, doesn't it?
> > 
> > Yes and you need the OOM context to sit on the same lock as the victim
> > to form a deadlock. So while the victim might be blocked somewhere it is
> > much less likely it would be deadlocked.
> > 
> > > Since oom_kill_process() sets TIF_MEMDIE on the first mm!=NULL thread of a
> > > victim process, it is possible that a non-first mm!=NULL thread triggers
> > > pagefault_out_of_memory() and the first mm!=NULL thread gets TIF_MEMDIE,
> > > isn't it?
> > 
> > I got lost here completely. Maybe it is your usage of thread terminology
> > again.
> 
> I'm using "process" == "thread group" which contains at least one "thread",
> and "thread" == "struct task_struct".
> My assumption is
> 
>    (1) app1 process has two threads named app1t1 and app1t2
>    (2) app2 process has two threads named app2t1 and app2t2
>    (3) app1t1->mm == app1t2->mm != NULL and app2t1->mm == app2t2->mm != NULL
>    (4) app1 is in memcg1 and app2 is in memcg2
> 
> and sequence is
> 
>    (1) app1t2 triggers pagefault_out_of_memory()
>    (2) app1t2 calls mem_cgroup_out_of_memory() via mem_cgroup_oom_synchronize()
>    (3) oom_scan_process_thread() selects app1 as an OOM victim process
>    (4) find_lock_task_mm() selects app1t1 as an OOM victim thread
>    (5) app1t1 gets TIF_MEMDIE

OK so we have a victim in memcg1 and app1t2 will get to do_exit right away
because we are in the page fault path...

>    (6) app2t2 triggers pagefault_out_of_memory()
>    (7) app2t2 calls mem_cgroup_out_of_memory() via mem_cgroup_oom_synchronize()
>    (8) oom_scan_process_thread() selects app2 as an OOM victim process
>    (9) find_lock_task_mm() selects app2t1 as an OOM victim thread
>    (10) app2t1 gets TIF_MEMDIE
> 
> .
> 
> I'm talking about a situation where app1t1 is blocked at down_write(&app1t1->mm->mmap_sem)
> because somebody else is already waiting at down_read(&app1t1->mm->mmap_sem) or is
> doing memory allocation between down_read(&app1t1->mm->mmap_sem) and
> up_read(&app1t1->mm->mmap_sem).

Unless we are under global OOM, this doesn't matter much, because the
allocation request should succeed at some point in time and memcg
charges are bypassed for tasks with pending fatal signals. So we can
make forward progress.
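
For reference, the bypass being referred to looks roughly like this
(a simplified sketch, not the verbatim mm/memcontrol.c try_charge()
code; the helper name is made up):

----------
#include <linux/sched.h>

/* A task that is already dying has its memcg charge let through so
 * that it can proceed to exit and release memory. */
static bool memcg_charge_bypassed(struct task_struct *p)
{
	return test_tsk_thread_flag(p, TIF_MEMDIE) ||
	       fatal_signal_pending(p) ||
	       (p->flags & PF_EXITING);
}
----------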

> In this case, this [PATCH 5/5] helps the OOM reaper to reap app2t1->mm
> after giving up waiting for down_read(&app1t1->mm->mmap_sem) to
> succeed.

Why does that matter at all?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 112+ messages in thread

* [PATCH 3.1/5] oom: make oom_reaper freezable
  2016-02-06 14:33           ` Tetsuo Handa
@ 2016-02-15 20:40             ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-15 20:40 UTC (permalink / raw)
  To: akpm, Tetsuo Handa
  Cc: rientjes, mgorman, oleg, torvalds, hughd, andrea, riel, linux-mm,
	linux-kernel

Andrew,
this can either be folded into the 3/5 patch or go as a standalone one. I
would be inclined to have it standalone for the record (the description
should be pretty clear about the intention) and because the issue is
highly unlikely: an OOM during PM freezing doesn't happen in 99% of cases.

On Sat 06-02-16 23:33:24, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > OK, I was thinking about it some more and it seems you are right here.
> > oom_reaper as a kernel thread is not freezable automatically and so it
> > might interfere after all the processes/kernel threads are considered
> > frozen. Then it really might shut down TIF_MEMDIE too early and wake up
> > oom_killer_disable. wait_event_freezable is not sufficient because the
> > oom_reaper might be running while the PM freezer is freezing tasks, and
> > the freezer will miss it because it doesn't see it.
> 
> I'm not using the PM freezer, but your answer is the opposite of my guess.
> I thought try_to_freeze_tasks(false) is called by freeze_kernel_threads()
> after oom_killer_disable() succeeded, and try_to_freeze_tasks(false) will
> freeze both userspace tasks (including OOM victims which got TIF_MEMDIE
> cleared by the OOM reaper) and kernel threads (including the OOM reaper).

kernel threads which are not freezable are ignored by the freezer.

> Thus, I was guessing that clearing TIF_MEMDIE without reaching do_exit() is
> safe.

Does the following explain it better?
---
From d7f57b1ac07532657312c91f3bba67cf0332b32f Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 15 Feb 2016 10:09:39 +0100
Subject: [PATCH] oom: make oom_reaper freezable

After "oom: clear TIF_MEMDIE after oom_reaper managed to unmap the
address space" oom_reaper will call exit_oom_victim on the target
task after it is done. This might however race with the PM freezer:

CPU0				CPU1				CPU2
freeze_processes
  try_to_freeze_tasks
  				# Allocation request
				out_of_memory
  oom_killer_disable
				  wake_oom_reaper(P1)
				  				__oom_reap_task
								  exit_oom_victim(P1)
    wait_event(oom_victims==0)
[...]
    				do_exit(P1)
				  perform IO/interfere with the freezer

which breaks the oom_killer_disable semantic. We no longer have a
guarantee that the oom victim won't interfere with the freezer because
it might be anywhere on the way to do_exit while the freezer thinks the
task has already terminated. It might trigger IO or touch devices which
are frozen already.

In order to close this race, make the oom_reaper thread freezable. This
will work because
	a) an already running oom_reaper will block the freezer from
	   entering the quiescent state
	b) wake_oom_reaper will not wake up the reaper after it has been
	   frozen
	c) the only way to call exit_oom_victim after try_to_freeze_tasks
	   is from the oom victim's context when we know the further
	   interference shouldn't be possible

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ca61e6cfae52..7e9953a64489 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -521,6 +521,8 @@ static void oom_reap_task(struct task_struct *tsk)
 
 static int oom_reaper(void *unused)
 {
+	set_freezable();
+
 	while (true) {
 		struct task_struct *tsk = NULL;
 
-- 
2.7.0

-- 
Michal Hocko
SUSE Labs
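
For reference, the freezable-kthread pattern the patch relies on looks
like this (an illustrative sketch with invented names, not the actual
oom_reaper code):

----------
#include <linux/freezer.h>
#include <linux/kthread.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(example_wait);
static bool example_work_ready;	/* set by a hypothetical waker */

/* set_freezable() opts the kthread into the PM freezer, so
 * try_to_freeze_tasks() waits for it, and wait_event_freezable()
 * parks it in the refrigerator instead of letting it run on while
 * everything else is frozen. */
static int example_thread(void *unused)
{
	set_freezable();

	while (!kthread_should_stop()) {
		wait_event_freezable(example_wait,
				     READ_ONCE(example_work_ready) ||
				     kthread_should_stop());
		WRITE_ONCE(example_work_ready, false);
		/* ... process the queued work here ... */
	}
	return 0;
}
----------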

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-06 11:23                 ` Tetsuo Handa
@ 2016-02-15 20:47                   ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-15 20:47 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Sat 06-02-16 20:23:43, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> By always waking the OOM reaper up, we can delegate the duty of unlocking
> the OOM killer (by clearing TIF_MEMDIE or some other means) to the OOM
> reaper because the OOM reaper is tracking all TIF_MEMDIE tasks.

And again, I didn't say this would be incorrect. I am just saying that
this will get more complex if we want to handle all the cases properly.

> > I would like to target the next merge window rather than have this out
> > of tree for another release cycle which means that we should really
> > focus on the current functionality and make sure we haven't missed
> > anything. As there is no fundamental disagreement to the approach all
> > the rest are just technicalities.
> 
> Of course, we can target the OOM reaper for the next merge window. I'm
> suggesting you that my changes would help handling corner cases (bugs)
> you are not paying attention to.

I am paying attention to them. I just think that incremental changes are
preferable and we should start with simpler cases before taking further
steps. There is no reason to rush this.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 1/5] mm, oom: introduce oom reaper
  2016-02-06 13:22     ` Tetsuo Handa
@ 2016-02-15 20:50       ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-15 20:50 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Sat 06-02-16 22:22:20, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > There is one notable exception to this, though: if the OOM victim was
> > in the process of coredumping, the result would be incomplete. This is
> > considered a reasonable constraint because the overall system health is
> > more important than debuggability of a particular application.
> 
> Is it possible to clarify what "the result would be incomplete" means?
> 
>   (1) The size of the coredump file becomes smaller than it should be, and
>       data in reaped pages is not included in the file.
> 
>   (2) The size of the coredump file does not change, and data in reaped pages
>       is included in the file as NUL bytes.

AFAIU this will be the case. We are not destroying VMAs, we are just
unmapping the page ranges. So what would happen is that the core dump
will contain zero pages for anonymous mappings. This might change in
the future though, because the oom reaper might be extended to do more
work (e.g. drop the associated page tables, in which case I would expect
the core dumping could SEGV).
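
A quick userspace way to observe this effect, using madvise(MADV_DONTNEED)
as a rough stand-in for the reaper's unmapping:

----------
#include <assert.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	/* Map one anonymous page and dirty it. */
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	assert(p != MAP_FAILED);
	memset(p, 0xaa, 4096);

	/* Drop the page contents while keeping the VMA intact --
	 * roughly what the oom_reaper's unmapping does. */
	madvise(p, 4096, MADV_DONTNEED);

	/* The VMA still exists, so a read faults in a fresh zero
	 * page; a core dump taken now would contain NUL bytes. */
	assert(p[0] == 0);
	return 0;
}
----------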

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing
  2016-02-15 20:15               ` Michal Hocko
@ 2016-02-16 11:11                 ` Tetsuo Handa
  -1 siblings, 0 replies; 112+ messages in thread
From: Tetsuo Handa @ 2016-02-16 11:11 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> Unless we are under global OOM, this doesn't matter much, because the
> allocation request should succeed at some point in time and memcg
> charges are bypassed for tasks with pending fatal signals. So we can
> make forward progress.

Hmm, then I wonder how memcg OOM livelock occurs. Anyway, OK for now.

But the current OOM reaper lacks protection against a list item "double add"
bug. Precisely speaking, this is not the OOM reaper's bug.
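
For illustration, a minimal userspace sketch of the "double add" corruption
itself (same node layout as the kernel's struct list_head); the reproducer
which actually triggers it through the OOM killer follows below:

----------
#include <stdio.h>

/* Minimal circular doubly-linked list, the same shape as the
 * kernel's struct list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_add(struct list_head *new, struct list_head *head)
{
	new->next = head->next;
	new->prev = head;
	head->next->prev = new;
	head->next = new;
}

int main(void)
{
	struct list_head head = { &head, &head };
	struct list_head victim;

	list_add(&victim, &head);
	/* A second wake_oom_reaper() for the same task does this: */
	list_add(&victim, &head);

	/* The node now points at itself, so a list walk never gets
	 * back to head, and a later list_del() leaves poisoned
	 * pointers (the dead0000... LIST_POISON values visible in
	 * the register dump below). */
	printf("victim.next == &victim? %d\n", victim.next == &victim);
	return 0;
}
----------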

----------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

/* Children: append to /tmp/file until write() fails (filesystem full),
 * generating lots of page cache. */
static int file_writer(void)
{
	static char buffer[4096] = { }; 
	const int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);
	while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer));
	return 0;
}

/* Parent: after the writers start, grow an allocation to the largest
 * size realloc() allows, then fault it all in at once. */
static int memory_consumer(void)
{
	const int fd = open("/dev/zero", O_RDONLY);
	unsigned long size;
	char *buf = NULL;
	sleep(1);
	unlink("/tmp/file");
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	read(fd, buf, size); /* Will cause OOM due to overcommit */
	return 0;
}

int main(int argc, char *argv[])
{
	int i;
	for (i = 0; i < 1024; i++)
		if (fork() == 0) {
			file_writer();
			_exit(0);
		}
	memory_consumer();
	while (1)
		pause();
}
----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160216.txt.xz .
----------
[  140.758667] a.out invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[  140.760706] a.out cpuset=/ mems_allowed=0
(...snipped...)
[  140.860676] Out of memory (oom_kill_allocating_task): Kill process 10595 (a.out) score 0 or sacrifice child
[  140.864883] Killed process 10596 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[  140.868483] oom_reaper: reaped process 10596 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
** 3 printk messages dropped ** [  206.416481] Out of memory (oom_kill_allocating_task): Kill process 10595 (a.out) score 0 or sacrifice child
** 2 printk messages dropped ** [  206.418908] Killed process 10600 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
** 2 printk messages dropped ** [  206.421956] Out of memory (oom_kill_allocating_task): Kill process 10595 (a.out) score 0 or sacrifice child
** 2 printk messages dropped ** [  206.424293] INFO: rcu_sched self-detected stall on CPU
** 3 printk messages dropped ** [  206.424300] oom_reaper      R  running task        0    33      2 0x00000008
[  206.424302]  ffff88007cd35900 000000000b04b66b ffff88007fcc3dd0 ffffffff8109db68
** 1 printk messages dropped ** [  206.424304]  ffffffff810a0bc4 0000000000000003 ffff88007fcc3e18 ffffffff8113e092
** 4 printk messages dropped ** [  206.424316]  [<ffffffff8113e092>] rcu_dump_cpu_stacks+0x73/0x94
** 3 printk messages dropped ** [  206.424322]  [<ffffffff810e2d44>] update_process_times+0x34/0x60
** 2 printk messages dropped ** [  206.424327]  [<ffffffff810f2a50>] ? tick_sched_handle.isra.20+0x40/0x40
** 4 printk messages dropped ** [  206.424334]  [<ffffffff81049598>] smp_apic_timer_interrupt+0x38/0x50
** 3 printk messages dropped ** [  206.424342]  [<ffffffff810d0f9a>] vprintk_default+0x1a/0x20
** 2 printk messages dropped ** [  206.424346]  [<ffffffff81708291>] ? _raw_spin_unlock_irqrestore+0x31/0x60
** 5 printk messages dropped ** [  206.424355]  [<ffffffff81090950>] ? kthread_create_on_node+0x230/0x230
** 2 printk messages dropped ** [  206.431196] Out of memory (oom_kill_allocating_task): Kill process 10595 (a.out) score 0 or sacrifice child
** 3 printk messages dropped ** [  206.434491] Out of memory (oom_kill_allocating_task): Kill process 10595 (a.out) score 0 or sacrifice child
** 1 printk messages dropped ** [  206.436152] Out of memory (oom_kill_allocating_task): Kill process 10595 (a.out) score 0 or sacrifice child
** 3 printk messages dropped ** [  206.439359] Out of memory (oom_kill_allocating_task): Kill process 10595 (a.out) score 0 or sacrifice child
(...snipped...)
[  312.387913] general protection fault: 0000 [#1] SMP 
[  312.387943] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw iptable_filter coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel glue_helper lrw gf128mul ablk_helper cryptd ppdev vmw_balloon pcspkr sg parport_pc shpchp parport i2c_piix4 vmw_vmci ip_tables sd_mod ata_generic pata_acpi crc32c_intel serio_raw mptspi scsi_transport_spi mptscsih vmwgfx mptbase drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci ttm libahci drm ata_piix e1000 libata i2c_core
[  312.387945] CPU: 0 PID: 33 Comm: oom_reaper Not tainted 4.5.0-rc4-next-20160216 #305
[  312.387946] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  312.387947] task: ffff88007cd35900 ti: ffff88007cdf0000 task.ti: ffff88007cdf0000
[  312.387953] RIP: 0010:[<ffffffff8114425c>]  [<ffffffff8114425c>] oom_reaper+0x9c/0x1e0
[  312.387954] RSP: 0018:ffff88007cdf3e00  EFLAGS: 00010287
[  312.387954] RAX: dead000000000200 RBX: ffff8800621bac80 RCX: 0000000000000001
[  312.387955] RDX: dead000000000100 RSI: 0000000000000000 RDI: ffffffff81c4cac0
[  312.387955] RBP: ffff88007cdf3e60 R08: 0000000000000001 R09: 0000000000000000
[  312.387956] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007cd35900
[  312.387956] R13: 0000000000000001 R14: ffff88007cdf3e20 R15: ffff8800621bbe50
[  312.387957] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[  312.387958] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  312.387958] CR2: 00007f6d8eaa5000 CR3: 000000006361e000 CR4: 00000000001406f0
[  312.387988] Stack:
[  312.387990]  ffff88007cd35900 ffffffff00000000 ffff88007cd35900 ffffffff810b6100
[  312.387991]  ffff88007cdf3e20 ffff88007cdf3e20 000000000b04b66b ffff88007cd9f340
[  312.387992]  0000000000000000 ffffffff811441c0 ffffffff81c4e300 0000000000000000
[  312.387992] Call Trace:
[  312.387998]  [<ffffffff810b6100>] ? wait_woken+0x90/0x90
[  312.388003]  [<ffffffff811441c0>] ? __oom_reap_task+0x220/0x220
[  312.388005]  [<ffffffff81090a49>] kthread+0xf9/0x110
[  312.388011]  [<ffffffff81708c32>] ret_from_fork+0x22/0x50
[  312.388012]  [<ffffffff81090950>] ? kthread_create_on_node+0x230/0x230
[  312.388025] Code: cb c4 81 0f 84 a0 00 00 00 4c 8b 3d bf 88 b0 00 48 c7 c7 c0 ca c4 81 41 bd 01 00 00 00 49 8b 47 08 49 8b 17 49 8d 9f 30 ee ff ff <48> 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 49 89 07 66 
[  312.388027] RIP  [<ffffffff8114425c>] oom_reaper+0x9c/0x1e0
[  312.388028]  RSP <ffff88007cdf3e00>
[  312.388029] ---[ end trace ea62d9784868759a ]---
[  312.388031] BUG: sleeping function called from invalid context at include/linux/sched.h:2819
[  312.388032] in_atomic(): 1, irqs_disabled(): 0, pid: 33, name: oom_reaper
[  312.388033] INFO: lockdep is turned off.
[  312.388034] CPU: 0 PID: 33 Comm: oom_reaper Tainted: G      D         4.5.0-rc4-next-20160216 #305
[  312.388035] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  312.388036]  0000000000000286 000000000b04b66b ffff88007cdf3be0 ffffffff8139e83d
[  312.388037]  ffff88007cd35900 ffffffff819b931a ffff88007cdf3c08 ffffffff810961eb
[  312.388038]  ffffffff819b931a 0000000000000b03 0000000000000000 ffff88007cdf3c30
[  312.388038] Call Trace:
[  312.388043]  [<ffffffff8139e83d>] dump_stack+0x85/0xc8
[  312.388046]  [<ffffffff810961eb>] ___might_sleep+0x14b/0x240
[  312.388047]  [<ffffffff81096324>] __might_sleep+0x44/0x80
[  312.388050]  [<ffffffff8107f7ee>] exit_signals+0x2e/0x150
[  312.388051]  [<ffffffff81091fa1>] ? blocking_notifier_call_chain+0x11/0x20
[  312.388054]  [<ffffffff81072e32>] do_exit+0xc2/0xb50
[  312.388057]  [<ffffffff8101893c>] oops_end+0x9c/0xd0
[  312.388058]  [<ffffffff81018ba6>] die+0x46/0x60
[  312.388059]  [<ffffffff8101602b>] do_general_protection+0xdb/0x1b0
[  312.388061]  [<ffffffff8170a4f8>] general_protection+0x28/0x30
[  312.388064]  [<ffffffff8114425c>] ? oom_reaper+0x9c/0x1e0
[  312.388066]  [<ffffffff810b6100>] ? wait_woken+0x90/0x90
[  312.388067]  [<ffffffff811441c0>] ? __oom_reap_task+0x220/0x220
[  312.388068]  [<ffffffff81090a49>] kthread+0xf9/0x110
[  312.388071]  [<ffffffff81708c32>] ret_from_fork+0x22/0x50
[  312.388072]  [<ffffffff81090950>] ? kthread_create_on_node+0x230/0x230
[  312.388075] note: oom_reaper[33] exited with preempt_count 1
----------

For the oom_kill_allocating_task = 1 case (despite the name, it still tries to
kill children first), the OOM killer does not wait for the OOM victim to clear
TIF_MEMDIE because select_bad_process() is not called. Therefore, if an OOM
victim fails to terminate because the OOM reaper failed to reap enough memory,
the kernel is flooded with OOM killer messages trying to kill that stuck
victim (with an OOM reaper lockup due to list corruption).

Adding an OOM victim which was already added to oom_reaper_list is wrong.
What should we do here?

(Choice 1) Make sure that a TIF_MEMDIE thread is not chosen as an OOM victim.
           This will avoid list corruption, but choosing other !TIF_MEMDIE
           threads sharing the TIF_MEMDIE thread's mm and adding them to
           oom_reaper_list does not make sense.

(Choice 2) Make sure that any process which includes a TIF_MEMDIE thread is
           not chosen as an OOM victim. But choosing other processes without
           a TIF_MEMDIE thread sharing the TIF_MEMDIE thread's mm and adding
           them to oom_reaper_list does not make sense. A mm should be added
           to oom_reaper_list at most once.

(Choice 3) Make sure that any process which uses a mm which was already added
           to oom_reaper_list is not chosen as an OOM victim. This would mean
           replacing test_tsk_thread_flag(task, TIF_MEMDIE) with
           (mm = task->mm, mm && test_bit(MMF_OOM_VICTIM, &mm->flags))
           in the OOM killer and replacing test_thread_flag(TIF_MEMDIE) with
           (current->mm && test_bit(MMF_OOM_VICTIM, &current->mm->flags) &&
            (fatal_signal_pending(current) || (current->flags & PF_EXITING)))
           in the ALLOC_NO_WATERMARKS check (a sketch of these predicates
           follows after this list).

(Choice 4) Call select_bad_process() for the oom_kill_allocating_task = 1 case.

----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7653055..5e3e2f2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -880,15 +880,6 @@ bool out_of_memory(struct oom_control *oc)
 		oc->nodemask = NULL;
 	check_panic_on_oom(oc, constraint, NULL);
 
-	if (sysctl_oom_kill_allocating_task && current->mm &&
-	    !oom_unkillable_task(current, NULL, oc->nodemask) &&
-	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
-		get_task_struct(current);
-		oom_kill_process(oc, current, 0, totalpages, NULL,
-				 "Out of memory (oom_kill_allocating_task)");
-		return true;
-	}
-
 	p = select_bad_process(oc, &points, totalpages);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p && !is_sysrq_oom(oc)) {
@@ -896,8 +887,16 @@ bool out_of_memory(struct oom_control *oc)
 		panic("Out of memory and no killable processes...\n");
 	}
 	if (p && p != (void *)-1UL) {
-		oom_kill_process(oc, p, points, totalpages, NULL,
-				 "Out of memory");
+		if (sysctl_oom_kill_allocating_task && current->mm &&
+		    !oom_unkillable_task(current, NULL, oc->nodemask) &&
+		    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+			put_task_struct(p);
+			get_task_struct(current);
+			oom_kill_process(oc, current, 0, totalpages, NULL,
+					 "Out of memory (oom_kill_allocating_task)");
+		} else
+			oom_kill_process(oc, p, points, totalpages, NULL,
+					 "Out of memory");
 		/*
 		 * Give the killed process a good chance to exit before trying
 		 * to allocate memory again.
----------

(Choice 5) Any other ideas?
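
For concreteness, Choice 3's predicates could look like this (a sketch
only: MMF_OOM_VICTIM is a proposed mm->flags bit at this point, and the
helper names are made up):

----------
/* Would replace test_tsk_thread_flag(task, TIF_MEMDIE) in the
 * OOM killer's victim scan. */
static bool task_uses_queued_mm(struct task_struct *task)
{
	struct mm_struct *mm = task->mm;

	return mm && test_bit(MMF_OOM_VICTIM, &mm->flags);
}

/* Would replace test_thread_flag(TIF_MEMDIE) in the
 * ALLOC_NO_WATERMARKS check. */
static bool current_is_exiting_oom_victim(void)
{
	return current->mm &&
	       test_bit(MMF_OOM_VICTIM, &current->mm->flags) &&
	       (fatal_signal_pending(current) ||
		(current->flags & PF_EXITING));
}
----------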

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing
@ 2016-02-16 11:11                 ` Tetsuo Handa
  0 siblings, 0 replies; 112+ messages in thread
From: Tetsuo Handa @ 2016-02-16 11:11 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> Unless we are under global OOM then this doesn't matter much because the
> allocation request should succeed at some point in time and memcg
> charges are bypassed for tasks with pending fatal signals. So we can
> make a forward progress.

Hmm, then I wonder how memcg OOM livelock occurs. Anyway, OK for now.

But current OOM reaper forgot a protection for list item "double add" bug.
Precisely speaking, this is not a OOM reaper's bug.

----------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

static int file_writer(void)
{
	static char buffer[4096] = { }; 
	const int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);
	while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer));
	return 0;
}

static int memory_consumer(void)
{
	const int fd = open("/dev/zero", O_RDONLY);
	unsigned long size;
	char *buf = NULL;
	sleep(1);
	unlink("/tmp/file");
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	read(fd, buf, size); /* Will cause OOM due to overcommit */
	return 0;
}

int main(int argc, char *argv[])
{
	int i;
	for (i = 0; i < 1024; i++)
		if (fork() == 0) {
			file_writer();
			_exit(0);
		}
	memory_consumer();
	while (1)
		pause();
}
----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160216.txt.xz .
----------
[  140.758667] a.out invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[  140.760706] a.out cpuset=/ mems_allowed=0
(...snipped...)
[  140.860676] Out of memory (oom_kill_allocating_task): Kill process 10595 (a.out) score 0 or sacrifice child
[  140.864883] Killed process 10596 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[  140.868483] oom_reaper: reaped process 10596 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0lB
(...snipped...)
** 3 printk messages dropped ** [  206.416481] Out of memory (oom_kill_allocating_task): Kill process 10595 (a.out) score 0 or sacrifice child
** 2 printk messages dropped ** [  206.418908] Killed process 10600 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
** 2 printk messages dropped ** [  206.421956] Out of memory (oom_kill_allocating_task): Kill process 10595 (a.out) score 0 or sacrifice child
** 2 printk messages dropped ** [  206.424293] INFO: rcu_sched self-detected stall on CPU
** 3 printk messages dropped ** [  206.424300] oom_reaper      R  running task        0    33      2 0x00000008
[  206.424302]  ffff88007cd35900 000000000b04b66b ffff88007fcc3dd0 ffffffff8109db68
** 1 printk messages dropped ** [  206.424304]  ffffffff810a0bc4 0000000000000003 ffff88007fcc3e18 ffffffff8113e092
** 4 printk messages dropped ** [  206.424316]  [<ffffffff8113e092>] rcu_dump_cpu_stacks+0x73/0x94
** 3 printk messages dropped ** [  206.424322]  [<ffffffff810e2d44>] update_process_times+0x34/0x60
** 2 printk messages dropped ** [  206.424327]  [<ffffffff810f2a50>] ? tick_sched_handle.isra.20+0x40/0x40
** 4 printk messages dropped ** [  206.424334]  [<ffffffff81049598>] smp_apic_timer_interrupt+0x38/0x50
** 3 printk messages dropped ** [  206.424342]  [<ffffffff810d0f9a>] vprintk_default+0x1a/0x20
** 2 printk messages dropped ** [  206.424346]  [<ffffffff81708291>] ? _raw_spin_unlock_irqrestore+0x31/0x60
** 5 printk messages dropped ** [  206.424355]  [<ffffffff81090950>] ? kthread_create_on_node+0x230/0x230
** 2 printk messages dropped ** [  206.431196] Out of memory (oom_kill_allocating_task): Kill process 10595 (a.out) score 0 or sacrifice child
** 3 printk messages dropped ** [  206.434491] Out of memory (oom_kill_allocating_task): Kill process 10595 (a.out) score 0 or sacrifice child
** 1 printk messages dropped ** [  206.436152] Out of memory (oom_kill_allocating_task): Kill process 10595 (a.out) score 0 or sacrifice child
** 3 printk messages dropped ** [  206.439359] Out of memory (oom_kill_allocating_task): Kill process 10595 (a.out) score 0 or sacrifice child
(...snipped...)
[  312.387913] general protection fault: 0000 [#1] SMP 
[  312.387943] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw iptable_filter coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel glue_helper lrw gf128mul ablk_helper cryptd ppdev vmw_balloon pcspkr sg parport_pc shpchp parport i2c_piix4 vmw_vmci ip_tables sd_mod ata_generic pata_acpi crc32c_intel serio_raw mptspi scsi_transport_spi mptscsih vmwgfx mptbase drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci ttm libahci drm ata_piix e1000 libata i2c_core
[  312.387945] CPU: 0 PID: 33 Comm: oom_reaper Not tainted 4.5.0-rc4-next-20160216 #305
[  312.387946] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  312.387947] task: ffff88007cd35900 ti: ffff88007cdf0000 task.ti: ffff88007cdf0000
[  312.387953] RIP: 0010:[<ffffffff8114425c>]  [<ffffffff8114425c>] oom_reaper+0x9c/0x1e0
[  312.387954] RSP: 0018:ffff88007cdf3e00  EFLAGS: 00010287
[  312.387954] RAX: dead000000000200 RBX: ffff8800621bac80 RCX: 0000000000000001
[  312.387955] RDX: dead000000000100 RSI: 0000000000000000 RDI: ffffffff81c4cac0
[  312.387955] RBP: ffff88007cdf3e60 R08: 0000000000000001 R09: 0000000000000000
[  312.387956] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007cd35900
[  312.387956] R13: 0000000000000001 R14: ffff88007cdf3e20 R15: ffff8800621bbe50
[  312.387957] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[  312.387958] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  312.387958] CR2: 00007f6d8eaa5000 CR3: 000000006361e000 CR4: 00000000001406f0
[  312.387988] Stack:
[  312.387990]  ffff88007cd35900 ffffffff00000000 ffff88007cd35900 ffffffff810b6100
[  312.387991]  ffff88007cdf3e20 ffff88007cdf3e20 000000000b04b66b ffff88007cd9f340
[  312.387992]  0000000000000000 ffffffff811441c0 ffffffff81c4e300 0000000000000000
[  312.387992] Call Trace:
[  312.387998]  [<ffffffff810b6100>] ? wait_woken+0x90/0x90
[  312.388003]  [<ffffffff811441c0>] ? __oom_reap_task+0x220/0x220
[  312.388005]  [<ffffffff81090a49>] kthread+0xf9/0x110
[  312.388011]  [<ffffffff81708c32>] ret_from_fork+0x22/0x50
[  312.388012]  [<ffffffff81090950>] ? kthread_create_on_node+0x230/0x230
[  312.388025] Code: cb c4 81 0f 84 a0 00 00 00 4c 8b 3d bf 88 b0 00 48 c7 c7 c0 ca c4 81 41 bd 01 00 00 00 49 8b 47 08 49 8b 17 49 8d 9f 30 ee ff ff <48> 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 49 89 07 66 
[  312.388027] RIP  [<ffffffff8114425c>] oom_reaper+0x9c/0x1e0
[  312.388028]  RSP <ffff88007cdf3e00>
[  312.388029] ---[ end trace ea62d9784868759a ]---
[  312.388031] BUG: sleeping function called from invalid context at include/linux/sched.h:2819
[  312.388032] in_atomic(): 1, irqs_disabled(): 0, pid: 33, name: oom_reaper
[  312.388033] INFO: lockdep is turned off.
[  312.388034] CPU: 0 PID: 33 Comm: oom_reaper Tainted: G      D         4.5.0-rc4-next-20160216 #305
[  312.388035] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  312.388036]  0000000000000286 000000000b04b66b ffff88007cdf3be0 ffffffff8139e83d
[  312.388037]  ffff88007cd35900 ffffffff819b931a ffff88007cdf3c08 ffffffff810961eb
[  312.388038]  ffffffff819b931a 0000000000000b03 0000000000000000 ffff88007cdf3c30
[  312.388038] Call Trace:
[  312.388043]  [<ffffffff8139e83d>] dump_stack+0x85/0xc8
[  312.388046]  [<ffffffff810961eb>] ___might_sleep+0x14b/0x240
[  312.388047]  [<ffffffff81096324>] __might_sleep+0x44/0x80
[  312.388050]  [<ffffffff8107f7ee>] exit_signals+0x2e/0x150
[  312.388051]  [<ffffffff81091fa1>] ? blocking_notifier_call_chain+0x11/0x20
[  312.388054]  [<ffffffff81072e32>] do_exit+0xc2/0xb50
[  312.388057]  [<ffffffff8101893c>] oops_end+0x9c/0xd0
[  312.388058]  [<ffffffff81018ba6>] die+0x46/0x60
[  312.388059]  [<ffffffff8101602b>] do_general_protection+0xdb/0x1b0
[  312.388061]  [<ffffffff8170a4f8>] general_protection+0x28/0x30
[  312.388064]  [<ffffffff8114425c>] ? oom_reaper+0x9c/0x1e0
[  312.388066]  [<ffffffff810b6100>] ? wait_woken+0x90/0x90
[  312.388067]  [<ffffffff811441c0>] ? __oom_reap_task+0x220/0x220
[  312.388068]  [<ffffffff81090a49>] kthread+0xf9/0x110
[  312.388071]  [<ffffffff81708c32>] ret_from_fork+0x22/0x50
[  312.388072]  [<ffffffff81090950>] ? kthread_create_on_node+0x230/0x230
[  312.388075] note: oom_reaper[33] exited with preempt_count 1
----------

For the oom_kill_allocating_task = 1 case (despite the name, it still tries
to kill children first), the OOM killer does not wait for the OOM victim to
clear TIF_MEMDIE because select_bad_process() is not called. Therefore, if an
OOM victim fails to terminate because the OOM reaper could not reap enough
memory, the kernel is flooded with OOM killer messages trying to kill that
stuck victim, and the OOM reaper locks up due to list corruption.
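
To see why the double add is fatal, here is a minimal userspace sketch (not
kernel code; list_add() below merely mimics the kernel's insertion).
Enqueueing the same node twice leaves it linked to itself, so the reaper's
list walk never terminates and dereferences stale pointers, much like in the
oops above:

#include <stdio.h>

/* Minimal model of the kernel's doubly linked list insertion. */
struct list_head { struct list_head *next, *prev; };

static void list_add(struct list_head *new, struct list_head *head)
{
	struct list_head *next = head->next;

	next->prev = new;
	new->next = next;
	new->prev = head;
	head->next = new;
}

int main(void)
{
	struct list_head reaper_list = { &reaper_list, &reaper_list };
	struct list_head victim;

	list_add(&victim, &reaper_list);  /* first wake_oom_reaper(): fine */
	list_add(&victim, &reaper_list);  /* second add: node self-linked */

	/* A walk starting at reaper_list now loops on &victim forever. */
	printf("victim.next == &victim: %s\n",
	       victim.next == &victim ? "yes, corrupted" : "no");
	return 0;
}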

Adding an OOM victim which was already added to oom_reaper_list is wrong.
What should we do here?

(Choice 1) Make sure that a TIF_MEMDIE thread is not chosen as an OOM victim.
           This will avoid the list corruption, but choosing other !TIF_MEMDIE
           threads sharing the TIF_MEMDIE thread's mm and adding them to
           oom_reaper_list does not make sense.

(Choice 2) Make sure that any process which includes a TIF_MEMDIE thread is
           not chosen as an OOM victim. But choosing other processes without
           a TIF_MEMDIE thread that share the TIF_MEMDIE thread's mm and
           adding them to oom_reaper_list does not make sense either. An mm
           should be added to oom_reaper_list at most once.

(Choice 3) Make sure that any process which uses an mm which was added to
           oom_reaper_list is not chosen as an OOM victim. This would mean
           replacing test_tsk_thread_flag(task, TIF_MEMDIE) with
           (mm = task->mm, mm && test_bit(MMF_OOM_VICTIM, &mm->flags))
           in the OOM killer and replacing test_thread_flag(TIF_MEMDIE) with
           (current->mm && test_bit(MMF_OOM_VICTIM, &current->mm->flags) &&
            (fatal_signal_pending(current) || (current->flags & PF_EXITING)))
           in the ALLOC_NO_WATERMARKS check (see the sketch after Choice 5).

(Choice 4) Call select_bad_process() for oom_kill_allocating_task = 1 case.

----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7653055..5e3e2f2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -880,15 +880,6 @@ bool out_of_memory(struct oom_control *oc)
 		oc->nodemask = NULL;
 	check_panic_on_oom(oc, constraint, NULL);
 
-	if (sysctl_oom_kill_allocating_task && current->mm &&
-	    !oom_unkillable_task(current, NULL, oc->nodemask) &&
-	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
-		get_task_struct(current);
-		oom_kill_process(oc, current, 0, totalpages, NULL,
-				 "Out of memory (oom_kill_allocating_task)");
-		return true;
-	}
-
 	p = select_bad_process(oc, &points, totalpages);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p && !is_sysrq_oom(oc)) {
@@ -896,8 +887,16 @@ bool out_of_memory(struct oom_control *oc)
 		panic("Out of memory and no killable processes...\n");
 	}
 	if (p && p != (void *)-1UL) {
-		oom_kill_process(oc, p, points, totalpages, NULL,
-				 "Out of memory");
+		if (sysctl_oom_kill_allocating_task && current->mm &&
+		    !oom_unkillable_task(current, NULL, oc->nodemask) &&
+		    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+			put_task_struct(p);
+			get_task_struct(current);
+			oom_kill_process(oc, current, 0, totalpages, NULL,
+					 "Out of memory (oom_kill_allocating_task)");
+		} else
+			oom_kill_process(oc, p, points, totalpages, NULL,
+					 "Out of memory");
 		/*
 		 * Give the killed process a good chance to exit before trying
 		 * to allocate memory again.
----------

(Choice 5) Any other ideas?
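
To make Choice 3 concrete, here is a hedged userspace model of the two
replacement checks. MMF_OOM_VICTIM is the hypothetical flag named above (not
an existing kernel constant), and the kernel types and helpers are stubbed
out:

#include <stdbool.h>
#include <stdio.h>

#define MMF_OOM_VICTIM	21		/* hypothetical mm->flags bit */
#define PF_EXITING	0x00000004	/* same value the kernel uses */

struct mm_struct { unsigned long flags; };
struct task_struct {
	struct mm_struct *mm;
	unsigned long flags;		/* PF_* flags */
	bool fatal_signal_pending;	/* stand-in for the real helper */
};

/* Would replace test_tsk_thread_flag(task, TIF_MEMDIE) in the OOM killer. */
static bool oom_victim(struct task_struct *task)
{
	struct mm_struct *mm = task->mm;

	return mm && (mm->flags & (1UL << MMF_OOM_VICTIM));
}

/* Would replace the TIF_MEMDIE test in the ALLOC_NO_WATERMARKS check. */
static bool may_use_reserves(struct task_struct *task)
{
	return oom_victim(task) &&
	       (task->fatal_signal_pending || (task->flags & PF_EXITING));
}

int main(void)
{
	struct mm_struct mm = { .flags = 1UL << MMF_OOM_VICTIM };
	struct task_struct t = { .mm = &mm, .flags = 0,
				 .fatal_signal_pending = true };

	printf("oom_victim=%d may_use_reserves=%d\n",
	       oom_victim(&t), may_use_reserves(&t));
	return 0;
}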


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing
  2016-02-16 11:11                 ` Tetsuo Handa
@ 2016-02-16 15:53                   ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-16 15:53 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Tue 16-02-16 20:11:24, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Unless we are under global OOM, this doesn't matter much because the
> > allocation request should succeed at some point in time and memcg
> > charges are bypassed for tasks with pending fatal signals. So we can
> > make forward progress.
> 
> Hmm, then I wonder how memcg OOM livelock occurs. Anyway, OK for now.
> 
> But the current OOM reaper lacks protection against a list item "double add"
> bug. Precisely speaking, this is not an OOM reaper's bug.
[...]
> For the oom_kill_allocating_task = 1 case (despite the name, it still tries
> to kill children first),

Yes, this is long-standing behavior, and I cannot say I am happy about it
because it clearly breaks the defined semantics.

> the OOM killer does not wait for the OOM victim to
> clear TIF_MEMDIE because select_bad_process() is not called. Therefore, if an
> OOM victim fails to terminate because the OOM reaper could not reap enough
> memory, the kernel is flooded with OOM killer messages trying to kill that
> stuck victim, and the OOM reaper locks up due to list corruption.

Hmmm, I didn't consider this possibility. For now I would simply disable
the oom_reaper for sysctl_oom_kill_allocating_task. oom_kill_allocating_task
needs some more changes IMO: a) we shouldn't kill children as a
heuristic, and b) we should go and panic if the current task is already
TIF_MEMDIE, because that means we cannot do anything about the OOM. But I
think this should be handled separately.

Would the following be acceptable for now?
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7e9953a64489..357cee067950 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -678,7 +678,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
+	bool can_oom_reap;
+
+	/*
+	 * oom_kill_allocating_task doesn't follow normal OOM exclusion
+	 * and so the same task might enter oom_kill_process which oom_reaper
+	 * cannot handle currently.
+	 */
+	can_oom_reap = !sysctl_oom_kill_allocating_task;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-02-03 13:14   ` Michal Hocko
@ 2016-02-17  9:48     ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-17  9:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

Hi Andrew,
although this can be folded into patch 5
(mm-oom_reaper-implement-oom-victims-queuing.patch) I think it would be
better to have it separate and revert after we sort out the proper
oom_kill_allocating_task behavior or handle exclusion at oom_reaper
level.

Thanks!
---
From 7d8c953994f97fb38a8d71b53c06ecf8418616e9 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 17 Feb 2016 10:40:41 +0100
Subject: [PATCH] oom, oom_reaper: disable oom_reaper for
 oom_kill_allocating_task

Tetsuo has reported that oom_kill_allocating_task=1 will cause
oom_reaper_list corruption because oom_kill_process doesn't follow the
standard OOM exclusion (aka ignores TIF_MEMDIE) and allows enqueuing
the same task multiple times - e.g. by sacrificing the same child
multiple times. Let's work around this issue for now until we decide
how to handle oom_kill_allocating_task properly (should it sacrifice
children at all?) or come up with some other protection.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7e9953a64489..078e07ec0906 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -678,7 +678,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
+	bool can_oom_reap;
+
+	/*
+	 * XXX: oom_kill_allocating_task doesn't follow normal OOM exclusion
+	 * and so the same task might enter oom_kill_process which oom_reaper
+	 * cannot handle currently.
+	 */
+	can_oom_reap = !sysctl_oom_kill_allocating_task;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
-- 
2.7.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-02-17  9:48     ` Michal Hocko
@ 2016-02-17 10:41       ` Tetsuo Handa
  -1 siblings, 0 replies; 112+ messages in thread
From: Tetsuo Handa @ 2016-02-17 10:41 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: rientjes, mgorman, oleg, torvalds, hughd, andrea, riel, linux-mm,
	linux-kernel

Michal Hocko wrote:
> Hi Andrew,
> although this can be folded into patch 5
> (mm-oom_reaper-implement-oom-victims-queuing.patch) I think it would be
> better to have it separate and revert after we sort out the proper
> oom_kill_allocating_task behavior or handle exclusion at oom_reaper
> level.

What a rough workaround. sysctl_oom_kill_allocating_task == 1 does not
always mean we must skip the OOM reaper, because OOM-unkillable callers
take the sysctl_oom_kill_allocating_task == 0 path.

I've just posted a patchset which allows you to merge the OOM reaper
without correcting problems found in "[PATCH 3/5] oom: clear TIF_MEMDIE
after oom_reaper managed to unmap the address space" and "[PATCH 5/5]
mm, oom_reaper: implement OOM victims queuing".

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-02-17 10:41       ` Tetsuo Handa
@ 2016-02-17 11:33         ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-17 11:33 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Wed 17-02-16 19:41:53, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Hi Andrew,
> > although this can be folded into patch 5
> > (mm-oom_reaper-implement-oom-victims-queuing.patch) I think it would be
> > better to have it separate and revert after we sort out the proper
> > oom_kill_allocating_task behavior or handle exclusion at oom_reaper
> > level.
> 
> What a rough workaround. sysctl_oom_kill_allocating_task == 1 does not
> always mean we must skip the OOM reaper, because OOM-unkillable callers
> take the sysctl_oom_kill_allocating_task == 0 path.

Yes, it is indeed rough, but it also shouldn't add new issues. I consider
oom_kill_allocating_task a borderline case which can be sorted out later.
So while I do not like workarounds like this in general, I would rather
go with obvious code first before moving to more complex solutions.

> I've just posted a patchset which allows you to merge the OOM reaper
> without correcting problems found in "[PATCH 3/5] oom: clear TIF_MEMDIE
> after oom_reaper managed to unmap the address space" and "[PATCH 5/5]
> mm, oom_reaper: implement OOM victims queuing".

I will try to look at your patches, but the series seems unnecessarily
heavy to be a prerequisite for the oom_reaper.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-02-17  9:48     ` Michal Hocko
@ 2016-02-19 18:34       ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-19 18:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Wed 17-02-16 10:48:55, Michal Hocko wrote:
> Hi Andrew,
> although this can be folded into patch 5
> (mm-oom_reaper-implement-oom-victims-queuing.patch) I think it would be
> better to have it separate and revert after we sort out the proper
> oom_kill_allocating_task behavior or handle exclusion at oom_reaper
> level.

An alternative would be something like the following. It is definitely
less hackish, but it steals one bit in mm->flags. We do not seem to be
short of bits there now, but who knows. Does this sound better? Later
changes might even consider the flag during victim selection and ignore
tasks which already have it set, but I haven't thought that through
enough to form a patch yet.
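
The key ingredient of the patch below is the test_and_set_bit() idiom: the
first oom_kill_process() for a given mm atomically wins the right to queue it
for reaping, and every later caller sees the bit already set. A minimal
userspace sketch of that idiom, with __atomic_fetch_or() standing in for the
kernel's bitop:

#include <stdbool.h>
#include <stdio.h>

#define MMF_OOM_KILLED 21	/* mirrors the new mm->flags bit */

struct mm_struct { unsigned long flags; };

/* Models can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags). */
static bool claim_mm_for_reaping(struct mm_struct *mm)
{
	unsigned long bit = 1UL << MMF_OOM_KILLED;
	unsigned long old = __atomic_fetch_or(&mm->flags, bit,
					      __ATOMIC_RELAXED);

	/* True only for the caller that flipped the bit from 0 to 1. */
	return !(old & bit);
}

int main(void)
{
	struct mm_struct mm = { 0 };

	printf("first kill:  can_oom_reap=%d\n", claim_mm_for_reaping(&mm));
	printf("second kill: can_oom_reap=%d\n", claim_mm_for_reaping(&mm));
	return 0;
}
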
---
From 8b17e66a70edac65ecd6df411a675cf3d840a9fe Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 17 Feb 2016 10:40:41 +0100
Subject: [PATCH] oom, oom_reaper: disable oom_reaper for
 oom_kill_allocating_task

Tetsuo has reported that oom_kill_allocating_task=1 will cause
oom_reaper_list corruption because oom_kill_process doesn't follow the
standard OOM exclusion (aka ignores TIF_MEMDIE) and allows enqueuing
the same task multiple times - e.g. by sacrificing the same child
multiple times.

This patch fixes the issue by introducing a new MMF_OOM_KILLED mm flag
which is set in oom_kill_process atomically and oom reaper is disabled
if the flag was already set.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h | 2 ++
 mm/oom_kill.c         | 6 +++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c25996c336de..0552cd5696c2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -509,6 +509,8 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 
+#define MMF_OOM_KILLED		21	/* OOM killer has chosen this mm */
+
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
 struct sighand_struct {
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7e9953a64489..32ce05b1aa10 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -678,7 +678,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
+	bool can_oom_reap;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -740,6 +740,10 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	/* Get a reference to safely compare mm after task_unlock(victim) */
 	mm = victim->mm;
 	atomic_inc(&mm->mm_count);
+
+	/* Make sure we do not try to oom reap the mm multiple times */
+	can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags);
+
 	/*
 	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
 	 * the OOM victim from depleting the memory reserves from the user
-- 
2.7.0


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-02-19 18:34       ` Michal Hocko
@ 2016-02-20  2:32         ` Tetsuo Handa
  -1 siblings, 0 replies; 112+ messages in thread
From: Tetsuo Handa @ 2016-02-20  2:32 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: rientjes, mgorman, oleg, torvalds, hughd, andrea, riel, linux-mm,
	linux-kernel

Michal Hocko wrote:
> On Wed 17-02-16 10:48:55, Michal Hocko wrote:
> > Hi Andrew,
> > although this can be folded into patch 5
> > (mm-oom_reaper-implement-oom-victims-queuing.patch) I think it would be
> > better to have it separate and revert after we sort out the proper
> > oom_kill_allocating_task behavior or handle exclusion at oom_reaper
> > level.
> 
> An alternative would be something like the following. It is definitely
> less hackish, but it steals one bit in mm->flags. We do not seem to be
> short of bits there now, but who knows. Does this sound better? Later
> changes might even consider the flag during victim selection and ignore
> tasks which already have it set, but I haven't thought that through
> enough to form a patch yet.

This sounds better than "can_oom_reap = !sysctl_oom_kill_allocating_task;".

> @@ -740,6 +740,10 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	/* Get a reference to safely compare mm after task_unlock(victim) */
>  	mm = victim->mm;
>  	atomic_inc(&mm->mm_count);
> +
> +	/* Make sure we do not try to oom reap the mm multiple times */
> +	can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags);
> +
>  	/*
>  	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
>  	 * the OOM victim from depleting the memory reserves from the user

But as of this line we don't know whether this mm is reapable.

Shouldn't this be done like

  static void wake_oom_reaper(struct task_struct *tsk)
  {
          /* Make sure we do not try to oom reap the mm multiple times */
          if (!oom_reaper_th || test_and_set_bit(MMF_OOM_KILLED, &mm->flags))
                  return;

          get_task_struct(tsk);

          spin_lock(&oom_reaper_lock);
          list_add(&tsk->oom_reaper_list, &oom_reaper_list);
          spin_unlock(&oom_reaper_lock);
          wake_up(&oom_reaper_wait);
  }

?

Moreover, why not do it like

  struct mm_struct {
  	(...snipped...)
  	struct list_head oom_reaper_list;
  	(...snipped...)
  }

rather than

  struct task_struct {
  	(...snipped...)
  	struct list_head oom_reaper_list;
  	(...snipped...)
  }

so that we can update the ->oom_score_adj of all tasks using this mm_struct
for handling the crazy combo ( http://lkml.kernel.org/r/20160204163113.GF14425@dhcp22.suse.cz )?

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-02-20  2:32         ` Tetsuo Handa
@ 2016-02-22  9:41           ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-22  9:41 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Sat 20-02-16 11:32:07, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 17-02-16 10:48:55, Michal Hocko wrote:
> > > Hi Andrew,
> > > although this can be folded into patch 5
> > > (mm-oom_reaper-implement-oom-victims-queuing.patch) I think it would be
> > > better to have it separate and revert after we sort out the proper
> > > oom_kill_allocating_task behavior or handle exclusion at oom_reaper
> > > level.
> > 
> > An alternative would be something like the following. It is definitely
> > less hackish but it steals one bit in mm->flags. We do not seem to be
> > in shortage there now but who knows. Does this sound better? Later
> > changes might even consider the flag for the victim selection and ignore
> > those which already have the flag set. But I didn't think about it more
> > to form a patch yet.
> 
> This sounds better than "can_oom_reap = !sysctl_oom_kill_allocating_task;".
> 
> > @@ -740,6 +740,10 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  	/* Get a reference to safely compare mm after task_unlock(victim) */
> >  	mm = victim->mm;
> >  	atomic_inc(&mm->mm_count);
> > +
> > +	/* Make sure we do not try to oom reap the mm multiple times */
> > +	can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags);
> > +
> >  	/*
> >  	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
> >  	 * the OOM victim from depleting the memory reserves from the user
> 
> But as of this line we don't know whether this mm is reapable.

Which is not really important. We know that it is eligible only if the
mm wasn't a part of the OOM kill before. Later checks are, of course,
allowed to veto the default and disable the oom reaper.

> Shouldn't this be done like
> 
>   static void wake_oom_reaper(struct task_struct *tsk)
>   {
>           /* Make sure we do not try to oom reap the mm multiple times */
>           if (!oom_reaper_th || test_and_set_bit(MMF_OOM_KILLED, &mm->flags))
>                   return;

We do not have the mm here. We have a task and would need the task_lock.
I find it much easier to evaluate the mm while we still have it, when we
know the task holding this mm will receive SIGKILL and TIF_MEMDIE.
 
>           get_task_struct(tsk);
> 
>           spin_lock(&oom_reaper_lock);
>           list_add(&tsk->oom_reaper_list, &oom_reaper_list);
>           spin_unlock(&oom_reaper_lock);
>           wake_up(&oom_reaper_wait);
>   }
> 
> ?
> 
> Moreover, why not do it like
> 
>   struct mm_struct {
>   	(...snipped...)
>   	struct list_head oom_reaper_list;
>   	(...snipped...)
>   }

Because we would need to search all tasks sharing the same mm in order
to call exit_oom_victim for each of them.

> rather than
> 
>   struct task_struct {
>   	(...snipped...)
>   	struct list_head oom_reaper_list;
>   	(...snipped...)
>   }
> 
> so that we can update the ->oom_score_adj of all tasks using this mm_struct
> for handling the crazy combo ( http://lkml.kernel.org/r/20160204163113.GF14425@dhcp22.suse.cz )?

I find it much easier to simply skip over tasks with MMF_OOM_KILLED
already set when selecting a victim. We won't need oom_score_adj games at
all. This needs a deeper evaluation though. I didn't get to it yet,
but the point of having an MMF flag which is not oom_reaper specific
was to have it reusable in other contexts as well.
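
For illustration, a hedged userspace sketch of that skip-on-selection idea;
the real select_bad_process() walks the tasklist and computes oom_badness(),
both of which are reduced to a plain array walk here:

#include <stddef.h>
#include <stdio.h>

#define MMF_OOM_KILLED 21

struct mm_struct { unsigned long flags; };
struct task_struct { const char *comm; struct mm_struct *mm; long badness; };

static struct task_struct *pick_victim(struct task_struct *tasks, size_t n)
{
	struct task_struct *chosen = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		struct task_struct *p = &tasks[i];

		/* mm already handed to the reaper: not a useful victim */
		if (p->mm && (p->mm->flags & (1UL << MMF_OOM_KILLED)))
			continue;
		if (!chosen || p->badness > chosen->badness)
			chosen = p;
	}
	return chosen;
}

int main(void)
{
	struct mm_struct mm1 = { 1UL << MMF_OOM_KILLED }, mm2 = { 0 };
	struct task_struct tasks[] = {
		{ "hog-being-reaped", &mm1, 1000 },
		{ "next-biggest",     &mm2,  500 },
	};
	struct task_struct *victim = pick_victim(tasks, 2);

	printf("victim: %s\n", victim ? victim->comm : "none");
	return 0;
}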

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 2/5] oom reaper: handle mlocked pages
  2016-02-03 13:13   ` Michal Hocko
@ 2016-02-23  1:36     ` David Rientjes
  -1 siblings, 0 replies; 112+ messages in thread
From: David Rientjes @ 2016-02-23  1:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

On Wed, 3 Feb 2016, Michal Hocko wrote:

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9a0e4e5f50b4..840e03986497 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -443,13 +443,6 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
>  			continue;
>  
>  		/*
> -		 * mlocked VMAs require explicit munlocking before unmap.
> -		 * Let's keep it simple here and skip such VMAs.
> -		 */
> -		if (vma->vm_flags & VM_LOCKED)
> -			continue;
> -
> -		/*
>  		 * Only anonymous pages have a good chance to be dropped
>  		 * without additional steps which we cannot afford as we
>  		 * are OOM already.
> @@ -459,9 +452,12 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
>  		 * we do not want to block exit_mmap by keeping mm ref
>  		 * count elevated without a good reason.
>  		 */
> -		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> +		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
> +			if (vma->vm_flags & VM_LOCKED)
> +				munlock_vma_pages_all(vma);
>  			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
>  					 &details);
> +		}
>  	}
>  	tlb_finish_mmu(&tlb, 0, -1);
>  	up_read(&mm->mmap_sem);

Are we concerned about munlock_vma_pages_all() taking lock_page() and 
perhaps stalling forever, the same way it would stall in exit_mmap() for 
VM_LOCKED vmas, if another thread has locked the same page and is doing an 
allocation?  I'm wondering if in that case it would be better to do a 
best-effort munlock_vma_pages_all() with trylock_page() and just give up 
on releasing memory from that particular vma.  In that case, there may be 
other memory that can be freed with unmap_page_range() that would handle 
this livelock.
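
The best-effort pattern suggested above is simple to state in miniature: try
the lock, and on contention skip just that unit of work instead of sleeping.
A hedged userspace sketch with pthread_mutex_trylock() standing in for
trylock_page():

#include <errno.h>
#include <pthread.h>
#include <stdio.h>

struct fake_vma { pthread_mutex_t page_lock; const char *name; };

static void reap_one(struct fake_vma *vma)
{
	if (pthread_mutex_trylock(&vma->page_lock) == EBUSY) {
		printf("skipping %s: page locked elsewhere\n", vma->name);
		return;			/* give up on this vma only */
	}
	printf("reaped %s\n", vma->name);
	pthread_mutex_unlock(&vma->page_lock);
}

int main(void)
{
	struct fake_vma a = { PTHREAD_MUTEX_INITIALIZER, "vma A" };
	struct fake_vma b = { PTHREAD_MUTEX_INITIALIZER, "vma B" };

	pthread_mutex_lock(&b.page_lock);	/* simulate a contended page */
	reap_one(&a);				/* still freed despite B */
	reap_one(&b);				/* skipped, no stall */
	pthread_mutex_unlock(&b.page_lock);
	return 0;
}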

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 2/5] oom reaper: handle mlocked pages
  2016-02-23  1:36     ` David Rientjes
@ 2016-02-23 13:21       ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-23 13:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Mon 22-02-16 17:36:07, David Rientjes wrote:
> On Wed, 3 Feb 2016, Michal Hocko wrote:
> 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 9a0e4e5f50b4..840e03986497 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -443,13 +443,6 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
> >  			continue;
> >  
> >  		/*
> > -		 * mlocked VMAs require explicit munlocking before unmap.
> > -		 * Let's keep it simple here and skip such VMAs.
> > -		 */
> > -		if (vma->vm_flags & VM_LOCKED)
> > -			continue;
> > -
> > -		/*
> >  		 * Only anonymous pages have a good chance to be dropped
> >  		 * without additional steps which we cannot afford as we
> >  		 * are OOM already.
> > @@ -459,9 +452,12 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
> >  		 * we do not want to block exit_mmap by keeping mm ref
> >  		 * count elevated without a good reason.
> >  		 */
> > -		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> > +		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
> > +			if (vma->vm_flags & VM_LOCKED)
> > +				munlock_vma_pages_all(vma);
> >  			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
> >  					 &details);
> > +		}
> >  	}
> >  	tlb_finish_mmu(&tlb, 0, -1);
> >  	up_read(&mm->mmap_sem);
> 
> Are we concerned about munlock_vma_pages_all() taking lock_page() and 
> perhaps stalling forever, the same way it would stall in exit_mmap() for 
> VM_LOCKED vmas, if another thread has locked the same page and is doing an 
> allocation?

This is a good question. I have checked that particular case previously
and managed to convince myself that this is OK(ish).
munlock_vma_pages_range locks only THP pages, to prevent the parallel
split-up AFAICS. And split_huge_page_to_list doesn't seem to depend on an
allocation. It can block on the anon_vma lock, but I didn't see any
allocation requests from there either. I might be missing something, of
course. Do you have any specific path in mind?

> I'm wondering if in that case it would be better to do a 
> best-effort munlock_vma_pages_all() with trylock_page() and just give up 
> on releasing memory from that particular vma.  In that case, there may be 
> other memory that can be freed with unmap_page_range() that would handle 
> this livelock.

I have tried to code it up, but I am not sure the whole churn is really
worth it - unless I am missing something that would make the THP case
likely to hit in real life.

Just for reference, this is what I came up with (compile tested only).
---
diff --git a/mm/internal.h b/mm/internal.h
index cac6eb458727..63dcdd60aca8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -249,11 +249,13 @@ void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
 #ifdef CONFIG_MMU
 extern long populate_vma_page_range(struct vm_area_struct *vma,
 		unsigned long start, unsigned long end, int *nonblocking);
-extern void munlock_vma_pages_range(struct vm_area_struct *vma,
-			unsigned long start, unsigned long end);
-static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
+
+/* Can fail only if enforce == false */
+extern int munlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end, bool enforce);
+static inline int munlock_vma_pages_all(struct vm_area_struct *vma, bool enforce)
 {
-	munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end);
+	return munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end, enforce);
 }
 
 /*
diff --git a/mm/mlock.c b/mm/mlock.c
index 96f001041928..934c0f8f8ebc 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -431,8 +431,9 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
  * and re-mlocked by try_to_{munlock|unmap} before we unmap and
  * free them.  This will result in freeing mlocked pages.
  */
-void munlock_vma_pages_range(struct vm_area_struct *vma,
-			     unsigned long start, unsigned long end)
+int munlock_vma_pages_range(struct vm_area_struct *vma,
+			     unsigned long start, unsigned long end,
+			     bool enforce)
 {
 	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
 
@@ -460,7 +461,13 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 				VM_BUG_ON_PAGE(PageMlocked(page), page);
 				put_page(page); /* follow_page_mask() */
 			} else if (PageTransHuge(page)) {
-				lock_page(page);
+				if (enforce) {
+					lock_page(page);
+				} else if (!trylock_page(page)) {
+					put_page(page);
+					return -EAGAIN;
+				}
+
 				/*
 				 * Any THP page found by follow_page_mask() may
 				 * have gotten split before reaching
@@ -497,6 +504,8 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 next:
 		cond_resched();
 	}
+
+	return 0;
 }
 
 /*
@@ -561,7 +570,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	if (lock)
 		vma->vm_flags = newflags;
 	else
-		munlock_vma_pages_range(vma, start, end);
+		munlock_vma_pages_range(vma, start, end, true);
 
 out:
 	*prev = vma;
diff --git a/mm/mmap.c b/mm/mmap.c
index cfc0cdca421e..7c2ed6e7b415 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2592,7 +2592,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 		while (tmp && tmp->vm_start < end) {
 			if (tmp->vm_flags & VM_LOCKED) {
 				mm->locked_vm -= vma_pages(tmp);
-				munlock_vma_pages_all(tmp);
+				munlock_vma_pages_all(tmp, true);
 			}
 			tmp = tmp->vm_next;
 		}
@@ -2683,7 +2683,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 	if (vma->vm_flags & VM_LOCKED) {
 		flags |= MAP_LOCKED;
 		/* drop PG_Mlocked flag for over-mapped range */
-		munlock_vma_pages_range(vma, start, start + size);
+		munlock_vma_pages_range(vma, start, start + size, true);
 	}
 
 	file = get_file(vma->vm_file);
@@ -2825,7 +2825,7 @@ void exit_mmap(struct mm_struct *mm)
 		vma = mm->mmap;
 		while (vma) {
 			if (vma->vm_flags & VM_LOCKED)
-				munlock_vma_pages_all(vma);
+				munlock_vma_pages_all(vma, true);
 			vma = vma->vm_next;
 		}
 	}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 32ce05b1aa10..09e6f3211f1c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -473,7 +473,8 @@ static bool __oom_reap_task(struct task_struct *tsk)
 		 */
 		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
 			if (vma->vm_flags & VM_LOCKED)
-				munlock_vma_pages_all(vma);
+				if (munlock_vma_pages_all(vma, false))
+					continue;
 			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
 					 &details);
 		}

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-03 13:13   ` Michal Hocko
@ 2016-02-25 11:28   ` Tetsuo Handa
  2016-02-25 11:31     ` Tetsuo Handa
  2016-02-25 14:16     ` Michal Hocko
  -1 siblings, 2 replies; 112+ messages in thread
From: Tetsuo Handa @ 2016-02-25 11:28 UTC (permalink / raw)
  To: mhocko; +Cc: rientjes, hannes, linux-mm

Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> When oom_reaper manages to unmap all the eligible vmas there shouldn't
> be much of the freeable memory held by the oom victim left anymore so it
> makes sense to clear the TIF_MEMDIE flag for the victim and allow the
> OOM killer to select another task.
> 
> The lack of TIF_MEMDIE also means that the victim cannot access memory
> reserves anymore but that shouldn't be a problem because it would get
> the access again if it needs to allocate and hits the OOM killer again
> due to the fatal_signal_pending resp. PF_EXITING check. We can safely
> hide the task from the OOM killer because it is clearly not a good
> candidate anymore as everyhing reclaimable has been torn down already.
> 
> This patch will allow capping the time an OOM victim can keep TIF_MEMDIE
> and thus hold off further global OOM killer actions, granted the oom
> reaper is able to take mmap_sem for the associated mm struct. This is
> not guaranteed now, but further steps should make blocking on mmap_sem
> for write killable, which will help to reduce such
> lock contention. This is not done by this patch.

Did you decide what to do if the OOM reaper is unable to take mmap_sem
for the associated mm struct? I can not only disable the OOM reaper but
also block further global OOM killer actions at the same time, because
the OOM reaper does not clear TIF_MEMDIE when it fails to reap. I think
we need to take action regardless of whether the OOM reaper was able to
reap the associated mm struct.
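
One way to take such action (an illustrative sketch only, not part of
the posted series; the MAX_OOM_REAP_RETRIES bound and the unconditional
exit_oom_victim() call are hypothetical) would be to retry for a while
and then give up, instead of leaving TIF_MEMDIE set forever:

----------
#define MAX_OOM_REAP_RETRIES 10		/* hypothetical bound */

static void oom_reap_task(struct task_struct *tsk)
{
	int attempts = 0;

	/* __oom_reap_task() fails when mmap_sem cannot be taken */
	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
		msleep(100);

	if (attempts > MAX_OOM_REAP_RETRIES)
		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
			task_pid_nr(tsk), tsk->comm);

	/* clear TIF_MEMDIE even on failure so the OOM killer can move on */
	exit_oom_victim(tsk);
}
----------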

The reproducer program and kernel patch are shown at
http://lkml.kernel.org/r/201602252025.ICH57371.FMOtOOFJFQHVSL@I-love.SAKURA.ne.jp .

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160225.txt.xz .
---------- console log ----------
[   59.707294] Killed process 2020 (a.out) total-vm:8272kB, anon-rss:4160kB, file-rss:0kB, shmem-rss:0kB
[   59.714851] oom_reaper: reaped process 2020 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[   59.739672] Out of memory: Kill process 1562 (a.out) score 1003 or sacrifice child
[   59.746018] Killed process 1562 (a.out) total-vm:8268kB, anon-rss:4024kB, file-rss:0kB, shmem-rss:0kB
[   60.003576] MemAlloc-Info: stalling=880 dying=42 exiting=0 victim=1 oom_count=113/49
(...snipped...)
[   60.757062] oom_reaper: unable to reap pid:1562 (a.out)
(...snipped...)
[   60.758417] 1 lock held by a.out/2269:
[   60.758417]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8117e6aa>] SyS_mremap+0xaa/0x500
[   60.759217] 1 lock held by a.out/2371:
[   60.759217]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8117e6aa>] SyS_mremap+0xaa/0x500
[   60.759732] 1 lock held by a.out/2600:
[   60.759732]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8117e6aa>] SyS_mremap+0xaa/0x500
[   60.760394] 1 lock held by a.out/2673:
[   60.760394]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8117e6aa>] SyS_mremap+0xaa/0x500
[   60.760528] 1 lock held by a.out/2689:
[   60.760528]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8117e6aa>] SyS_mremap+0xaa/0x500
[   60.760803] 1 lock held by a.out/2723:
[   60.760803]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8117e6aa>] SyS_mremap+0xaa/0x500
[   60.760884] 1 lock held by a.out/2731:
[   60.760884]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8117e6aa>] SyS_mremap+0xaa/0x500
[   60.761381] 1 lock held by a.out/2781:
[   60.761382]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8117e6aa>] SyS_mremap+0xaa/0x500
[   60.761466] 1 lock held by a.out/2791:
[   60.761466]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8117e6aa>] SyS_mremap+0xaa/0x500
[   60.761615] 1 lock held by a.out/2803:
[   60.761615]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8117e6aa>] SyS_mremap+0xaa/0x500
(...snipped...)
[  222.428230] MemAlloc-Info: stalling=876 dying=36 exiting=0 victim=1 oom_count=72229/1744
[  232.434607] MemAlloc-Info: stalling=1082 dying=36 exiting=0 victim=1 oom_count=75777/1855
(...snipped...)
[  232.711169] MemAlloc: kswapd0(49) flags=0xa60840 switches=59 uninterruptible
[  232.716820] kswapd0         D ffff880039f935d0     0    49      2 0x00000000
[  232.722494]  ffff880039f935d0 ffff8800366500c0 ffff880039f8c040 ffff880039f94000
[  232.728190]  ffff8800367eac80 ffff8800367eac98 0000000000000000 ffff880039f93820
[  232.733785]  ffff880039f935e8 ffffffff81672d1a ffff880039f8c040 ffff880039f93650
[  232.739694] Call Trace:
[  232.741797]  [<ffffffff81672d1a>] schedule+0x3a/0x90
[  232.745674]  [<ffffffff81676b16>] rwsem_down_read_failed+0xd6/0x140
[  232.750368]  [<ffffffff8132c324>] call_rwsem_down_read_failed+0x14/0x30
[  232.755333]  [<ffffffff8167645d>] ? down_read+0x3d/0x50
[  232.759397]  [<ffffffffa026286b>] ? xfs_log_commit_cil+0x5b/0x490 [xfs]
[  232.764628]  [<ffffffffa026286b>] xfs_log_commit_cil+0x5b/0x490 [xfs]
[  232.769470]  [<ffffffffa025d1e3>] __xfs_trans_commit+0x123/0x230 [xfs]
[  232.774341]  [<ffffffffa025d57b>] xfs_trans_commit+0xb/0x10 [xfs]
[  232.778929]  [<ffffffffa024e394>] xfs_iomap_write_allocate+0x194/0x380 [xfs]
[  232.784237]  [<ffffffffa023b3ed>] xfs_map_blocks+0x13d/0x150 [xfs]
[  232.788860]  [<ffffffffa023c2ab>] xfs_do_writepage+0x15b/0x520 [xfs]
[  232.793646]  [<ffffffffa023c6a6>] xfs_vm_writepage+0x36/0x70 [xfs]
[  232.798328]  [<ffffffff8115773f>] pageout.isra.43+0x18f/0x240
[  232.802677]  [<ffffffff811590a3>] shrink_page_list+0x803/0xae0
[  232.807247]  [<ffffffff81159ae7>] shrink_inactive_list+0x207/0x550
[  232.811912]  [<ffffffff8115a7d6>] shrink_zone_memcg+0x5b6/0x780
[  232.816585]  [<ffffffff811b38ed>] ? mem_cgroup_iter+0x15d/0x7c0
[  232.820923]  [<ffffffff8115aa74>] shrink_zone+0xd4/0x2f0
[  232.824963]  [<ffffffff8115b8fe>] kswapd+0x41e/0x800
[  232.828916]  [<ffffffff8115b4e0>] ? mem_cgroup_shrink_node_zone+0xb0/0xb0
[  232.833835]  [<ffffffff810923ee>] kthread+0xee/0x110
[  232.837628]  [<ffffffff81678572>] ret_from_fork+0x22/0x50
[  232.841640]  [<ffffffff81092300>] ? kthread_create_on_node+0x230/0x230
(...snipped...)
[  247.278695] MemAlloc: a.out(2269) flags=0x400040 switches=2115 seq=2 gfp=0x26084c0(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO|__GFP_NOTRACK) order=0 delay=181584 uninterruptible
[  247.289757] a.out           D ffff880027aafa48     0  2269   1229 0x00000080
[  247.295134]  ffff880027aafa48 ffff880027a38040 ffff880027aa8100 ffff880027ab0000
[  247.301091]  ffff880027aafa80 ffff88003d610240 00000000ffff2f67 000000000000004c
[  247.306879]  ffff880027aafa60 ffffffff81672d1a ffff88003d610240 ffff880027aafb08
[  247.312590] Call Trace:
[  247.314749]  [<ffffffff81672d1a>] schedule+0x3a/0x90
[  247.319758]  [<ffffffff8167717e>] schedule_timeout+0x11e/0x1c0
[  247.324427]  [<ffffffff810bd056>] ? mark_held_locks+0x66/0x90
[  247.329085]  [<ffffffff810e1270>] ? init_timer_key+0x40/0x40
[  247.333385]  [<ffffffff810e8197>] ? ktime_get+0xa7/0x130
[  247.337469]  [<ffffffff816720c1>] io_schedule_timeout+0xa1/0x110
[  247.342046]  [<ffffffff811650ed>] congestion_wait+0x7d/0xd0
[  247.346570]  [<ffffffff810b73e0>] ? wait_woken+0x80/0x80
[  247.350714]  [<ffffffff8114e584>] __alloc_pages_nodemask+0xd74/0xed0
[  247.355918]  [<ffffffff81198026>] alloc_pages_current+0x96/0x1b0
[  247.361015]  [<ffffffff81062162>] pte_alloc_one+0x12/0x60
[  247.365132]  [<ffffffff811740e9>] __pte_alloc+0x19/0x110
[  247.369162]  [<ffffffff8117e20a>] move_page_tables+0x5da/0x700
[  247.373560]  [<ffffffff8117b00d>] ? copy_vma+0x20d/0x260
[  247.377647]  [<ffffffff8117e41b>] move_vma+0xeb/0x2d0
[  247.381577]  [<ffffffff8117eac8>] SyS_mremap+0x4c8/0x500
[  247.385727]  [<ffffffff8100365d>] do_syscall_64+0x5d/0x180
[  247.389869]  [<ffffffff816783ff>] entry_SYSCALL64_slow_path+0x25/0x25
(...snipped...)
[  275.592260] MemAlloc: a.out(1562) flags=0x400040 switches=17 uninterruptible dying victim
[  275.598299] a.out           D ffff88002e3c7d68     0  1562   1229 0x00100084
[  275.603752]  ffff88002e3c7d68 ffffffff818f9500 ffff88002e3c00c0 ffff88002e3c8000
[  275.609544]  ffff88003d1f74c8 0000000000000246 ffff88002e3c00c0 00000000ffffffff
[  275.615527]  ffff88002e3c7d80 ffffffff81672d1a ffff88003d1f74c0 ffff88002e3c7d90
[  275.621526] Call Trace:
[  275.623707]  [<ffffffff81672d1a>] schedule+0x3a/0x90
[  275.627629]  [<ffffffff81672fa3>] schedule_preempt_disabled+0x13/0x20
[  275.632511]  [<ffffffff81674d3e>] mutex_lock_nested+0x16e/0x430
[  275.636989]  [<ffffffff811d37c3>] ? lock_rename+0xd3/0x100
[  275.641192]  [<ffffffff811d37c3>] lock_rename+0xd3/0x100
[  275.645337]  [<ffffffff811d774d>] SyS_renameat2+0x1ed/0x530
[  275.649788]  [<ffffffff811d7ab9>] SyS_rename+0x19/0x20
[  275.653731]  [<ffffffff8100365d>] do_syscall_64+0x5d/0x180
[  275.657938]  [<ffffffff816783ff>] entry_SYSCALL64_slow_path+0x25/0x25
(...snipped...)
[  385.466224] MemAlloc-Info: stalling=1081 dying=36 exiting=0 victim=1 oom_count=143237/3490
[  385.472491] INFO: task kworker/0:0:4 blocked for more than 120 seconds.
[  385.477667]       Not tainted 4.5.0-rc5-next-20160224+ #75
---------- console log ----------

While the number of out_of_memory() calls keeps increasing over time ( 75777 -> 143237 ),
kswapd being stuck ( http://lkml.kernel.org/r/20160211225929.GU14668@dastard )
might be involved in this problem.

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-25 11:28   ` [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space Tetsuo Handa
@ 2016-02-25 11:31     ` Tetsuo Handa
  2016-02-25 14:16     ` Michal Hocko
  1 sibling, 0 replies; 112+ messages in thread
From: Tetsuo Handa @ 2016-02-25 11:31 UTC (permalink / raw)
  To: mhocko; +Cc: rientjes, hannes, linux-mm

Tetsuo Handa wrote:
> Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160225.txt.xz .
Sorry. Log for this report is http://I-love.SAKURA.ne.jp/tmp/serial-20160225-2.txt.xz .
> ---------- console log ----------
> (...snipped...)
> ---------- console log ----------

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-25 11:28   ` [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space Tetsuo Handa
  2016-02-25 11:31     ` Tetsuo Handa
@ 2016-02-25 14:16     ` Michal Hocko
  1 sibling, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-25 14:16 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: rientjes, hannes, linux-mm

On Thu 25-02-16 20:28:35, Tetsuo Handa wrote:
[...]
> Did you decide what to do if the OOM reaper is unable to take mmap_sem
> for the associated mm struct?

Not yet because I consider it outside of the context of this submission.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-02-22  9:41           ` Michal Hocko
  (?)
@ 2016-02-29  1:26           ` Tetsuo Handa
  -1 siblings, 0 replies; 112+ messages in thread
From: Tetsuo Handa @ 2016-02-29  1:26 UTC (permalink / raw)
  To: mhocko; +Cc: rientjes, linux-mm

Michal Hocko wrote:
> > Moreover, why don't you do like
> > 
> >   struct mm_struct {
> >   	(...snipped...)
> >   	struct list_head oom_reaper_list;
> >   	(...snipped...)
> >   }
> 
> Because we would need to search all tasks sharing the same mm in order
> to exit_oom_victim.

We should clear TIF_MEMDIE eventually in case reaping this mm was not sufficient.
But I'm not sure whether clearing TIF_MEMDIE immediately after reaping has completed
is preferable. Clearing TIF_MEMDIE some time later might be better.

For example,

	mutex_lock(&oom_lock);
	exit_oom_victim(tsk);
	mutex_unlock(&oom_lock);

will avoid a race between an ongoing select_bad_process() from out_of_memory()
and oom_reap_task().

For example,

	mutex_lock(&oom_lock);
	msleep(100);
	exit_oom_victim(tsk);
	mutex_unlock(&oom_lock);

would make it more unlikely to OOM-kill the next victim, by giving
get_page_from_freelist() some time to succeed.

> 
> > than
> > 
> >   struct task_struct {
> >   	(...snipped...)
> >   	struct list_head oom_reaper_list;
> >   	(...snipped...)
> >   }
> > 
> > so that we can update all ->oom_score_adj using this mm_struct for handling
> > crazy combo ( http://lkml.kernel.org/r/20160204163113.GF14425@dhcp22.suse.cz ) ?
> 
> I find it much easier to simply skip over tasks with MMF_OOM_KILLED
> when already selecting a victim. We won't need oom_score_adj games at
> all. This needs a deeper evaluation though. I didn't get to it yet,
> but the point of having an MMF flag which is not oom_reaper specific
> was to have it reusable in other contexts as well.
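
(For reference, that skip could look like this minimal sketch; it is
hypothetical, not a posted patch, and assumes an MMF_OOM_KILLED bit set
in mm->flags when the mm's owner is OOM-killed:)

----------
static bool oom_mm_already_killed(struct task_struct *p)
{
	struct task_struct *t = find_lock_task_mm(p);
	bool killed = false;

	if (t) {
		killed = test_bit(MMF_OOM_KILLED, &t->mm->flags);
		task_unlock(t);
	}
	/* select_bad_process() would skip tasks for which this is true */
	return killed;
}
----------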

I think the patch below can handle the crazy combo without using an MMF flag.

----------
From 47c4bba988eae2f761894c679e6f57668d4cacd3 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Mon, 29 Feb 2016 09:57:30 +0900
Subject: [PATCH 1/2] mm,oom: Mark OOM-killed processes as no longer
 OOM-unkillable.

Since OOM_SCORE_ADJ_MIN means that the OOM killer must not send SIGKILL
to that process, changing oom_score_adj to OOM_SCORE_ADJ_MIN as soon as
SIGKILL has been sent to that process should be OK. This will also solve
the problem that SysRq-f chooses the same process forever.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5d5eca9..f5dddc3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -497,7 +497,6 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 * improvements. This also means that selecting this task doesn't
 	 * make any sense.
 	 */
-	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
 	exit_oom_victim(tsk);
 out:
 	mmput(mm);
@@ -775,8 +774,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	for_each_process(p) {
 		if (!process_shares_mm(p, mm))
 			continue;
-		if (same_thread_group(p, victim))
-			continue;
 		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
 		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
 			/*
@@ -788,6 +785,9 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		}
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+		task_lock(p);
+		p->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+		task_unlock(p);
 	}
 	rcu_read_unlock();
 
----------

Regarding the oom_kill_allocating_task = 1 case, we can do the below if we want to
OOM-kill current before trying to OOM-kill children of current.

----------
From b0f38de7f07328bca65c5d4b0361d5f73283a6e4 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Mon, 29 Feb 2016 10:03:13 +0900
Subject: [PATCH 2/2] mm,oom: Don't try to OOM-kill children if
 oom_kill_allocating_task = 1.

Trying to OOM-kill children of current can kill more processes than
needed when the OOM-killed children cannot release memory immediately,
because current does not wait for them to release memory.
oom_kill_allocating_task = 1 should try to OOM-kill current first.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f5dddc3..9098d48 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -682,6 +682,13 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
 	bool can_oom_reap;
+	const bool kill_current = !p;
+
+	if (kill_current) {
+		p = current;
+		victim = p;
+		get_task_struct(p);
+	}
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -702,6 +709,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);
 
+	if (kill_current)
+		goto kill;
 	/*
 	 * If any of p's children has a different mm and is eligible for kill,
 	 * the one with the highest oom_badness() score is sacrificed for its
@@ -730,6 +739,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	}
 	read_unlock(&tasklist_lock);
 
+kill:
 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
@@ -889,8 +899,7 @@ bool out_of_memory(struct oom_control *oc)
 	if (sysctl_oom_kill_allocating_task && current->mm &&
 	    !oom_unkillable_task(current, NULL, oc->nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
-		get_task_struct(current);
-		oom_kill_process(oc, current, 0, totalpages, NULL,
+		oom_kill_process(oc, NULL, 0, totalpages, NULL,
 				 "Out of memory (oom_kill_allocating_task)");
 		return true;
 	}
----------

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH 2/5] oom reaper: handle mlocked pages
  2016-02-23 13:21       ` Michal Hocko
@ 2016-02-29  3:19         ` Hugh Dickins
  -1 siblings, 0 replies; 112+ messages in thread
From: Hugh Dickins @ 2016-02-29  3:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, Andrew Morton, Mel Gorman, Tetsuo Handa,
	Oleg Nesterov, Linus Torvalds, Hugh Dickins, Andrea Argangeli,
	Rik van Riel, linux-mm, LKML

On Tue, 23 Feb 2016, Michal Hocko wrote:
> On Mon 22-02-16 17:36:07, David Rientjes wrote:
> > 
> > Are we concerned about munlock_vma_pages_all() taking lock_page() and 
> > perhaps stalling forever, the same way it would stall in exit_mmap() for 
> > VM_LOCKED vmas, if another thread has locked the same page and is doing an 
> > allocation?
> 
> This is a good question. I have checked for that particular case
> previously and managed to convince myself that this is OK(ish).
> munlock_vma_pages_range locks only THP pages to prevent the
> parallel split-up AFAICS.

I think you're mistaken on that: there is also the lock_page()
on every page in Phase 2 of __munlock_pagevec().

> And split_huge_page_to_list doesn't seem
> to depend on an allocation. It can block on anon_vma lock but I didn't
> see any allocation requests from there either. I might be missing
> something of course. Do you have any specific path in mind?
> 
> > I'm wondering if in that case it would be better to do a 
> > best-effort munlock_vma_pages_all() with trylock_page() and just give up 
> > on releasing memory from that particular vma.  In that case, there may be 
> > other memory that can be freed with unmap_page_range() that would handle 
> > this livelock.

I agree with David, that we ought to trylock_page() throughout munlock:
just so long as it gets to do the TestClearPageMlocked without demanding
page lock, the rest is the usual sugarcoating for accurate Mlocked stats,
and leave the rest for reclaim to fix up.
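
Concretely, the per-page core could reduce to something like this (a
sketch of the trylock idea only, not a posted patch; the
__munlock_fixup_lru() helper is hypothetical and stands in for the
accurate-stats isolation work):

----------
static void munlock_page_best_effort(struct page *page)
{
	if (!TestClearPageMlocked(page))
		return;

	/* the stats follow the flag unconditionally (base page case) */
	dec_zone_page_state(page, NR_MLOCK);

	if (trylock_page(page)) {
		/* hypothetical helper: move the page back to an
		   evictable LRU right away */
		__munlock_fixup_lru(page);
		unlock_page(page);
	}
	/* else vmscan sees !PageMlocked and rescues the page later */
}
----------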

> 
> I have tried to code it up but I am not really sure the whole churn is
> really worth it - unless I am missing something that would really make
> the THP case likely to hit in the real life.

Though I must have known about it forever, it was a shock to see all
those page locks demanded in exit, brought home to us a week or so ago.

The proximate cause in this case was my own change, to defer pte_alloc
to suit huge tmpfs: it had not previously occurred to me that I was
now doing the pte_alloc while __do_fault holds page lock.  Bad Hugh.
But change not yet upstream, so not so urgent for you.

From time immemorial, free_swap_and_cache() and free_swap_cache() only
ever trylock a page, precisely so that they never hold up munmap or exit
(well, if I looked harder, I might find lock ordering reasons too).

> 
> Just for the reference this is what I came up with (just compile tested).

I tried something similar internally (on an earlier kernel).  Like
you, I've set that work aside for now; there were quicker ways to fix
the issue at hand.  But it does continue to offend me that munlock
demands all those page locks: so if you don't get back to it before me,
I shall eventually.

I didn't understand why you complicated yours with the "enforce"
arg to munlock_vma_pages_range(): why not just trylock in all cases?

Hugh

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 2/5] oom reaper: handle mlocked pages
  2016-02-29  3:19         ` Hugh Dickins
@ 2016-02-29 13:41           ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-02-29 13:41 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Rientjes, Andrew Morton, Mel Gorman, Tetsuo Handa,
	Oleg Nesterov, Linus Torvalds, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Sun 28-02-16 19:19:11, Hugh Dickins wrote:
> On Tue, 23 Feb 2016, Michal Hocko wrote:
> > On Mon 22-02-16 17:36:07, David Rientjes wrote:
> > > 
> > > Are we concerned about munlock_vma_pages_all() taking lock_page() and 
> > > perhaps stalling forever, the same way it would stall in exit_mmap() for 
> > > VM_LOCKED vmas, if another thread has locked the same page and is doing an 
> > > allocation?
> > 
> > This is a good question. I have checked for that particular case
> > previously and managed to convince myself that this is OK(ish).
> > munlock_vma_pages_range locks only THP pages to prevent the
> > parallel split-up AFAICS.
> 
> I think you're mistaken on that: there is also the lock_page()
> on every page in Phase 2 of __munlock_pagevec().

Ohh, I have missed that one. Thanks for pointing it out!

[...]
> > Just for the reference this is what I came up with (just compile tested).
> 
> I tried something similar internally (on an earlier kernel).  Like
> you I've set that work aside for now, there were quicker ways to fix
> the issue at hand.  But it does continue to offend me that munlock
> demands all those page locks: so if you don't get back to it before me,
> I shall eventually.
> 
> I didn't understand why you complicated yours with the "enforce"
> arg to munlock_vma_pages_range(): why not just trylock in all cases?

Well, I have to confess that I am not really sure I understand all the
consequences of the locking here. There have always been subtle and weird
issues popping up from time to time. So I only wanted to have that
change limited to the oom_reaper, and I would really appreciate it if
somebody more knowledgeable had a look. We can drop the mlock patch for
now.

Thanks for looking into this, Hugh!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 2/5] oom reaper: handle mlocked pages
  2016-02-29 13:41           ` Michal Hocko
@ 2016-03-08 13:40             ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-03-08 13:40 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Rientjes, Andrew Morton, Mel Gorman, Tetsuo Handa,
	Oleg Nesterov, Linus Torvalds, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Mon 29-02-16 14:41:39, Michal Hocko wrote:
> On Sun 28-02-16 19:19:11, Hugh Dickins wrote:
> > On Tue, 23 Feb 2016, Michal Hocko wrote:
> > > On Mon 22-02-16 17:36:07, David Rientjes wrote:
> > > > 
> > > > Are we concerned about munlock_vma_pages_all() taking lock_page() and 
> > > > perhaps stalling forever, the same way it would stall in exit_mmap() for 
> > > > VM_LOCKED vmas, if another thread has locked the same page and is doing an 
> > > > allocation?
> > > 
> > > This is a good question. I have checked for that particular case
> > > previously and managed to convince myself that this is OK(ish).
> > > munlock_vma_pages_range locks only THP pages to prevent the
> > > parallel split-up AFAICS.
> > 
> > I think you're mistaken on that: there is also the lock_page()
> > on every page in Phase 2 of __munlock_pagevec().
> 
> Ohh, I have missed that one. Thanks for pointing it out!
> 
> [...]
> > > Just for the reference this is what I came up with (just compile tested).
> > 
> > I tried something similar internally (on an earlier kernel).  Like
> > you I've set that work aside for now, there were quicker ways to fix
> > the issue at hand.  But it does continue to offend me that munlock
> > demands all those page locks: so if you don't get back to it before me,
> > I shall eventually.
> > 
> > I didn't understand why you complicated yours with the "enforce"
> > arg to munlock_vma_pages_range(): why not just trylock in all cases?
> 
> Well, I have to confess that I am not really sure I understand all the
> consequences of the locking here. There have always been subtle and weird
> issues popping up from time to time. So I only wanted to have that
> change limited to the oom_reaper, and I would really appreciate it if
> somebody more knowledgeable had a look. We can drop the mlock patch for
> now.

According to the rc7 announcement it seems we are approaching the merge
window. Should we drop the patch for now, or is the risk of the lockup
too low to care about, so that we keep it in (it might already be useful)
and change the munlock path to not depend on page locks later on?

I am OK with both ways.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 2/5] oom reaper: handle mlocked pages
  2016-03-08 13:40             ` Michal Hocko
@ 2016-03-08 20:07               ` Hugh Dickins
  -1 siblings, 0 replies; 112+ messages in thread
From: Hugh Dickins @ 2016-03-08 20:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, David Rientjes, Andrew Morton, Mel Gorman,
	Tetsuo Handa, Oleg Nesterov, Linus Torvalds, Andrea Argangeli,
	Rik van Riel, linux-mm, LKML

On Tue, 8 Mar 2016, Michal Hocko wrote:
> On Mon 29-02-16 14:41:39, Michal Hocko wrote:
> > On Sun 28-02-16 19:19:11, Hugh Dickins wrote:
> > > On Tue, 23 Feb 2016, Michal Hocko wrote:
> > > > On Mon 22-02-16 17:36:07, David Rientjes wrote:
> > > > > 
> > > > > Are we concerned about munlock_vma_pages_all() taking lock_page() and 
> > > > > perhaps stalling forever, the same way it would stall in exit_mmap() for 
> > > > > VM_LOCKED vmas, if another thread has locked the same page and is doing an 
> > > > > allocation?
> > > > 
> > > > This is a good question. I have checked for that particular case
> > > > previously and managed to convince myself that this is OK(ish).
> > > > munlock_vma_pages_range locks only THP pages to prevent the
> > > > parallel split-up AFAICS.
> > > 
> > > I think you're mistaken on that: there is also the lock_page()
> > > on every page in Phase 2 of __munlock_pagevec().
> > 
> > Ohh, I have missed that one. Thanks for pointing it out!
> > 
> > [...]
> > > > Just for the reference this is what I came up with (just compile tested).
> > > 
> > > I tried something similar internally (on an earlier kernel).  Like
> > > you I've set that work aside for now, there were quicker ways to fix
> > > the issue at hand.  But it does continue to offend me that munlock
> > > demands all those page locks: so if you don't get back to it before me,
> > > I shall eventually.
> > > 
> > > I didn't understand why you complicated yours with the "enforce"
> > > arg to munlock_vma_pages_range(): why not just trylock in all cases?
> > 
> > Well, I have to confess that I am not really sure I understand all the
> > consequences of the locking here. There have always been subtle and weird
> > issues popping up from time to time. So I only wanted to have that
> > change limited to the oom_reaper, and I would really appreciate it if
> > somebody more knowledgeable had a look. We can drop the mlock patch for
> > now.
> 
> According to the rc7 announcement it seems we are approaching the merge
> window. Should we drop the patch for now, or is the risk of the lockup
> too low to care about, so that we keep it in (it might already be useful)
> and change the munlock path to not depend on page locks later on?
> 
> I am OK with both ways.

You're asking about the Subject patch, "oom reaper: handle mlocked pages",
I presume.  Your Work-In-Progress mods to munlock_vma_pages_range() should
certainly be dropped for now, and revisited by one of us another time.

I vote for dropping "oom reaper: handle mlocked pages" for now too.
If I understand correctly, the purpose of the oom reaper is to free up
as much memory from the targeted task as possible, while avoiding getting
stuck on locks; in advance of the task actually exiting and doing the
freeing itself, but perhaps getting stuck on locks as it does so.

If that's a fair description, then it's inappropriate for the oom reaper
to call munlock_vma_pages_all(), with the risk of getting stuck on many
page locks; best leave that risk to the task when it exits as at present.
Of course we should come back to this later, fix munlock_vma_pages_range()
with trylocks (on the pages only? rmap mutexes also?), and then integrate
"oom reaper: handle mlocked pages".

(Or if we had the old mechanism for scanning unevictable lrus on demand,
perhaps simply not avoid the VM_LOCKED vmas in __oom_reap_vmas(), let
the clear_page_mlock() in page_remove_*rmap() handle all the singly
mapped and mlocked pages, and un-mlock the rest by scanning unevictables.)

Hugh

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 2/5] oom reaper: handle mlocked pages
  2016-03-08 20:07               ` Hugh Dickins
@ 2016-03-09  8:26                 ` Michal Hocko
  -1 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-03-09  8:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Rientjes, Andrew Morton, Mel Gorman, Tetsuo Handa,
	Oleg Nesterov, Linus Torvalds, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Tue 08-03-16 12:07:24, Hugh Dickins wrote:
> On Tue, 8 Mar 2016, Michal Hocko wrote:
> > On Mon 29-02-16 14:41:39, Michal Hocko wrote:
> > > On Sun 28-02-16 19:19:11, Hugh Dickins wrote:
> > > > On Tue, 23 Feb 2016, Michal Hocko wrote:
> > > > > On Mon 22-02-16 17:36:07, David Rientjes wrote:
> > > > > > 
> > > > > > Are we concerned about munlock_vma_pages_all() taking lock_page() and 
> > > > > > perhaps stalling forever, the same way it would stall in exit_mmap() for 
> > > > > > VM_LOCKED vmas, if another thread has locked the same page and is doing an 
> > > > > > allocation?
> > > > > 
> > > > > This is a good question. I have checked for that particular case
> > > > > previously and managed to convince myself that this is OK(ish).
> > > > > munlock_vma_pages_range locks only THP pages to prevent the
> > > > > parallel split-up AFAICS.
> > > > 
> > > > I think you're mistaken on that: there is also the lock_page()
> > > > on every page in Phase 2 of __munlock_pagevec().
> > > 
> > > Ohh, I have missed that one. Thanks for pointing it out!
> > > 
> > > [...]
> > > > > Just for the reference this is what I came up with (just compile tested).
> > > > 
> > > > I tried something similar internally (on an earlier kernel).  Like
> > > > you I've set that work aside for now, there were quicker ways to fix
> > > > the issue at hand.  But it does continue to offend me that munlock
> > > > demands all those page locks: so if you don't get back to it before me,
> > > > I shall eventually.
> > > > 
> > > > I didn't understand why you complicated yours with the "enforce"
> > > > arg to munlock_vma_pages_range(): why not just trylock in all cases?
> > > 
> > > Well, I have to confess that I am not really sure I understand all the
> > > consequences of the locking here. There have always been subtle and weird
> > > issues popping up from time to time. So I only wanted to have that
> > > change limited to the oom_reaper, and I would really appreciate it if
> > > somebody more knowledgeable had a look. We can drop the mlock patch for
> > > now.
> > 
> > According to the rc7 announcement it seems we are approaching the merge
> > window. Should we drop the patch for now, or is the risk of the lockup
> > too low to care about, so that we keep it in (it might already be useful)
> > and change the munlock path to not depend on page locks later on?
> > 
> > I am OK with both ways.
> 
> You're asking about the Subject patch, "oom reaper: handle mlocked pages",
> I presume.  Your Work-In-Progress mods to munlock_vma_pages_range() should
> certainly be dropped for now, and revisited by one of us another time.

I believe it hasn't landed in the mmotm yet.

> I vote for dropping "oom reaper: handle mlocked pages" for now too.

OK, Andrew, could you drop oom-reaper-handle-mlocked-pages.patch for
now? We will revisit it later on after we make the munlock path page
lock free.

> If I understand correctly, the purpose of the oom reaper is to free up
> as much memory from the targeted task as possible, while avoiding getting
> stuck on locks; in advance of the task actually exiting and doing the
> freeing itself, but perhaps getting stuck on locks as it does so.
> 
> If that's a fair description, then it's inappropriate for the oom reaper
> to call munlock_vma_pages_all(), with the risk of getting stuck on many
> page locks; best leave that risk to the task when it exits as at present.
> Of course we should come back to this later, fix munlock_vma_pages_range()
> with trylocks (on the pages only? rmap mutexes also?), and then integrate
> "oom reaper: handle mlocked pages".

Fair enough.

> (Or if we had the old mechanism for scanning unevictable lrus on demand,
> perhaps simply not avoid the VM_LOCKED vmas in __oom_reap_vmas(), let
> the clear_page_mlock() in page_remove_*rmap() handle all the singly
> mapped and mlocked pages, and un-mlock the rest by scanning unevictables.)

I will have a look at this possibility as well.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 2/5] oom reaper: handle mlocked pages
@ 2016-03-09  8:26                 ` Michal Hocko
  0 siblings, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-03-09  8:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Rientjes, Andrew Morton, Mel Gorman, Tetsuo Handa,
	Oleg Nesterov, Linus Torvalds, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Tue 08-03-16 12:07:24, Hugh Dickins wrote:
> On Tue, 8 Mar 2016, Michal Hocko wrote:
> > On Mon 29-02-16 14:41:39, Michal Hocko wrote:
> > > On Sun 28-02-16 19:19:11, Hugh Dickins wrote:
> > > > On Tue, 23 Feb 2016, Michal Hocko wrote:
> > > > > On Mon 22-02-16 17:36:07, David Rientjes wrote:
> > > > > > 
> > > > > > Are we concerned about munlock_vma_pages_all() taking lock_page() and 
> > > > > > perhaps stalling forever, the same way it would stall in exit_mmap() for 
> > > > > > VM_LOCKED vmas, if another thread has locked the same page and is doing an 
> > > > > > allocation?
> > > > > 
> > > > > This is a good question. I have checked for that particular case
> > > > > previously and managed to convinced myself that this is OK(ish).
> > > > > munlock_vma_pages_range locks only THP pages to prevent from the
> > > > > parallel split-up AFAICS.
> > > > 
> > > > I think you're mistaken on that: there is also the lock_page()
> > > > on every page in Phase 2 of __munlock_pagevec().
> > > 
> > > Ohh, I have missed that one. Thanks for pointing it out!
> > > 
> > > [...]
> > > > > Just for the reference this is what I came up with (just compile tested).
> > > > 
> > > > I tried something similar internally (on an earlier kernel).  Like
> > > > you I've set that work aside for now, there were quicker ways to fix
> > > > the issue at hand.  But it does continue to offend me that munlock
> > > > demands all those page locks: so if you don't get back to it before me,
> > > > I shall eventually.
> > > > 
> > > > I didn't understand why you complicated yours with the "enforce"
> > > > arg to munlock_vma_pages_range(): why not just trylock in all cases?
> > > 
> > > Well, I have to confess that I am not really sure I understand all the
> > > consequences of the locking here. It has always been subtle and weird
> > > issues popping up from time to time. So I only wanted to have that
> > > change limitted to the oom_reaper. So I would really appreciate if
> > > somebody more knowledgeable had a look. We can drop the mlock patch for
> > > now.
> > 
> > According to the rc7 announcement it seems we are approaching the merge
> > window. Should we drop the patch for now or the risk of the lockup is
> > too low to care about and keep it in for now as it might be already
> > useful and change the munlock path to not depend on page locks later on?
> > 
> > I am OK with both ways.
> 
> You're asking about the Subject patch, "oom reaper: handle mlocked pages",
> I presume.  Your Work-In-Progress mods to munlock_vma_pages_range() should
> certainly be dropped for now, and revisited by one of us another time.

I believe it hasn't landed in the mmotm yet.

> I vote for dropping "oom reaper: handle mlocked pages" for now too.

OK, Andrew, could you drop oom-reaper-handle-mlocked-pages.patch for
now. We will revisit it later on after we make the munlock path page
lock free.

> If I understand correctly, the purpose of the oom reaper is to free up
> as much memory from the targeted task as possible, while avoiding getting
> stuck on locks; in advance of the task actually exiting and doing the
> freeing itself, but perhaps getting stuck on locks as it does so.
> 
> If that's a fair description, then it's inappropriate for the oom reaper
> to call munlock_vma_pages_all(), with the risk of getting stuck on many
> page locks; best leave that risk to the task when it exits as at present.
> Of course we should come back to this later, fix munlock_vma_pages_range()
> with trylocks (on the pages only? rmap mutexes also?), and then integrate
> "oom reaper: handle mlocked pages".

Fair enough.

> (Or if we had the old mechanism for scanning unevictable lrus on demand,
> perhaps simply not avoid the VM_LOCKED vmas in __oom_reap_vmas(), let
> the clear_page_mlock() in page_remove_*rmap() handle all the singly
> mapped and mlocked pages, and un-mlock the rest by scanning unevictables.)

I will have a look at this possibility as well.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-02-22  9:41           ` Michal Hocko
  (?)
  (?)
@ 2016-03-15 11:15           ` Tetsuo Handa
  2016-03-15 11:43             ` Michal Hocko
  -1 siblings, 1 reply; 112+ messages in thread
From: Tetsuo Handa @ 2016-03-15 11:15 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm

I was trying to test the side effects of "oom, oom_reaper: disable oom_reaper for
oom_kill_allocating_task" compared to "oom: clear TIF_MEMDIE after oom_reaper
managed to unmap the address space", and I consider that both patches are
doing the wrong thing.



Regarding the former patch, setting MMF_OOM_KILLED before checking whether
that mm is reapable is over-killing the OOM reaper, i.e. it disables reaping
more often than necessary. The intent of that patch was to avoid
oom_reaper_list corruption. We can avoid the list corruption by doing
something like the below.

----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 23b8b06..0464727 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -543,7 +543,7 @@ static int oom_reaper(void *unused)
 
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	if (!oom_reaper_th)
+	if (!oom_reaper_th || tsk->oom_reaper_list)
 		return;
 
 	get_task_struct(tsk);
@@ -677,7 +677,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap;
+	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -740,9 +740,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	mm = victim->mm;
 	atomic_inc(&mm->mm_count);
 
-	/* Make sure we do not try to oom reap the mm multiple times */
-	can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags);
-
 	/*
 	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
 	 * the OOM victim from depleting the memory reserves from the user
----------

Two thread groups sharing the same mm can disable the OOM reaper
when all threads in the former thread group (which will be chosen
as an OOM victim by the OOM killer) can immediately call exit_mm()
via do_exit() (e.g. simply sleeping in killable state when the OOM
killer chooses that thread group) and some thread in the latter thread
group is contended on unkillable locks (e.g. inode mutex), due to

	p = find_lock_task_mm(tsk);
	if (!p)
		return true;

in __oom_reap_task() and

	can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags);

in oom_kill_process(). The OOM reaper is woken up in order to reap
the former thread group's memory, but it does nothing on the latter
thread group's memory because the former thread group can clear its mm
before the OOM reaper locks its mm. Even if subsequent out_of_memory()
call chose the latter thread group, the OOM reaper will not be woken up.
No memory is reaped. We need to queue all thread groups sharing that
memory if that memory should be reaped.

----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 23b8b06..93867d2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -543,7 +543,7 @@ static int oom_reaper(void *unused)
 
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	if (!oom_reaper_th)
+	if (!oom_reaper_th || tsk->oom_reaper_list)
 		return;
 
 	get_task_struct(tsk);
@@ -677,7 +677,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap;
+	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -740,9 +740,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	mm = victim->mm;
 	atomic_inc(&mm->mm_count);
 
-	/* Make sure we do not try to oom reap the mm multiple times */
-	can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags);
-
 	/*
 	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
 	 * the OOM victim from depleting the memory reserves from the user
@@ -757,6 +754,25 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 		K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
 	task_unlock(victim);
 
+	rcu_read_lock();
+	for_each_process(p) {
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (same_thread_group(p, victim))
+			continue;
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+			/*
+			 * We cannot use oom_reaper for the mm shared by this
+			 * process because it wouldn't get killed and so the
+			 * memory might be still used.
+			 */
+			can_oom_reap = false;
+			break;
+		}
+	}
+	rcu_read_unlock();
+
 	/*
 	 * Kill all user processes sharing victim->mm in other thread groups, if
 	 * any.  They don't get access to memory reserves, though, to avoid
@@ -773,16 +789,11 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 		if (same_thread_group(p, victim))
 			continue;
 		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-			/*
-			 * We cannot use oom_reaper for the mm shared by this
-			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
-			 */
-			can_oom_reap = false;
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 			continue;
-		}
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+		if (can_oom_reap)
+			wake_oom_reaper(p);
 	}
 	rcu_read_unlock();
 
----------

Michal Hocko wrote:
> I find it much easier to simply skip over tasks with MMF_OOM_KILLED
> when already selecting a victim. We won't need oom_score_adj games at
> all. This needs a deeper evaluation though. I didn't get to it yet,
> but the point of having an MMF flag which is not oom_reaper specific
> was to have it reusable in other contexts as well.

Using MMF_OOM_KILLED to exclude tasks from the OOM victim candidates is also
wrong. It gives immunity against the OOM reaper when some thread group sharing
the OOM victim's mm switches its oom_score_adj value depending on its current
role. For example, when its role is to follow up an OOM-killed thread group
(like an init process that respawns applications), it would set oom_score_adj
to OOM_SCORE_ADJ_MIN. It can update oom_score_adj to !OOM_SCORE_ADJ_MIN when
its role changes. Well, what is the to-be-OOM-killed thread group's role?
I don't know. Such a thread group might be living as a canary for detecting
the OOM condition. Such a thread group might be doing memory-intensive
operations using a pipe. What is important is that the kernel should not make
assumptions about what userspace programs will do.
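
For illustration, such a role switch needs nothing beyond the stock procfs
knob. A minimal userspace sketch of a hypothetical respawner (not an
existing program):

	#include <stdio.h>

	/* Make this process OOM-immune while it only watches, and an
	 * ordinary OOM candidate again once it takes over real work. */
	static void set_oom_score_adj(int adj)
	{
		FILE *f = fopen("/proc/self/oom_score_adj", "w");

		if (!f)
			return;
		fprintf(f, "%d\n", adj);
		fclose(f);
	}

	int main(void)
	{
		set_oom_score_adj(-1000);	/* OOM_SCORE_ADJ_MIN: watching */
		/* ... wait for the watched application to be OOM-killed ... */
		set_oom_score_adj(0);		/* normal candidate again */
		return 0;
	}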

Like I wrote at http://lkml.kernel.org/r/201602291026.CJB64059.FtSLOOFFJHMOQV@I-love.SAKURA.ne.jp ,
we can fix sysctl_oom_kill_allocating_task = 1 to literally kill
the allocating task rather than needlessly disabling the OOM reaper using
MMF_OOM_KILLED.



Regarding the latter patch (assuming that the former patch is reverted),
updating oom_score_adj to OOM_SCORE_ADJ_MIN only when reaping succeeded
confuses the OOM killer.

Two thread groups sharing the same mm can disable the OOM reaper when
the memory of some thread in the former thread group is reaped and the
latter thread group is blocked on unkillable locks (e.g. an inode mutex),
due to

	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;

in __oom_reap_task(). The OOM reaper is woken up and reaps the former
thread group's memory, but it does not update the latter thread group's
oom_score_adj to OOM_SCORE_ADJ_MIN. Even if a subsequent out_of_memory()
call chose the latter thread group, the OOM reaper would not be woken up
due to the p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN check. Therefore,
no more memory is reaped. We should update the oom_score_adj of all thread
groups sharing that memory to OOM_SCORE_ADJ_MIN as of calling
wake_oom_reaper() or as of leaving oom_reap_task().

If we queue all thread groups sharing that memory (as suggested above),
we will be able to update the oom_score_adj of all thread groups sharing
that memory at oom_reap_task() regardless of whether the OOM reaper
succeeded in reaping that memory. But I prefer updating oom_score_adj as
of calling wake_oom_reaper(), as shown below.

----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 23b8b06..fc389b8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -493,7 +493,6 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	 * improvements. This also means that selecting this task doesn't
 	 * make any sense.
 	 */
-	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
 	exit_oom_victim(tsk);
 out:
 	mmput(mm);
@@ -543,7 +542,7 @@ static int oom_reaper(void *unused)
 
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	if (!oom_reaper_th)
+	if (!oom_reaper_th || tsk->oom_reaper_list)
 		return;
 
 	get_task_struct(tsk);
@@ -677,7 +676,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap;
+	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -740,9 +739,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	mm = victim->mm;
 	atomic_inc(&mm->mm_count);
 
-	/* Make sure we do not try to oom reap the mm multiple times */
-	can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags);
-
 	/*
 	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
 	 * the OOM victim from depleting the memory reserves from the user
@@ -756,16 +752,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 		K(get_mm_counter(victim->mm, MM_FILEPAGES)),
 		K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
 	task_unlock(victim);
+	put_task_struct(victim);
 
-	/*
-	 * Kill all user processes sharing victim->mm in other thread groups, if
-	 * any.  They don't get access to memory reserves, though, to avoid
-	 * depletion of all memory.  This prevents mm->mmap_sem livelock when an
-	 * oom killed thread cannot exit because it requires the semaphore and
-	 * its contended by another thread trying to allocate memory itself.
-	 * That thread will now get access to memory reserves since it has a
-	 * pending fatal signal.
-	 */
 	rcu_read_lock();
 	for_each_process(p) {
 		if (!process_shares_mm(p, mm))
@@ -780,17 +768,70 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			 * memory might be still used.
 			 */
 			can_oom_reap = false;
-			continue;
+			break;
 		}
-		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
+	/*
+	 * Kill all OOM-killable user threads sharing victim->mm.
+	 *
+	 * This mitigates mm->mmap_sem livelock caused by one thread being
+	 * unable to release its mm due to being blocked at
+	 * down_read(&mm->mmap_sem) in exit_mm() while another thread is doing
+	 * __GFP_FS allocation with mm->mmap_sem held for write.
+	 *
+	 * They all get access to memory reserves.
+	 *
+	 * This prevents the OOM killer from choosing next OOM victim as soon
+	 * as current victim thread released its mm. This also mitigates kernel
+	 * log buffer being spammed by OOM killer messages due to choosing next
+	 * OOM victim thread sharing the current OOM victim's memory.
+	 *
+	 * This mitigates the problem that a thread doing __GFP_FS allocation
+	 * with mmap_sem held for write cannot call out_of_memory() for
+	 * unpredictable duration due to oom_lock contention and/or scheduling
+	 * priority, for the OOM reaper will not wait forever until such thread
+	 * leaves memory allocating loop by calling out_of_memory(). This also
+	 * mitigates the problem that a thread doing !__GFP_FS && !__GFP_NOFAIL
+	 * allocation cannot leave memory allocating loop because it cannot
+	 * call out_of_memory() even after it is killed.
+	 *
+	 * They are marked as OOM-unkillable. While it would be possible that
+	 * somebody marks them as OOM-killable again, it does not matter much
+	 * because they are already killed (and scheduled for OOM reap if
+	 * possible).
+	 *
+	 * This mitigates the problem that SysRq-f continues choosing the same
+	 * process. We still need SysRq-f because it is possible that victim
+	 * threads are blocked at unkillable locks inside memory allocating
+	 * loop (e.g. fs writeback from direct reclaim) even after they got
+	 * SIGKILL and TIF_MEMDIE.
+	 */
+	rcu_read_lock();
+	for_each_process(p) {
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+			continue;
+		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+		for_each_thread(p, t) {
+			task_lock(t);
+			if (t->mm) {
+				mark_oom_victim(t);
+				if (can_oom_reap)
+					wake_oom_reaper(t);
+			}
+			task_unlock(t);
+		}
+		task_lock(p);
+		p->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+		task_unlock(p);
+	}
+	rcu_read_unlock();
 
 	mmdrop(mm);
-	put_task_struct(victim);
 }
 #undef K
 
----------

Like I wrote at http://lkml.kernel.org/r/201602182021.EEH86916.JOLtFFVHOOMQFS@I-love.SAKURA.ne.jp ,
TIF_MEMDIE heuristics work on a per-task_struct basis but the OOM-kill
operation works on a per-signal_struct or per-mm_struct basis. If you don't
like oom_score_adj games, I'm fine with a per-signal_struct flag (like
OOM_FLAG_ORIGIN) that acts as the adj == OOM_SCORE_ADJ_MIN test in
oom_badness(). I don't think over-killing the OOM reaper by using a
per-mm_struct flag (i.e. MMF_OOM_KILLED) is a good approach.
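
For illustration, such a flag could live next to OOM_FLAG_ORIGIN in
signal_struct->oom_flags and be tested where oom_badness() already tests
the adj value. A sketch only; OOM_FLAG_KILLED is a made-up name:

	/* hypothetical companion to OOM_FLAG_ORIGIN */
	#define OOM_FLAG_KILLED		0x2

	/* in oom_badness(), next to the existing OOM_SCORE_ADJ_MIN test: */
	adj = (long)p->signal->oom_score_adj;
	if (adj == OOM_SCORE_ADJ_MIN ||
	    (p->signal->oom_flags & OOM_FLAG_KILLED)) {
		task_unlock(p);
		return 0;
	}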


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-03-15 11:15           ` Tetsuo Handa
@ 2016-03-15 11:43             ` Michal Hocko
  2016-03-15 11:50               ` Michal Hocko
  0 siblings, 1 reply; 112+ messages in thread
From: Michal Hocko @ 2016-03-15 11:43 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Tue 15-03-16 20:15:24, Tetsuo Handa wrote:
[...]
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 23b8b06..0464727 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -543,7 +543,7 @@ static int oom_reaper(void *unused)
>  
>  static void wake_oom_reaper(struct task_struct *tsk)
>  {
> -	if (!oom_reaper_th)
> +	if (!oom_reaper_th || tsk->oom_reaper_list)
>  		return;
>  
>  	get_task_struct(tsk);

OK, this is indeed much simpler. Back then, when the list_head was used,
this would have been harder, and I didn't realize that moving away from
the list_head would simplify this as well. Care to send a patch to
replace the
oom-oom_reaper-disable-oom_reaper-for-oom_kill_allocating_task.patch in
the mmotm tree?

> Two thread groups sharing the same mm can disable the OOM reaper
> when all threads in the former thread group (which will be chosen
> as an OOM victim by the OOM killer) can immediately call exit_mm()
> via do_exit() (e.g. simply sleeping in killable state when the OOM
> killer chooses that thread group) and some thread in the latter thread
> group is contended on unkillable locks (e.g. inode mutex), due to
> 
> 	p = find_lock_task_mm(tsk);
> 	if (!p)
> 		return true;
> 
> in __oom_reap_task() and
> 
> 	can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags);
> 
> in oom_kill_process(). The OOM reaper is woken up in order to reap
> the former thread group's memory, but it does nothing on the latter
> thread group's memory because the former thread group can clear its mm
> before the OOM reaper locks its mm. Even if subsequent out_of_memory()
> call chose the latter thread group, the OOM reaper will not be woken up.
> No memory is reaped. We need to queue all thread groups sharing that
> memory if that memory should be reaped.

Why wouldn't it be enough to wake the oom reaper only for the oom
victims? If the oom reaper races with the victim's exit path then
the next round of out_of_memory() will select a different thread
sharing the same mm.

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-03-15 11:43             ` Michal Hocko
@ 2016-03-15 11:50               ` Michal Hocko
  2016-03-16 11:16                 ` Tetsuo Handa
  0 siblings, 1 reply; 112+ messages in thread
From: Michal Hocko @ 2016-03-15 11:50 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Tue 15-03-16 12:43:00, Michal Hocko wrote:
> On Tue 15-03-16 20:15:24, Tetsuo Handa wrote:
[...]
> > Two thread groups sharing the same mm can disable the OOM reaper
> > when all threads in the former thread group (which will be chosen
> > as an OOM victim by the OOM killer) can immediately call exit_mm()
> > via do_exit() (e.g. simply sleeping in killable state when the OOM
> > killer chooses that thread group) and some thread in the latter thread
> > group is contended on unkillable locks (e.g. inode mutex), due to
> > 
> > 	p = find_lock_task_mm(tsk);
> > 	if (!p)
> > 		return true;
> > 
> > in __oom_reap_task() and
> > 
> > 	can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags);
> > 
> > in oom_kill_process(). The OOM reaper is woken up in order to reap
> > the former thread group's memory, but it does nothing on the latter
> > thread group's memory because the former thread group can clear its mm
> > before the OOM reaper locks its mm. Even if subsequent out_of_memory()
> > call chose the latter thread group, the OOM reaper will not be woken up.
> > No memory is reaped. We need to queue all thread groups sharing that
> > memory if that memory should be reaped.
> 
> Why wouldn't it be enough to wake the oom reaper only for the oom
> victims? If the oom reaper races with the victim's exit path then
> the next round of out_of_memory() will select a different thread
> sharing the same mm.

And just to prevent confusion: I mean waking up also when
fatal_signal_pending is true and we do not really go down to selecting
an oom victim. That would be worth a separate patch on top, of course.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-03-15 11:50               ` Michal Hocko
@ 2016-03-16 11:16                 ` Tetsuo Handa
  2016-03-17 10:49                   ` Tetsuo Handa
  2016-03-17 12:14                   ` Michal Hocko
  0 siblings, 2 replies; 112+ messages in thread
From: Tetsuo Handa @ 2016-03-16 11:16 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm

Michal Hocko wrote:
> On Tue 15-03-16 20:15:24, Tetsuo Handa wrote:
> [...]
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 23b8b06..0464727 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -543,7 +543,7 @@ static int oom_reaper(void *unused)
> >  
> >  static void wake_oom_reaper(struct task_struct *tsk)
> >  {
> > -	if (!oom_reaper_th)
> > +	if (!oom_reaper_th || tsk->oom_reaper_list)
> >  		return;
> >  
> >  	get_task_struct(tsk);
> 
> OK, this is indeed much simpler. Back then, when the list_head was used,
> this would have been harder, and I didn't realize that moving away from
> the list_head would simplify this as well. Care to send a patch to
> replace the
> oom-oom_reaper-disable-oom_reaper-for-oom_kill_allocating_task.patch in
> the mmotm tree?

Sent as 1458124527-5441-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .

> > Two thread groups sharing the same mm can disable the OOM reaper
> > when all threads in the former thread group (which will be chosen
> > as an OOM victim by the OOM killer) can immediately call exit_mm()
> > via do_exit() (e.g. simply sleeping in killable state when the OOM
> > killer chooses that thread group) and some thread in the latter thread
> > group is contended on unkillable locks (e.g. inode mutex), due to
> >
> > 	p = find_lock_task_mm(tsk);
> > 	if (!p)
> > 		return true;
> >
> > in __oom_reap_task() and
> >
> > 	can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags);
> >
> > in oom_kill_process(). The OOM reaper is woken up in order to reap
> > the former thread group's memory, but it does nothing on the latter
> > thread group's memory because the former thread group can clear its mm
> > before the OOM reaper locks its mm. Even if subsequent out_of_memory()
> > call chose the latter thread group, the OOM reaper will not be woken up.
> > No memory is reaped. We need to queue all thread groups sharing that
> > memory if that memory should be reaped.
>
> Why wouldn't it be enough to wake the oom reaper only for the oom
> victims? If the oom reaper races with the victim's exit path then
> the next round of out_of_memory() will select a different thread
> sharing the same mm.

If we use MMF_OOM_KILLED, we can hit the sequence below.

(1) first round of out_of_memory() sends SIGKILL to both p0 and p1,
    sets TIF_MEMDIE on p0, and sets MMF_OOM_KILLED on p0's mm.

(2) p0 releases its mm.

(3) oom_reap_task() does nothing on p1 because p0 released its mm.

(4) second round of out_of_memory() sends SIGKILL to p1 and sets
    TIF_MEMDIE on p1.

(5) oom_reap_task() is not called because p1's mm already has MMF_OOM_KILLED
    (see the sketch after this list).

(6) oom_reap_task() won't clear TIF_MEMDIE on p1 even if p1 gets stuck on
    an unkillable lock.
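
For illustration, step (5) comes down to the following with the
test_and_set_bit() line quoted above (a sketch of the second round, not
new code):

	/* round 2: p1 is selected, but p0's kill already set the mm-wide flag */
	can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags); /* false now */
	/* ... SIGKILL and TIF_MEMDIE are set on p1 ... */
	if (can_oom_reap)
		wake_oom_reaper(victim);	/* skipped, so p1 is never reaped */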

If we get rid of MMF_OOM_KILLED, oom_reap_task() will be called upon the
next round of out_of_memory(), unless that round hits the shortcuts.

>
> And just to prevent confusion: I mean waking up also when
> fatal_signal_pending is true and we do not really go down to selecting
> an oom victim. That would be worth a separate patch on top, of course.

I couldn't understand this part. The shortcut

        if (current->mm &&
            (fatal_signal_pending(current) || task_will_free_mem(current))) {
                mark_oom_victim(current);
                return true;
        }

is not used for !__GFP_FS && !__GFP_NOFAIL allocation requests. I think
we might go down to selecting an oom victim via out_of_memory() calls
from not-yet-killed processes.



Anyway, why do we want to let out_of_memory() try to choose p1 without
giving oom_reap_task() a chance to reap p1's mm?

Here is a pathological situation:

  There are 100 thread groups (p[0] to p[99]) sharing the same memory.
  They all have p->signal->oom_score_adj == OOM_SCORE_ADJ_MAX.
  Each p[n+1] is waiting for p[n] to call mutex_unlock() in

      mutex_lock(&lock);
      ptr = kmalloc(sizeof(*ptr), GFP_NOFS);
      if (ptr)
          do_something(ptr);
      mutex_unlock(&lock);

  sequence, and p[0] is doing kmalloc() before mutex_unlock().

  First round of out_of_memory() selects p[0].
  p[0..99] get SIGKILL and p[0] gets TIF_MEMDIE.
  p[0] releases the unkillable lock, terminates and does p[0]->mm = NULL.
  oom_reap_task() does nothing due to p[0]->mm == NULL.
  Next round of out_of_memory() selects p[1].
  p[1..99] get SIGKILL and p[1] gets TIF_MEMDIE.
  p[1] releases the unkillable lock, terminates and does p[1]->mm = NULL.
  oom_reap_task() does nothing due to p[1]->mm == NULL.
  Next round of out_of_memory() selects p[2].
  p[2..99] get SIGKILL and p[2] gets TIF_MEMDIE.
  p[2] releases the unkillable lock, terminates and does p[2]->mm = NULL.
  oom_reap_task() does nothing due to p[2]->mm == NULL.
  (...snipped...)
  Next round of out_of_memory() selects p[99].
  p[99] gets SIGKILL and p[99] gets TIF_MEMDIE.
  p[99] releases the unkillable lock, terminates and does p[99]->mm = NULL.
  oom_reap_task() does nothing due to p[99]->mm == NULL.
  p's memory is released, and out_of_memory() will no longer be called
  if the OOM situation has been resolved.

If we give p[0..99] a chance to terminate as soon as possible using
TIF_MEMDIE when the first round of out_of_memory() selects p[0], we can
save a lot of out_of_memory() warning messages. I think that setting
TIF_MEMDIE on 100 threads does not increase the possibility of depleting
the memory reserves, because TIF_MEMDIE helps only while a thread is
doing a memory allocation. If 100 threads were concurrently doing __GFP_FS
or __GFP_NOFAIL memory allocations, they would get TIF_MEMDIE anyway.
(We could deplete the memory reserves less than we do now if
__GFP_KILLABLE were available; see the sketch after this paragraph.)
Also, by queuing p[0..99] when p[0] is selected, oom_reap_task() will
likely be able to reap p's memory before 100 failing oom_reap_task()
attempts complete.
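
As for the __GFP_KILLABLE aside, the intended usage would presumably look
like the following in the mutex example above. A sketch only, since no
such flag exists:

	mutex_lock(&lock);
	/* hypothetical: fail the allocation instead of looping once killed */
	ptr = kmalloc(sizeof(*ptr), GFP_NOFS | __GFP_KILLABLE);
	if (!ptr) {
		mutex_unlock(&lock);
		return -EINTR;	/* unwind without touching the memory reserves */
	}
	do_something(ptr);
	mutex_unlock(&lock);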

Therefore, I think we can do something like the patch below (a partial
revert of "oom: clear TIF_MEMDIE after oom_reaper managed to unmap the
address space", with sysctl_oom_kill_allocating_task / oom_task_origin()
/ SysRq-f fixes included).

----------
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 45993b8..03e6257 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -91,7 +91,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 
 extern bool out_of_memory(struct oom_control *oc);
 
-extern void exit_oom_victim(struct task_struct *tsk);
+extern void exit_oom_victim(void);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8fb187f..99b60ca 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -786,6 +786,18 @@ struct signal_struct {
 	short oom_score_adj;		/* OOM kill score adjustment */
 	short oom_score_adj_min;	/* OOM kill score adjustment min value.
 					 * Only settable by CAP_SYS_RESOURCE. */
+	/*
+	 * OOM-killer should ignore this process when selecting candidates
+	 * because this process was already OOM-killed. OOM-killer can wait
+	 * for existing TIF_MEMDIE threads unless SysRq-f is requested.
+	 */
+	bool oom_killed;
+	/*
+	 * OOM-killer should not wait for TIF_MEMDIE thread when selecting
+	 * candidates because OOM-reaper already reaped memory used by this
+	 * process.
+	 */
+	bool oom_reap_done;
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
diff --git a/kernel/exit.c b/kernel/exit.c
index fd90195..953d1a1 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -435,7 +435,7 @@ static void exit_mm(struct task_struct *tsk)
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim(tsk);
+		exit_oom_victim();
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 31bbdc6..3261d4a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -149,6 +149,12 @@ static bool oom_unkillable_task(struct task_struct *p,
 	if (!has_intersects_mems_allowed(p, nodemask))
 		return true;
 
+#ifdef CONFIG_MMU
+	/* See __oom_reap_task(). */
+	if (p->signal->oom_reap_done)
+		return true;
+#endif
+
 	return false;
 }
 
@@ -167,7 +173,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	long points;
 	long adj;
 
-	if (oom_unkillable_task(p, memcg, nodemask))
+	if (oom_unkillable_task(p, memcg, nodemask) || p->signal->oom_killed)
 		return 0;
 
 	p = find_lock_task_mm(p);
@@ -487,14 +493,11 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	up_read(&mm->mmap_sem);
 
 	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore. OOM killer can continue
-	 * by selecting other victim if unmapping hasn't led to any
-	 * improvements. This also means that selecting this task doesn't
-	 * make any sense.
+	 * This task shouldn't be sitting on a reasonably reclaimable memory
+	 * anymore. OOM killer should consider selecting other victim if
+	 * unmapping hasn't led to any improvements.
 	 */
-	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
-	exit_oom_victim(tsk);
+	tsk->signal->oom_reap_done = true;
 out:
 	mmput(mm);
 	return ret;
@@ -521,8 +524,6 @@ static void oom_reap_task(struct task_struct *tsk)
 
 static int oom_reaper(void *unused)
 {
-	set_freezable();
-
 	while (true) {
 		struct task_struct *tsk = NULL;
 
@@ -598,10 +599,9 @@ void mark_oom_victim(struct task_struct *tsk)
 /**
  * exit_oom_victim - note the exit of an OOM victim
  */
-void exit_oom_victim(struct task_struct *tsk)
+void exit_oom_victim(void)
 {
-	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
-		return;
+	clear_thread_flag(TIF_MEMDIE);
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
@@ -662,6 +662,142 @@ static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 	return false;
 }
 
+#ifdef CONFIG_MMU
+/*
+ * We cannot use oom_reaper if the memory is shared by OOM-unkillable process
+ * because OOM-unkillable process wouldn't get killed and so the memory might
+ * be still used.
+ */
+static bool oom_can_reap_mm(struct mm_struct *mm)
+{
+	bool can_oom_reap = true;
+	struct task_struct *p;
+
+	rcu_read_lock();
+	for_each_process(p) {
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+			can_oom_reap = false;
+			break;
+		}
+	}
+	rcu_read_unlock();
+	return can_oom_reap;
+}
+#else
+static bool oom_can_reap_mm(struct mm_struct *mm)
+{
+	return false;
+}
+#endif
+
+/*
+ * Kill all OOM-killable user threads sharing this victim->mm.
+ *
+ * This mitigates mm->mmap_sem livelock caused by one thread being
+ * unable to release its mm due to being blocked at
+ * down_read(&mm->mmap_sem) in exit_mm() while another thread is doing
+ * __GFP_FS allocation with mm->mmap_sem held for write.
+ *
+ * They all get access to memory reserves.
+ *
+ * This prevents the OOM killer from choosing next OOM victim as soon
+ * as current victim thread released its mm. This also mitigates kernel
+ * log buffer being spammed by OOM killer messages due to choosing next
+ * OOM victim thread sharing the current OOM victim's memory.
+ *
+ * This mitigates the problem that a thread doing __GFP_FS allocation
+ * with mmap_sem held for write cannot call out_of_memory() for
+ * unpredictable duration due to oom_lock contention and/or scheduling
+ * priority, for the OOM reaper will not wait forever until such thread
+ * leaves memory allocating loop by calling out_of_memory(). This also
+ * mitigates the problem that a thread doing !__GFP_FS && !__GFP_NOFAIL
+ * allocation cannot leave memory allocating loop because it cannot
+ * call out_of_memory() even after it is killed.
+ *
+ * They all get excluded from OOM victim selection.
+ *
+ * This mitigates the problem that SysRq-f continues choosing the same
+ * process. We still need SysRq-f because it is possible that victim
+ * threads are blocked at unkillable locks inside memory allocating
+ * loop (e.g. fs writeback from direct reclaim) even after they got
+ * SIGKILL and TIF_MEMDIE.
+ */
+static void oom_kill_mm_users(struct mm_struct *mm)
+{
+	const bool can_oom_reap = oom_can_reap_mm(mm);
+	struct task_struct *p;
+	struct task_struct *t;
+
+	rcu_read_lock();
+	for_each_process(p) {
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+			continue;
+		/*
+		 * We should send SIGKILL before setting TIF_MEMDIE in order to
+		 * prevent the OOM victim from depleting the memory reserves
+		 * from the user space under its control.
+		 */
+		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+		p->signal->oom_killed = true;
+		for_each_thread(p, t) {
+			task_lock(t);
+			if (t->mm)
+				mark_oom_victim(t);
+			task_unlock(t);
+		}
+		if (can_oom_reap)
+			wake_oom_reaper(p);
+	}
+	rcu_read_unlock();
+}
+
+/*
+ * If any of p's children has a different mm and is eligible for kill,
+ * the one with the highest oom_badness() score is sacrificed for its
+ * parent. This attempts to lose the minimal amount of work done while
+ * still freeing memory.
+ */
+static struct task_struct *oom_scan_children(struct task_struct *victim,
+					     struct oom_control *oc,
+					     unsigned long totalpages,
+					     struct mem_cgroup *memcg)
+{
+	struct task_struct *p = victim;
+	struct mm_struct *mm = p->mm;
+	struct task_struct *t;
+	struct task_struct *child;
+	unsigned int victim_points = 0;
+
+	read_lock(&tasklist_lock);
+	for_each_thread(p, t) {
+		list_for_each_entry(child, &t->children, sibling) {
+			unsigned int child_points;
+
+			if (process_shares_mm(child, mm))
+				continue;
+			/*
+			 * oom_badness() returns 0 if the thread is unkillable
+			 */
+			child_points = oom_badness(child, memcg, oc->nodemask,
+						   totalpages);
+			if (child_points > victim_points) {
+				put_task_struct(victim);
+				victim = child;
+				victim_points = child_points;
+				get_task_struct(victim);
+			}
+		}
+	}
+	read_unlock(&tasklist_lock);
+	return victim;
+}
+
 /*
  * Must be called while holding a reference to p, which will be released upon
  * returning.
@@ -671,13 +807,16 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 		      struct mem_cgroup *memcg, const char *message)
 {
 	struct task_struct *victim = p;
-	struct task_struct *child;
-	struct task_struct *t;
 	struct mm_struct *mm;
-	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
+	const bool kill_current = !p;
+
+	if (kill_current) {
+		p = current;
+		victim = p;
+		get_task_struct(p);
+	}
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -698,33 +837,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
 		message, task_pid_nr(p), p->comm, points);
 
-	/*
-	 * If any of p's children has a different mm and is eligible for kill,
-	 * the one with the highest oom_badness() score is sacrificed for its
-	 * parent.  This attempts to lose the minimal amount of work done while
-	 * still freeing memory.
-	 */
-	read_lock(&tasklist_lock);
-	for_each_thread(p, t) {
-		list_for_each_entry(child, &t->children, sibling) {
-			unsigned int child_points;
-
-			if (process_shares_mm(child, p->mm))
-				continue;
-			/*
-			 * oom_badness() returns 0 if the thread is unkillable
-			 */
-			child_points = oom_badness(child, memcg, oc->nodemask,
-								totalpages);
-			if (child_points > victim_points) {
-				put_task_struct(victim);
-				victim = child;
-				victim_points = child_points;
-				get_task_struct(victim);
-			}
-		}
-	}
-	read_unlock(&tasklist_lock);
+	if (!kill_current && !oom_task_origin(victim))
+		victim = oom_scan_children(victim, oc, totalpages, memcg);
 
 	p = find_lock_task_mm(victim);
 	if (!p) {
@@ -739,54 +853,17 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	/* Get a reference to safely compare mm after task_unlock(victim) */
 	mm = victim->mm;
 	atomic_inc(&mm->mm_count);
-	/*
-	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
-	 * the OOM victim from depleting the memory reserves from the user
-	 * space under its control.
-	 */
-	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
-	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
 		K(get_mm_counter(victim->mm, MM_FILEPAGES)),
 		K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
 	task_unlock(victim);
+	put_task_struct(victim);
 
-	/*
-	 * Kill all user processes sharing victim->mm in other thread groups, if
-	 * any.  They don't get access to memory reserves, though, to avoid
-	 * depletion of all memory.  This prevents mm->mmap_sem livelock when an
-	 * oom killed thread cannot exit because it requires the semaphore and
-	 * its contended by another thread trying to allocate memory itself.
-	 * That thread will now get access to memory reserves since it has a
-	 * pending fatal signal.
-	 */
-	rcu_read_lock();
-	for_each_process(p) {
-		if (!process_shares_mm(p, mm))
-			continue;
-		if (same_thread_group(p, victim))
-			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-			/*
-			 * We cannot use oom_reaper for the mm shared by this
-			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
-			 */
-			can_oom_reap = false;
-			continue;
-		}
-		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
-	}
-	rcu_read_unlock();
-
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
+	oom_kill_mm_users(mm);
 
 	mmdrop(mm);
-	put_task_struct(victim);
 }
 #undef K
 
@@ -880,8 +957,7 @@ bool out_of_memory(struct oom_control *oc)
 	if (sysctl_oom_kill_allocating_task && current->mm &&
 	    !oom_unkillable_task(current, NULL, oc->nodemask) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
-		get_task_struct(current);
-		oom_kill_process(oc, current, 0, totalpages, NULL,
+		oom_kill_process(oc, NULL, 0, totalpages, NULL,
 				 "Out of memory (oom_kill_allocating_task)");
 		return true;
 	}
----------

If we can tolerate lacking the process name and pid when reporting
success/failure (or we can pass them via mm_struct, or walk the process
list, or whatever else), I think we can do something like the patch below
(mostly a revert of "oom: clear TIF_MEMDIE after oom_reaper managed to
unmap the address space").

----------
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 944b2b3..a52ada1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -509,6 +509,14 @@ struct mm_struct {
 #ifdef CONFIG_HUGETLB_PAGE
 	atomic_long_t hugetlb_usage;
 #endif
+#ifdef CONFIG_MMU
+	/*
+	 * If this field refers to itself, OOM-killer should not wait for
+	 * TIF_MEMDIE threads using this mm_struct when selecting candidates,
+	 * for OOM-reaper already reaped the memory used by this mm_struct.
+	 */
+	struct mm_struct *oom_reaper_list;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9f30cae..d13163a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -792,12 +792,6 @@ struct signal_struct {
 	 * for existing TIF_MEMDIE threads unless SysRq-f is requested.
 	 */
 	bool oom_killed;
-	/*
-	 * OOM-killer should not wait for TIF_MEMDIE thread when selecting
-	 * candidates because OOM-reaper already reaped memory used by this
-	 * process.
-	 */
-	bool oom_reap_done;
 
 	struct mutex cred_guard_mutex;	/* guard against foreign influences on
 					 * credential calculations
@@ -1861,9 +1855,6 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
-#ifdef CONFIG_MMU
-	struct task_struct *oom_reaper_list;
-#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3261d4a..2199c71 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -150,9 +150,15 @@ static bool oom_unkillable_task(struct task_struct *p,
 		return true;
 
 #ifdef CONFIG_MMU
-	/* See __oom_reap_task(). */
-	if (p->signal->oom_reap_done)
-		return true;
+	/* p's memory might be already reclaimed by the OOM reaper. */
+	p = find_lock_task_mm(p);
+	if (p) {
+		struct mm_struct *mm = p->mm;
+		const bool unkillable = (mm->oom_reaper_list == mm);
+
+		task_unlock(p);
+		return unkillable;
+	}
 #endif
 
 	return false;
@@ -425,37 +431,21 @@ bool oom_killer_disabled __read_mostly;
  */
 static struct task_struct *oom_reaper_th;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
+static struct mm_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
 
-static bool __oom_reap_task(struct task_struct *tsk)
+static bool __oom_reap_vmas(struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	struct mm_struct *mm;
-	struct task_struct *p;
 	struct zap_details details = {.check_swap_entries = true,
 				      .ignore_dirty = true};
 	bool ret = true;
 
-	/*
-	 * Make sure we find the associated mm_struct even when the particular
-	 * thread has already terminated and cleared its mm.
-	 * We might have race with exit path so consider our work done if there
-	 * is no mm.
-	 */
-	p = find_lock_task_mm(tsk);
-	if (!p)
-		return true;
-
-	mm = p->mm;
-	if (!atomic_inc_not_zero(&mm->mm_users)) {
-		task_unlock(p);
+	/* We might have raced with exit path */
+	if (!atomic_inc_not_zero(&mm->mm_users))
 		return true;
-	}
-
-	task_unlock(p);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
@@ -485,73 +475,75 @@ static bool __oom_reap_task(struct task_struct *tsk)
 		}
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
-	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
-			task_pid_nr(tsk), tsk->comm,
-			K(get_mm_counter(mm, MM_ANONPAGES)),
-			K(get_mm_counter(mm, MM_FILEPAGES)),
-			K(get_mm_counter(mm, MM_SHMEMPAGES)));
+	pr_info("oom_reaper: now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
+		K(get_mm_counter(mm, MM_ANONPAGES)),
+		K(get_mm_counter(mm, MM_FILEPAGES)),
+		K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	up_read(&mm->mmap_sem);
 
 	/*
-	 * This task shouldn't be sitting on a reasonably reclaimable memory
-	 * anymore. OOM killer should consider selecting other victim if
+	 * This mm no longer has reasonably reclaimable memory.
+	 * OOM killer should consider selecting other victim if
 	 * unmapping hasn't led to any improvements.
 	 */
-	tsk->signal->oom_reap_done = true;
+	mm->oom_reaper_list = mm;
 out:
 	mmput(mm);
 	return ret;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
-static void oom_reap_task(struct task_struct *tsk)
+static void oom_reap_vmas(struct mm_struct *mm)
 {
 	int attempts = 0;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
+	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_vmas(mm))
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts > MAX_OOM_REAP_RETRIES) {
-		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-				task_pid_nr(tsk), tsk->comm);
+		pr_info("oom_reaper: unable to reap memory\n");
 		debug_show_all_locks();
 	}
 
 	/* Drop a reference taken by wake_oom_reaper */
-	put_task_struct(tsk);
+	mmdrop(mm);
 }
 
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct task_struct *tsk = NULL;
+		struct mm_struct *mm = NULL;
 
 		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
 		spin_lock(&oom_reaper_lock);
 		if (oom_reaper_list != NULL) {
-			tsk = oom_reaper_list;
-			oom_reaper_list = tsk->oom_reaper_list;
+			mm = oom_reaper_list;
+			oom_reaper_list = mm->oom_reaper_list;
 		}
 		spin_unlock(&oom_reaper_lock);
 
-		if (tsk)
-			oom_reap_task(tsk);
+		if (mm)
+			oom_reap_vmas(mm);
 	}
 
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+static void wake_oom_reaper(struct mm_struct *mm)
 {
-	if (!oom_reaper_th || tsk->oom_reaper_list)
+	if (!oom_reaper_th || mm->oom_reaper_list)
 		return;
 
-	get_task_struct(tsk);
+	/*
+	 * Pin the given mm. Use mm_count instead of mm_users because
+	 * we do not want to delay the address space tear down.
+	 */
+	atomic_inc(&mm->mm_count);
 
 	spin_lock(&oom_reaper_lock);
-	tsk->oom_reaper_list = oom_reaper_list;
-	oom_reaper_list = tsk;
+	mm->oom_reaper_list = oom_reaper_list;
+	oom_reaper_list = mm;
 	spin_unlock(&oom_reaper_lock);
 	wake_up(&oom_reaper_wait);
 }
@@ -568,7 +560,7 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static void wake_oom_reaper(struct task_struct *tsk)
+static void wake_oom_reaper(struct mm_struct *mm)
 {
 }
 #endif
@@ -662,37 +654,6 @@ static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 	return false;
 }
 
-#ifdef CONFIG_MMU
-/*
- * We cannot use oom_reaper if the memory is shared by OOM-unkillable process
- * because OOM-unkillable process wouldn't get killed and so the memory might
- * be still used.
- */
-static bool oom_can_reap_mm(struct mm_struct *mm)
-{
-	bool can_oom_reap = true;
-	struct task_struct *p;
-
-	rcu_read_lock();
-	for_each_process(p) {
-		if (!process_shares_mm(p, mm))
-			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-			can_oom_reap = false;
-			break;
-		}
-	}
-	rcu_read_unlock();
-	return can_oom_reap;
-}
-#else
-static bool oom_can_reap_mm(struct mm_struct *mm)
-{
-	return false;
-}
-#endif
-
 /*
  * Kill all OOM-killable user threads sharing this victim->mm.
  *
@@ -727,7 +688,7 @@ static bool oom_can_reap_mm(struct mm_struct *mm)
  */
 static void oom_kill_mm_users(struct mm_struct *mm)
 {
-	const bool can_oom_reap = oom_can_reap_mm(mm);
+	bool can_oom_reap = true;
 	struct task_struct *p;
 	struct task_struct *t;
 
@@ -736,8 +697,16 @@ static void oom_kill_mm_users(struct mm_struct *mm)
 		if (!process_shares_mm(p, mm))
 			continue;
 		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+			/*
+			 * We cannot use oom_reaper if the memory is shared by
+			 * OOM-unkillable process because OOM-unkillable
+			 * process wouldn't get killed and so the memory might
+			 * be still used.
+			 */
+			can_oom_reap = false;
 			continue;
+		}
 		/*
 		 * We should send SIGKILL before setting TIF_MEMDIE in order to
 		 * prevent the OOM victim from depleting the memory reserves
@@ -751,10 +720,10 @@ static void oom_kill_mm_users(struct mm_struct *mm)
 				mark_oom_victim(t);
 			task_unlock(t);
 		}
-		if (can_oom_reap)
-			wake_oom_reaper(p);
 	}
 	rcu_read_unlock();
+	if (can_oom_reap)
+		wake_oom_reaper(mm);
 }
 
 /*
----------

Well, I wanted to split oom_kill_process() into several small
functions before we started applying the OOM reaper patches...


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-03-16 11:16                 ` Tetsuo Handa
@ 2016-03-17 10:49                   ` Tetsuo Handa
  2016-03-17 12:17                     ` Michal Hocko
  2016-03-17 12:14                   ` Michal Hocko
  1 sibling, 1 reply; 112+ messages in thread
From: Tetsuo Handa @ 2016-03-17 10:49 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm

Tetsuo Handa wrote:
> If we can tolerate lack of process name and its pid when reporting
> success/failure (or we pass them via mm_struct or walk the process list or
> whatever else), I think we can do something like below patch (most revert of
> "oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space").
>
>  	if (attempts > MAX_OOM_REAP_RETRIES) {
> -		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
> -				task_pid_nr(tsk), tsk->comm);
> +		pr_info("oom_reaper: unable to reap memory\n");
>  		debug_show_all_locks();
>  	}
>

Since the possible causes of oom_reap_vmas() failing to reap memory are
limited to

  Somebody was waiting at down_write(&mm->mmap_sem)
  (where converting to down_write_killable(&mm->mmap_sem) helps;
  see the sketch after this list).

or

  Somebody was waiting on an unkillable lock between
  down_write(&mm->mmap_sem) and up_write(&mm->mmap_sem)
  (where we will need to convert such locks to killable ones).

or

  Somebody was doing a !__GFP_FS && !__GFP_NOFAIL allocation between
  down_write(&mm->mmap_sem) and up_write(&mm->mmap_sem), and was unable
  to call out_of_memory() in order to acquire TIF_MEMDIE (where setting
  TIF_MEMDIE on all threads using that mm in oom_kill_process() helps).

or

  Somebody was doing a __GFP_FS || __GFP_NOFAIL allocation between
  down_write(&mm->mmap_sem) and up_write(&mm->mmap_sem), but was unable
  to call out_of_memory() for more than one second due to oom_lock
  contention and/or scheduling priority (where setting TIF_MEMDIE
  on all threads using that mm in oom_kill_process() helps).

, we want to check the traces of the threads using that mm rather than
the locks held by all threads. In addition to that, CONFIG_PROVE_LOCKING
is not enabled on most production systems.
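
For the first cause, the down_write_killable() conversion mentioned in the
list would look roughly like this at each write-lock site (a sketch against
the then-proposed primitive):

	if (down_write_killable(&mm->mmap_sem)) {
		/*
		 * Got SIGKILL while queued on mmap_sem: back out so that
		 * the OOM reaper can take the lock and reap this mm.
		 */
		return -EINTR;
	}
	/* ... modify the address space ... */
	up_write(&mm->mmap_sem);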

I think the patch below is more helpful than debug_show_all_locks().
(Though the kmallocwd patch will report the "unable to reap mm" case,
the "unable to leave too_many_isolated() loop" case, and any other
not-yet-identified cases which stall memory allocation.)

----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 2199c71..affbb79 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -502,8 +502,26 @@ static void oom_reap_vmas(struct mm_struct *mm)
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts > MAX_OOM_REAP_RETRIES) {
+		struct task_struct *p;
+		struct task_struct *t;
+
 		pr_info("oom_reaper: unable to reap memory\n");
-		debug_show_all_locks();
+		rcu_read_lock();
+		for_each_process_thread(p, t) {
+			if (likely(t->mm != mm))
+				continue;
+			pr_info("oom_reaper: %s(%u) flags=0x%x%s%s%s%s\n",
+				t->comm, t->pid, t->flags,
+				(t->state & TASK_UNINTERRUPTIBLE) ?
+				" uninterruptible" : "",
+				(t->flags & PF_EXITING) ? " exiting" : "",
+				fatal_signal_pending(t) ? " dying" : "",
+				test_tsk_thread_flag(t, TIF_MEMDIE) ?
+				" victim" : "");
+			sched_show_task(t);
+			debug_show_held_locks(t);
+		}
+		rcu_read_unlock();
 	}
 
 	/* Drop a reference taken by wake_oom_reaper */
----------

Well, I think we can define CONFIG_OOM_REAPER, which defaults to y
and depends on CONFIG_MMU, rather than scattering CONFIG_MMU checks
around. That will help catch build failures in the CONFIG_MMU=n case...
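
For illustration, such an entry could look like this (a hypothetical
Kconfig sketch, not a posted patch):

	# mm/Kconfig (hypothetical)
	config OOM_REAPER
		def_bool y
		depends on MMU

with the #ifdef CONFIG_MMU blocks in mm/oom_kill.c switched to
#ifdef CONFIG_OOM_REAPER, so that a CONFIG_MMU=n build fails immediately
if a reaper-only symbol leaks outside the guarded region.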


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-03-16 11:16                 ` Tetsuo Handa
  2016-03-17 10:49                   ` Tetsuo Handa
@ 2016-03-17 12:14                   ` Michal Hocko
  1 sibling, 0 replies; 112+ messages in thread
From: Michal Hocko @ 2016-03-17 12:14 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Wed 16-03-16 20:16:47, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > And just to prevent confusion: I mean waking up also when
> > fatal_signal_pending is true and we do not really go down to selecting
> > an oom victim. That would be worth a separate patch on top, of course.
> 
> I couldn't understand this part. The shortcut
> 
>         if (current->mm &&
>             (fatal_signal_pending(current) || task_will_free_mem(current))) {
>                 mark_oom_victim(current);
>                 return true;
>         }
> 
> is not used for !__GFP_FS && !__GFP_NOFAIL allocation requests. I think
> we might go down to selecting an oom victim via out_of_memory() calls
> from not-yet-killed processes.

I meant something like the following. It would need some more tweaks
of course but here is the idea at least.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 23b8b06152be..09e54bc0976c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -686,6 +686,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	task_lock(p);
 	if (p->mm && task_will_free_mem(p)) {
 		mark_oom_victim(p);
+		wake_oom_reaper(p);
 		task_unlock(p);
 		put_task_struct(p);
 		return;
@@ -869,10 +870,22 @@ bool out_of_memory(struct oom_control *oc)
 	if (current->mm &&
 	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
 		mark_oom_victim(current);
+		wake_oom_reaper(current);
 		return true;
 	}
 
 	/*
+	 * XXX: This is a weak reclaim context when FS metadata couldn't be
+	 * reclaimed and so triggering the OOM killer could be really
+	 * premature at this point. Traditionally we have been looping in
+	 * the page allocator and hoping for somebody else to make forward
+	 * progress for us. It would be better to simply fail those requests
+	 * but we are not yet there so keep the tradition
+	 */
+	if (!(gfp_mask & __GFP_FS))
+		return true;
+
+	/*
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d4d574dd0408..01121a89eb52 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2854,20 +2854,11 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		/* The OOM killer does not needlessly kill tasks for lowmem */
 		if (ac->high_zoneidx < ZONE_NORMAL)
 			goto out;
-		/* The OOM killer does not compensate for IO-less reclaim */
-		if (!(gfp_mask & __GFP_FS)) {
-			/*
-			 * XXX: Page reclaim didn't yield anything,
-			 * and the OOM killer can't be invoked, but
-			 * keep looping as per tradition.
-			 *
-			 * But do not keep looping if oom_killer_disable()
-			 * was already called, for the system is trying to
-			 * enter a quiescent state during suspend.
-			 */
-			*did_some_progress = !oom_killer_disabled;
-			goto out;
-		}
+		/*
+		 * TODO once we are able to cope with GFP_NOFS allocation
+		 * failures more gracefully just return and fail the allocation
+		 * rather than trigger OOM
+		 */
 		if (pm_suspended_storage())
 			goto out;
 		/* The OOM killer may not free memory on a specific node */

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-03-17 10:49                   ` Tetsuo Handa
@ 2016-03-17 12:17                     ` Michal Hocko
  2016-03-17 13:00                       ` Tetsuo Handa
  0 siblings, 1 reply; 112+ messages in thread
From: Michal Hocko @ 2016-03-17 12:17 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Thu 17-03-16 19:49:01, Tetsuo Handa wrote:
[...]
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 2199c71..affbb79 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -502,8 +502,26 @@ static void oom_reap_vmas(struct mm_struct *mm)
>  		schedule_timeout_idle(HZ/10);
>  
>  	if (attempts > MAX_OOM_REAP_RETRIES) {
> +		struct task_struct *p;
> +		struct task_struct *t;
> +
>  		pr_info("oom_reaper: unable to reap memory\n");
> -		debug_show_all_locks();
> +		rcu_read_lock();
> +		for_each_process_thread(p, t) {
> +			if (likely(t->mm != mm))
> +				continue;
> +			pr_info("oom_reaper: %s(%u) flags=0x%x%s%s%s%s\n",
> +				t->comm, t->pid, t->flags,
> +				(t->state & TASK_UNINTERRUPTIBLE) ?
> +				" uninterruptible" : "",
> +				(t->flags & PF_EXITING) ? " exiting" : "",
> +				fatal_signal_pending(t) ? " dying" : "",
> +				test_tsk_thread_flag(t, TIF_MEMDIE) ?
> +				" victim" : "");
> +			sched_show_task(t);
> +			debug_show_held_locks(t);
> +		}
> +		rcu_read_unlock();

Isn't this way too much work for a single RCU lock? Also, wouldn't it
generate way too much output in pathological situations and so hide
other, potentially more important, log messages?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-03-17 12:17                     ` Michal Hocko
@ 2016-03-17 13:00                       ` Tetsuo Handa
  2016-03-17 13:23                         ` Michal Hocko
  0 siblings, 1 reply; 112+ messages in thread
From: Tetsuo Handa @ 2016-03-17 13:00 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm

Michal Hocko wrote:
> On Thu 17-03-16 19:49:01, Tetsuo Handa wrote:
> [...]
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 2199c71..affbb79 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -502,8 +502,26 @@ static void oom_reap_vmas(struct mm_struct *mm)
> >  		schedule_timeout_idle(HZ/10);
> >  
> >  	if (attempts > MAX_OOM_REAP_RETRIES) {
> > +		struct task_struct *p;
> > +		struct task_struct *t;
> > +
> >  		pr_info("oom_reaper: unable to reap memory\n");
> > -		debug_show_all_locks();
> > +		rcu_read_lock();
> > +		for_each_process_thread(p, t) {
> > +			if (likely(t->mm != mm))
> > +				continue;
> > +			pr_info("oom_reaper: %s(%u) flags=0x%x%s%s%s%s\n",
> > +				t->comm, t->pid, t->flags,
> > +				(t->state & TASK_UNINTERRUPTIBLE) ?
> > +				" uninterruptible" : "",
> > +				(t->flags & PF_EXITING) ? " exiting" : "",
> > +				fatal_signal_pending(t) ? " dying" : "",
> > +				test_tsk_thread_flag(t, TIF_MEMDIE) ?
> > +				" victim" : "");
> > +			sched_show_task(t);
> > +			debug_show_held_locks(t);
> > +		}
> > +		rcu_read_unlock();
> 
> Isn't this way too much work for a single RCU lock? Also, wouldn't it
> generate way too much output in pathological situations and so hide
> other, potentially more important, log messages?
> 
I don't think we can compare the two. It is possible that 50 out of 10000
threads' traces and locks are reported with this change, but it is also
possible that all 10000 threads' locks are reported without it.

If you worry about too much work under a single RCU section, you can do
what kmallocwd does: kmallocwd adds a marker to task_struct so that it
can reliably resume reporting.
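
For illustration, a minimal sketch of that marker technique; the
oom_report_gen field and the function are hypothetical, not the actual
kmallocwd code, and assume a new member added to task_struct:

#include <linux/sched.h>
#include <linux/rcupdate.h>

/* Bumped once per report so tasks marked in an earlier run are revisited. */
static unsigned int oom_report_gen;

static void report_in_batches(struct mm_struct *mm)
{
	struct task_struct *p, *t;
	bool more;

	oom_report_gen++;
	do {
		int budget = 16;	/* tasks dumped per RCU section */

		more = false;
		rcu_read_lock();
		for_each_process_thread(p, t) {
			if (t->mm != mm)
				continue;
			if (t->oom_report_gen == oom_report_gen)
				continue;	/* already reported */
			t->oom_report_gen = oom_report_gen;
			sched_show_task(t);
			if (!--budget) {
				more = true;	/* resume in the next pass */
				goto unlock;
			}
		}
unlock:
		rcu_read_unlock();
	} while (more);
}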

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-03-17 13:00                       ` Tetsuo Handa
@ 2016-03-17 13:23                         ` Michal Hocko
  2016-03-17 14:34                           ` Tetsuo Handa
  0 siblings, 1 reply; 112+ messages in thread
From: Michal Hocko @ 2016-03-17 13:23 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Thu 17-03-16 22:00:34, Tetsuo Handa wrote:
[...]
> If you worry about too much work under a single RCU section, you can do
> what kmallocwd does: kmallocwd adds a marker to task_struct so that it
> can reliably resume reporting.

It is you who is trying to add different debugging output, so you had
better make sure you won't swamp the user by _default_ with something
that might not be helpful after all. I would care much less if this were
hidden behind a debugging option like the current
debug_show_all_locks.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-03-17 13:23                         ` Michal Hocko
@ 2016-03-17 14:34                           ` Tetsuo Handa
  2016-03-17 14:54                             ` Michal Hocko
  0 siblings, 1 reply; 112+ messages in thread
From: Tetsuo Handa @ 2016-03-17 14:34 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm

Michal Hocko wrote:
> On Thu 17-03-16 22:00:34, Tetsuo Handa wrote:
> [...]
> > If you worry about too much work under a single RCU section, you can do
> > what kmallocwd does: kmallocwd adds a marker to task_struct so that it
> > can reliably resume reporting.
> 
> It is you who is trying to add different debugging output, so you had
> better make sure you won't swamp the user by _default_ with something
> that might not be helpful after all. I would care much less if this were
> hidden behind a debugging option like the current
> debug_show_all_locks.

Then, we can do something like this.

----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index affbb79..76b5c67 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -502,26 +502,20 @@ static void oom_reap_vmas(struct mm_struct *mm)
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts > MAX_OOM_REAP_RETRIES) {
+#ifdef CONFIG_PROVE_LOCKING
 		struct task_struct *p;
 		struct task_struct *t;
+#endif
 
 		pr_info("oom_reaper: unable to reap memory\n");
-		rcu_read_lock();
+#ifdef CONFIG_PROVE_LOCKING
+		read_lock(&tasklist_lock);
 		for_each_process_thread(p, t) {
-			if (likely(t->mm != mm))
-				continue;
-			pr_info("oom_reaper: %s(%u) flags=0x%x%s%s%s%s\n",
-				t->comm, t->pid, t->flags,
-				(t->state & TASK_UNINTERRUPTIBLE) ?
-				" uninterruptible" : "",
-				(t->flags & PF_EXITING) ? " exiting" : "",
-				fatal_signal_pending(t) ? " dying" : "",
-				test_tsk_thread_flag(t, TIF_MEMDIE) ?
-				" victim" : "");
-			sched_show_task(t);
-			debug_show_held_locks(t);
+			if (t->mm == mm && t->state != TASK_RUNNING)
+				debug_show_held_locks(t);
 		}
-		rcu_read_unlock();
+		read_unlock(&tasklist_lock);
+#endif
 	}
 
 	/* Drop a reference taken by wake_oom_reaper */
----------

Strictly speaking, neither debug_show_all_locks() nor debug_show_held_locks()
is safe enough to guarantee that the system won't crash.

  commit 856848737bd944c1 "lockdep: fix debug_show_all_locks()"
  commit 82a1fcb90287052a "softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks"

They are convenient, but we should avoid using them if we care about the
possibility of a crash.

^ permalink raw reply related	[flat|nested] 112+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-03-17 14:34                           ` Tetsuo Handa
@ 2016-03-17 14:54                             ` Michal Hocko
  2016-03-17 15:20                               ` Tetsuo Handa
  0 siblings, 1 reply; 112+ messages in thread
From: Michal Hocko @ 2016-03-17 14:54 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Thu 17-03-16 23:34:13, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 17-03-16 22:00:34, Tetsuo Handa wrote:
> > [...]
> > > If you worry about too much work under a single RCU section, you can do
> > > what kmallocwd does: kmallocwd adds a marker to task_struct so that it
> > > can reliably resume reporting.
> > 
> > It is you who is trying to add different debugging output, so you had
> > better make sure you won't swamp the user by _default_ with something
> > that might not be helpful after all. I would care much less if this were
> > hidden behind a debugging option like the current
> > debug_show_all_locks.
> 
> Then, we can do something like this.
> 
> ----------
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index affbb79..76b5c67 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -502,26 +502,20 @@ static void oom_reap_vmas(struct mm_struct *mm)
>  		schedule_timeout_idle(HZ/10);
>  
>  	if (attempts > MAX_OOM_REAP_RETRIES) {
> +#ifdef CONFIG_PROVE_LOCKING
>  		struct task_struct *p;
>  		struct task_struct *t;
> +#endif
>  
>  		pr_info("oom_reaper: unable to reap memory\n");
> -		rcu_read_lock();
> +#ifdef CONFIG_PROVE_LOCKING
> +		read_lock(&tasklist_lock);
>  		for_each_process_thread(p, t) {
> -			if (likely(t->mm != mm))
> -				continue;
> -			pr_info("oom_reaper: %s(%u) flags=0x%x%s%s%s%s\n",
> -				t->comm, t->pid, t->flags,
> -				(t->state & TASK_UNINTERRUPTIBLE) ?
> -				" uninterruptible" : "",
> -				(t->flags & PF_EXITING) ? " exiting" : "",
> -				fatal_signal_pending(t) ? " dying" : "",
> -				test_tsk_thread_flag(t, TIF_MEMDIE) ?
> -				" victim" : "");
> -			sched_show_task(t);
> -			debug_show_held_locks(t);
> +			if (t->mm == mm && t->state != TASK_RUNNING)
> +				debug_show_held_locks(t);
>  		}
> -		rcu_read_unlock();
> +		read_unlock(&tasklist_lock);
> +#endif
>  	}
>  
>  	/* Drop a reference taken by wake_oom_reaper */
> ----------

Please send a separate patch with a full description, ideally with example
output and a clarification of why you think this is an improvement over
the current situation. I do not think the patch you are replying to needs
to be changed in any way. It seems correct and provides useful
information already. If you believe you can provide something more
useful, do it in an incremental change.

Making a lot of fuss about something that doesn't point to a _real_ issue
in the patch to be merged is not particularly useful when we are in the
merge window already.

> Strictly speaking, neither debug_show_all_locks() nor debug_show_held_locks()
> is safe enough to guarantee that the system won't crash.
> 
>   commit 856848737bd944c1 "lockdep: fix debug_show_all_locks()"
>   commit 82a1fcb90287052a "softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks"
> 
> They are convenient, but we should avoid using them if we care about the
> possibility of a crash.

I really fail to see your point. debug_show_all_locks doesn't document
any such risk, nor is it restricted to a particular context. Were there
some bugs in that area? Probably yes, but so what?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 112+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-03-17 14:54                             ` Michal Hocko
@ 2016-03-17 15:20                               ` Tetsuo Handa
  0 siblings, 0 replies; 112+ messages in thread
From: Tetsuo Handa @ 2016-03-17 15:20 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm

Michal Hocko wrote:
> > Strictly speaking, neither debug_show_all_locks() nor debug_show_held_locks()
> > is safe enough to guarantee that the system won't crash.
> > 
> >   commit 856848737bd944c1 "lockdep: fix debug_show_all_locks()"
> >   commit 82a1fcb90287052a "softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks"
> > 
> > They are convenient, but we should avoid using them if we care about the
> > possibility of a crash.
> 
> I really fail to see your point. debug_show_all_locks doesn't document
> any such risk, nor is it restricted to a particular context. Were there
> some bugs in that area? Probably yes, but so what?

commit 856848737bd944c1 changed debug_show_all_locks() so that locks held
by TASK_RUNNING tasks are not reported. We might therefore fail to report
the task which is holding mmap_sem for write when we call
debug_show_all_locks() in order to find such a task. That is why I think
guessing from the sched_show_task() output can be useful.
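
To make that concrete, a sketch (an illustration, not a tested patch):
the stack dump works for every task state, while the lock dump is only
meaningful for the tasks that lockdep still reports:

#include <linux/sched.h>
#include <linux/debug_locks.h>

/*
 * Sketch: sched_show_task() works regardless of task state, so print the
 * stack unconditionally and use it to guess the mmap_sem holder; print
 * held locks only for tasks debug_show_all_locks() would not skip.
 */
static void report_suspect(struct task_struct *t)
{
	sched_show_task(t);
	if (t->state != TASK_RUNNING)
		debug_show_held_locks(t);
}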

^ permalink raw reply	[flat|nested] 112+ messages in thread

end of thread, other threads:[~2016-03-17 15:20 UTC | newest]

Thread overview: 112+ messages
2016-02-03 13:13 [PATCH 0/5] oom reaper v5 Michal Hocko
2016-02-03 13:13 ` [PATCH 1/5] mm, oom: introduce oom reaper Michal Hocko
2016-02-03 23:48   ` David Rientjes
2016-02-04  6:41     ` Michal Hocko
2016-02-06 13:22   ` Tetsuo Handa
2016-02-15 20:50     ` Michal Hocko
2016-02-03 13:13 ` [PATCH 2/5] oom reaper: handle mlocked pages Michal Hocko
2016-02-03 23:57   ` David Rientjes
2016-02-23  1:36   ` David Rientjes
2016-02-23 13:21     ` Michal Hocko
2016-02-29  3:19       ` Hugh Dickins
2016-02-29 13:41         ` Michal Hocko
2016-03-08 13:40           ` Michal Hocko
2016-03-08 20:07             ` Hugh Dickins
2016-03-09  8:26               ` Michal Hocko
2016-02-03 13:13 ` [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space Michal Hocko
2016-02-04 14:22   ` Tetsuo Handa
2016-02-04 14:43     ` Michal Hocko
2016-02-04 15:08       ` Tetsuo Handa
2016-02-04 16:31         ` Michal Hocko
2016-02-05 11:14           ` Tetsuo Handa
2016-02-06  8:30             ` Michal Hocko
2016-02-06 11:23               ` Tetsuo Handa
2016-02-15 20:47                 ` Michal Hocko
2016-02-06  6:45       ` Michal Hocko
2016-02-06 14:33         ` Tetsuo Handa
2016-02-15 20:40           ` [PATCH 3.1/5] oom: make oom_reaper freezable Michal Hocko
2016-02-25 11:28   ` [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space Tetsuo Handa
2016-02-25 11:31     ` Tetsuo Handa
2016-02-25 14:16     ` Michal Hocko
2016-02-03 13:13 ` [PATCH 4/5] mm, oom_reaper: report success/failure Michal Hocko
2016-02-03 23:10   ` David Rientjes
2016-02-04  6:46     ` Michal Hocko
2016-02-04 22:31       ` David Rientjes
2016-02-05  9:26         ` Michal Hocko
2016-02-06  6:34           ` Michal Hocko
2016-02-03 13:14 ` [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing Michal Hocko
2016-02-04 10:49   ` Tetsuo Handa
2016-02-04 14:53     ` Michal Hocko
2016-02-06  5:54       ` Tetsuo Handa
2016-02-06  8:37         ` Michal Hocko
2016-02-06 15:33           ` Tetsuo Handa
2016-02-15 20:15             ` Michal Hocko
2016-02-16 11:11               ` Tetsuo Handa
2016-02-16 15:53                 ` Michal Hocko
2016-02-17  9:48   ` [PATCH 6/5] oom, oom_reaper: disable oom_reaper for Michal Hocko
2016-02-17 10:41     ` Tetsuo Handa
2016-02-17 11:33       ` Michal Hocko
2016-02-19 18:34     ` Michal Hocko
2016-02-20  2:32       ` [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task Tetsuo Handa
2016-02-22  9:41         ` Michal Hocko
2016-02-29  1:26           ` Tetsuo Handa
2016-03-15 11:15           ` Tetsuo Handa
2016-03-15 11:43             ` Michal Hocko
2016-03-15 11:50               ` Michal Hocko
2016-03-16 11:16                 ` Tetsuo Handa
2016-03-17 10:49                   ` Tetsuo Handa
2016-03-17 12:17                     ` Michal Hocko
2016-03-17 13:00                       ` Tetsuo Handa
2016-03-17 13:23                         ` Michal Hocko
2016-03-17 14:34                           ` Tetsuo Handa
2016-03-17 14:54                             ` Michal Hocko
2016-03-17 15:20                               ` Tetsuo Handa
2016-03-17 12:14                   ` Michal Hocko
