* [RFC PATCH] mm, oom: introduce oom reaper
@ 2015-11-25 15:56 ` Michal Hocko
  0 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2015-11-25 15:56 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Linus Torvalds, Mel Gorman, David Rientjes,
	Tetsuo Handa, Rik van Riel, Hugh Dickins, Oleg Nesterov,
	Andrea Argangeli, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
independently brought up by Oleg Nesterov.

The OOM killer currently allows killing only a single task in the hope
that the task will terminate in a reasonable time and free up its
memory.  Such a task (the oom victim) gets access to memory reserves
via mark_oom_victim to allow forward progress should it need additional
memory during the exit path.

It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above: the
OOM victim might take an unbounded amount of time to exit because it
might be blocked in an uninterruptible state, waiting for an event
(e.g. a lock) held by another task looping in the page allocator.

This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped out memory
owned by the oom victim, under the assumption that such memory won't
be needed once its owner has been killed and evicted from userspace anyway.
There is one notable exception, though: if the OOM victim was in the
process of coredumping, the result would be incomplete. This is
considered a reasonable constraint because the overall system health is
more important than the debuggability of a particular application.

A kernel thread has been chosen because we need a reliable way of
invocation; a workqueue context is not appropriate because all the
workers might be busy (e.g. allocating memory). Kswapd, which sounds
like another good fit, is not appropriate either because it might get
blocked on locks during reclaim as well.

oom_reaper has to take mmap_sem on the target task for reading, so the
solution is not 100% reliable because the semaphore might be held or
blocked for write, but the probability is reduced considerably compared
to basically any lock blocking forward progress as described above. To
avoid blocking on the lock without making any forward progress we use
only a trylock and retry up to 10 times with a short sleep in between.
Users of mmap_sem which need it for write should be carefully reviewed
to use _killable waiting as much as possible, and to reduce allocation
requests done with the lock held to the absolute minimum, to reduce the
risk even further.
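
For illustration only (this is not part of the patch), such a write-side
conversion could look roughly like the sketch below, assuming a killable
variant of down_write() along the lines of down_write_killable() is
available; the function name is made up:

static int example_mmap_sem_writer(struct mm_struct *mm)
{
	/* back out on a fatal signal instead of sleeping uninterruptibly */
	if (down_write_killable(&mm->mmap_sem))
		return -EINTR;

	/* ... modify the address space, avoiding allocations where possible ... */

	up_write(&mm->mmap_sem);
	return 0;
}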

The API between the oom killer and the oom reaper is quite trivial.
wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only a
NULL->mm transition, and oom_reaper clears it atomically once it is done
with the work. This means that only a single mm_struct can be reaped at
a time. As the operation is potentially disruptive we try to limit it to
the necessary minimum, and the reaper blocks any updates while it operates
on an mm. The mm_struct is pinned by mm_count to allow a parallel exit_mmap,
and such a race is detected by atomic_inc_not_zero(mm_users).
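
As a purely illustrative user-space analogy of this handoff protocol
(C11 atomics stand in for the kernel's cmpxchg/READ_ONCE/WRITE_ONCE and
nothing below is part of the patch), the single-slot producer/consumer
looks roughly like this:

#include <stdatomic.h>
#include <stdio.h>

struct mm_struct { int dummy; };		/* stand-in for the real mm */

static _Atomic(struct mm_struct *) mm_to_reap;

/* producer (wake_oom_reaper): only a NULL -> mm transition is allowed */
static int queue_for_reaping(struct mm_struct *mm)
{
	struct mm_struct *expected = NULL;

	/* fails if another mm is already queued */
	return atomic_compare_exchange_strong(&mm_to_reap, &expected, mm);
}

/* consumer (oom_reaper): process the queued mm, then free the slot */
static void reaper_iteration(void)
{
	struct mm_struct *mm = atomic_load(&mm_to_reap);

	if (!mm)
		return;				/* nothing queued, go back to sleep */

	printf("reaping mm %p\n", (void *)mm);	/* placeholder for the real work */
	atomic_store(&mm_to_reap, NULL);	/* allow the next victim to be queued */
}

int main(void)
{
	struct mm_struct victim;

	if (queue_for_reaping(&victim))
		reaper_iteration();
	return 0;
}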

Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
this is another step toward making OOM killing more reliable. We are still
not at 100% of course, because we still depend on mmap_sem for read and the
oom victim might not be holding a lot of private anonymous memory. But I
think this is already an improvement over the current situation, without
too much additional cost/complexity. There is room for improvement
I guess, but I wanted to start as simply as possible. This has survived
my oom hammering but I am not claiming it is 100% safe. There might be
side effects I have never thought about, so this really needs a _careful_
review (it doesn't help that the changes outside of oom_kill.c are only a
few lines, right ;).

Any feedback is welcome.

 include/linux/mm.h |   2 +
 mm/internal.h      |   5 ++
 mm/memory.c        |  10 ++--
 mm/oom_kill.c      | 131 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 143 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 25cdec395f2c..d1ce03569942 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1061,6 +1061,8 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
+	bool ignore_dirty;			/* Ignore dirty pages */
+	bool check_swap_entries;		/* Check also swap entries */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/internal.h b/mm/internal.h
index 4ae7b7c7462b..9006ce1960ff 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -41,6 +41,11 @@ extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
+void unmap_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end,
+			     struct zap_details *details);
+
 static inline void set_page_count(struct page *page, int v)
 {
 	atomic_set(&page->_count, v);
diff --git a/mm/memory.c b/mm/memory.c
index f5b8e8c9f4c3..4750d7e942a3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1104,6 +1104,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			if (!PageAnon(page)) {
 				if (pte_dirty(ptent)) {
+					/* oom_reaper cannot tear down dirty pages */
+					if (unlikely(details && details->ignore_dirty))
+						continue;
 					force_flush = 1;
 					set_page_dirty(page);
 				}
@@ -1123,7 +1126,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			continue;
 		}
 		/* If details->check_mapping, we leave swap entries. */
-		if (unlikely(details))
+		if (unlikely(details || !details->check_swap_entries))
 			continue;
 
 		entry = pte_to_swp_entry(ptent);
@@ -1228,7 +1231,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 	return addr;
 }
 
-static void unmap_page_range(struct mmu_gather *tlb,
+void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
 			     struct zap_details *details)
@@ -1236,9 +1239,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
 	pgd_t *pgd;
 	unsigned long next;
 
-	if (details && !details->check_mapping)
-		details = NULL;
-
 	BUG_ON(addr >= end);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5314b206caa5..47c9f584038b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,6 +35,11 @@
 #include <linux/freezer.h>
 #include <linux/ftrace.h>
 #include <linux/ratelimit.h>
+#include <linux/kthread.h>
+#include <linux/module.h>
+
+#include <asm/tlb.h>
+#include "internal.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/oom.h>
@@ -408,6 +413,108 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
 bool oom_killer_disabled __read_mostly;
 
+/*
+ * OOM Reaper kernel thread which tries to reap the memory used by the OOM
+ * victim (if that is possible) to help the OOM killer to move on.
+ */
+static struct task_struct *oom_reaper_th;
+static struct mm_struct *mm_to_reap;
+static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
+
+static bool __oom_reap_vmas(struct mm_struct *mm)
+{
+	struct mmu_gather tlb;
+	struct vm_area_struct *vma;
+	struct zap_details details = {.check_swap_entries = true,
+				      .ignore_dirty = true};
+	bool ret = true;
+
+	/* We might have raced with exit path */
+	if (!atomic_inc_not_zero(&mm->mm_users))
+		return true;
+
+	if (!down_read_trylock(&mm->mmap_sem)) {
+		ret = false;
+		goto out;
+	}
+
+	tlb_gather_mmu(&tlb, mm, 0, -1);
+	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+		if (is_vm_hugetlb_page(vma))
+			continue;
+
+		/*
+		 * Only anonymous pages have a good chance to be dropped
+		 * without additional steps which we cannot afford as we
+		 * are OOM already.
+		 */
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
+					 &details);
+	}
+	tlb_finish_mmu(&tlb, 0, -1);
+	up_read(&mm->mmap_sem);
+out:
+	mmput(mm);
+	return ret;
+}
+
+static void oom_reap_vmas(struct mm_struct *mm)
+{
+	int attempts = 0;
+
+	while (attempts++ < 10 && !__oom_reap_vmas(mm))
+		schedule_timeout(HZ/10);
+
+	/* Drop a reference taken by wake_oom_reaper */
+	mmdrop(mm);
+}
+
+static int oom_reaper(void *unused)
+{
+	DEFINE_WAIT(wait);
+
+	while (!kthread_should_stop()) {
+		struct mm_struct *mm;
+
+		prepare_to_wait(&oom_reaper_wait, &wait, TASK_UNINTERRUPTIBLE);
+		mm = READ_ONCE(mm_to_reap);
+		if (!mm) {
+			freezable_schedule();
+			finish_wait(&oom_reaper_wait, &wait);
+		} else {
+			finish_wait(&oom_reaper_wait, &wait);
+			oom_reap_vmas(mm);
+			WRITE_ONCE(mm_to_reap, NULL);
+		}
+	}
+
+	return 0;
+}
+
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+	struct mm_struct *old_mm;
+
+	if (!oom_reaper_th)
+		return;
+
+	/*
+	 * Make sure that only a single mm is ever queued for the reaper
+	 * because multiple are not necessary and the operation might be
+	 * disruptive so better reduce it to the bare minimum.
+	 */
+	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
+	if (!old_mm) {
+		/*
+		 * Pin the given mm. Use mm_count instead of mm_users because
+		 * we do not want to delay the address space tear down.
+		 */
+		atomic_inc(&mm->mm_count);
+		wake_up(&oom_reaper_wait);
+	}
+}
+
 /**
  * mark_oom_victim - mark the given task as OOM victim
  * @tsk: task to mark
@@ -421,6 +528,11 @@ void mark_oom_victim(struct task_struct *tsk)
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
+
+	/* Kick oom reaper to help us release some memory */
+	if (tsk->mm)
+		wake_oom_reaper(tsk->mm);
+
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
@@ -767,3 +879,22 @@ void pagefault_out_of_memory(void)
 
 	mutex_unlock(&oom_lock);
 }
+
+static int __init oom_init(void)
+{
+	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+	if (IS_ERR(oom_reaper_th)) {
+		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
+				PTR_ERR(oom_reaper_th));
+	} else {
+		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+
+		/*
+		 * Make sure our oom reaper thread will get scheduled ASAP
+		 * and that it won't get preempted by malicious userspace.
+		 */
+		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
+	}
+	return 0;
+}
+module_init(oom_init)
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH] mm, oom: introduce oom reaper
  2015-11-25 15:56 ` Michal Hocko
@ 2015-11-25 20:08   ` Johannes Weiner
  -1 siblings, 0 replies; 38+ messages in thread
From: Johannes Weiner @ 2015-11-25 20:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Linus Torvalds, Mel Gorman,
	David Rientjes, Tetsuo Handa, Rik van Riel, Hugh Dickins,
	Oleg Nesterov, Andrea Argangeli, LKML, Michal Hocko

Hi Michal,

I think whatever we end up doing to smoothen things for the "common
case" (as much as OOM kills can be considered common), we need a plan
to resolve the memory deadlock situations in a finite amount of time.

Eventually we have to attempt killing another task. Or kill all of
them to save the kernel.

It just strikes me as odd to start with smoothening the common case,
rather than making it functionally correct first.

On Wed, Nov 25, 2015 at 04:56:58PM +0100, Michal Hocko wrote:
> A kernel thread has been chosen because we need a reliable way of
> invocation so workqueue context is not appropriate because all the
> workers might be busy (e.g. allocating memory). Kswapd which sounds
> like another good fit is not appropriate as well because it might get
> blocked on locks during reclaim as well.

Why not do it directly from the allocating context? I.e. when entering
the OOM killer and finding a lingering TIF_MEMDIE from a previous kill
just reap its memory directly then and there. It's not like the
allocating task has anything else to do in the meantime...

> @@ -1123,7 +1126,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  			continue;
>  		}
>  		/* If details->check_mapping, we leave swap entries. */
> -		if (unlikely(details))
> +		if (unlikely(details || !details->check_swap_entries))
>  			continue;

&&

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH] mm, oom: introduce oom reaper
  2015-11-25 20:08   ` Johannes Weiner
@ 2015-11-26 11:08     ` Michal Hocko
  -1 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2015-11-26 11:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Andrew Morton, Linus Torvalds, Mel Gorman,
	David Rientjes, Tetsuo Handa, Rik van Riel, Hugh Dickins,
	Oleg Nesterov, Andrea Argangeli, LKML

On Wed 25-11-15 15:08:06, Johannes Weiner wrote:
> Hi Michal,
> 
> I think whatever we end up doing to smoothen things for the "common
> case" (as much as OOM kills can be considered common), we need a plan
> to resolve the memory deadlock situations in a finite amount of time.
> 
> Eventually we have to attempt killing another task. Or kill all of
> them to save the kernel.
> 
> It just strikes me as odd to start with smoothening the common case,
> rather than making it functionally correct first.

I believe there is no universally correct solution for this
problem. The OOM killer is a heuristic, and a destructive one, so I think we
should limit it as much as possible. I do agree that we should allow an
administrator to define a policy when things go terribly wrong - e.g.
panic/emerg. reboot after the system has been thrashing on OOM for more than
a defined amount of time. But I think that this is orthogonal to this
patch. This patch should remove one large class of potential deadlocks
and corner cases without too much cost or maintenance burden. It doesn't
remove the need for a last resort solution though.
 
> On Wed, Nov 25, 2015 at 04:56:58PM +0100, Michal Hocko wrote:
> > A kernel thread has been chosen because we need a reliable way of
> > invocation so workqueue context is not appropriate because all the
> > workers might be busy (e.g. allocating memory). Kswapd which sounds
> > like another good fit is not appropriate as well because it might get
> > blocked on locks during reclaim as well.
> 
> Why not do it directly from the allocating context? I.e. when entering
> the OOM killer and finding a lingering TIF_MEMDIE from a previous kill
> just reap its memory directly then and there. It's not like the
> allocating task has anything else to do in the meantime...

One reason is that we have to exclude a race with exit_mmap, so we have to
increase mm_users, but we cannot mmput in this context because we might
deadlock. So we have to tear the mm down from a different context. Another
reason is that the address space of the victim might be really large and
reaping it on behalf of one (random) task might be really unfair
wrt. the others. Doing that from a kernel thread sounds like an easy and
relatively cheap way to work around both issues.

> 
> > @@ -1123,7 +1126,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >  			continue;
> >  		}
> >  		/* If details->check_mapping, we leave swap entries. */
> > -		if (unlikely(details))
> > +		if (unlikely(details || !details->check_swap_entries))
> >  			continue;
> 
> &&

Ups, thanks for catching this! I was playing with the condition and
rearranged the code multiple times before posting.

Thanks!
---
diff --git a/mm/memory.c b/mm/memory.c
index 4750d7e942a3..49cafa195527 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1125,8 +1125,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			}
 			continue;
 		}
-		/* If details->check_mapping, we leave swap entries. */
-		if (unlikely(details || !details->check_swap_entries))
+		/* only check swap_entries if explicitly asked for in details */
+		if (unlikely(details && !details->check_swap_entries))
 			continue;
 
 		entry = pte_to_swp_entry(ptent);
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH] mm, oom: introduce oom reaper
  2015-11-26 11:08     ` Michal Hocko
@ 2015-11-26 15:24       ` Tetsuo Handa
  -1 siblings, 0 replies; 38+ messages in thread
From: Tetsuo Handa @ 2015-11-26 15:24 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: linux-mm, akpm, torvalds, mgorman, rientjes, riel, hughd, oleg,
	andrea, linux-kernel

Michal Hocko wrote:
> On Wed 25-11-15 15:08:06, Johannes Weiner wrote:
> > Hi Michal,
> > 
> > I think whatever we end up doing to smoothen things for the "common
> > case" (as much as OOM kills can be considered common), we need a plan
> > to resolve the memory deadlock situations in a finite amount of time.
> > 
> > Eventually we have to attempt killing another task. Or kill all of
> > them to save the kernel.
> > 
> > It just strikes me as odd to start with smoothening the common case,
> > rather than making it functionally correct first.
> 

Me too.

> I believe there is not an universally correct solution for this
> problem. OOM killer is a heuristic and a destructive one so I think we
> should limit it as much as possible. I do agree that we should allow an
> administrator to define a policy when things go terribly wrong - e.g.
> panic/emerg. reboot after the system is trashing on OOM for more than
> a defined amount of time. But I think that this is orthogonal to this
> patch. This patch should remove one large class of potential deadlocks
> and corner cases without too much cost or maintenance burden. It doesn't
> remove a need for the last resort solution though.
>  

From the point of view of technical staff at a support center, offering
the last resort solution has higher priority than smoothening the common
case. We cannot test all memory pressure patterns before a distributor's
kernel is released. We are too late to work around unhandled patterns
after the distributor's kernel is deployed to customers' systems.

Yet another report was posted in a different thread
"[PATCH] vmscan: do not throttle kthreads due to too_many_isolated".
I think I have shown you several times, including in "mm,oom: The reason why
I continue proposing timeout based approach."
(uptime > 100 of http://I-love.SAKURA.ne.jp/tmp/serial-20150920.txt.xz ),
that we are already seeing an infinite loop at too_many_isolated() even
without using out-of-tree drivers. I hope that my patch
http://lkml.kernel.org/r/201505232339.DAB00557.VFFLHMSOJFOOtQ@I-love.SAKURA.ne.jp
included all necessary changes for serving as a base for the last resort.
Please don't loop forever when unexpected memory pressure is given.
Please don't assume somebody else can make forward progress.

Please do consider offering the last resort solution first.
That will help reduce unexplained hangup/reboot troubles.

That said, here are several comments on your patch.

> @@ -408,6 +413,108 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
>  
>  bool oom_killer_disabled __read_mostly;
>  
> +/*
> + * OOM Reaper kernel thread which tries to reap the memory used by the OOM
> + * victim (if that is possible) to help the OOM killer to move on.
> + */
> +static struct task_struct *oom_reaper_th;
> +static struct mm_struct *mm_to_reap;
> +static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
> +
> +static bool __oom_reap_vmas(struct mm_struct *mm)
> +{
> +	struct mmu_gather tlb;
> +	struct vm_area_struct *vma;
> +	struct zap_details details = {.check_swap_entries = true,
> +				      .ignore_dirty = true};
> +	bool ret = true;
> +
> +	/* We might have raced with exit path */
> +	if (!atomic_inc_not_zero(&mm->mm_users))
> +		return true;
> +
> +	if (!down_read_trylock(&mm->mmap_sem)) {
> +		ret = false;
> +		goto out;
> +	}
> +
> +	tlb_gather_mmu(&tlb, mm, 0, -1);
> +	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> +		if (is_vm_hugetlb_page(vma))
> +			continue;
> +
> +		/*
> +		 * Only anonymous pages have a good chance to be dropped
> +		 * without additional steps which we cannot afford as we
> +		 * are OOM already.
> +		 */
> +		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> +			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
> +					 &details);

How do you plan to make sure that reclaimed pages are used by
fatal_signal_pending() tasks?
http://lkml.kernel.org/r/201509242050.EHE95837.FVFOOtMQHLJOFS@I-love.SAKURA.ne.jp
http://lkml.kernel.org/r/201510121543.EJF21858.LtJFHOOOSQVMFF@I-love.SAKURA.ne.jp

> +	}
> +	tlb_finish_mmu(&tlb, 0, -1);
> +	up_read(&mm->mmap_sem);
> +out:
> +	mmput(mm);
> +	return ret;
> +}

> +static int oom_reaper(void *unused)
> +{
> +	DEFINE_WAIT(wait);
> +
> +	while (!kthread_should_stop()) {

Is there a situation where this kernel thread should stop?
I think this kernel thread should not stop, because restarting
it might invoke the OOM killer.

But if there is such a situation, leaving this function with
oom_reaper_th != NULL is not handled by wake_oom_reaper().

> +		struct mm_struct *mm;
> +
> +		prepare_to_wait(&oom_reaper_wait, &wait, TASK_UNINTERRUPTIBLE);
> +		mm = READ_ONCE(mm_to_reap);

Why not simply do mm = xchg(&mm_to_reap, NULL) and free the slot for the
next OOM victim (though xchg() may not be sufficient)?
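
For concreteness, an untested sketch of that variant; the only differences
from the posted oom_reaper() are the xchg() and the dropped WRITE_ONCE():

static int oom_reaper(void *unused)
{
	DEFINE_WAIT(wait);

	while (!kthread_should_stop()) {
		struct mm_struct *mm;

		prepare_to_wait(&oom_reaper_wait, &wait, TASK_UNINTERRUPTIBLE);
		/* take ownership and free the slot for the next victim */
		mm = xchg(&mm_to_reap, NULL);
		if (!mm) {
			freezable_schedule();
			finish_wait(&oom_reaper_wait, &wait);
		} else {
			finish_wait(&oom_reaper_wait, &wait);
			oom_reap_vmas(mm);
		}
	}

	return 0;
}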

> +		if (!mm) {
> +			freezable_schedule();
> +			finish_wait(&oom_reaper_wait, &wait);
> +		} else {
> +			finish_wait(&oom_reaper_wait, &wait);
> +			oom_reap_vmas(mm);
> +			WRITE_ONCE(mm_to_reap, NULL);
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static void wake_oom_reaper(struct mm_struct *mm)
> +{
> +	struct mm_struct *old_mm;
> +
> +	if (!oom_reaper_th)
> +		return;
> +
> +	/*
> +	 * Make sure that only a single mm is ever queued for the reaper
> +	 * because multiple are not necessary and the operation might be
> +	 * disruptive so better reduce it to the bare minimum.
> +	 */
> +	old_mm = cmpxchg(&mm_to_reap, NULL, mm);

I think we should not skip queuing the next OOM victim, for it is possible
that the first OOM victim is chosen by one memory cgroup OOM, and the next
OOM victim is chosen by another memory cgroup OOM or a system wide OOM
before oom_reap_vmas() for the first OOM victim completes.

To handle such a case, we would need to do something like

 struct mm_struct {
     (...snipped...)
+    struct list_head *memdie; /* Set to non-NULL when chosen by OOM killer */
 }

and add to a list of OOM victims.
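
A rough, untested sketch of such a list based queue (it assumes a
"struct list_head oom_reap_list" member added to mm_struct and initialized
when the mm is set up; all names below are made up for illustration and
nothing here is part of the posted patch):

static LIST_HEAD(oom_reap_queue);
static DEFINE_SPINLOCK(oom_reap_lock);

static void queue_oom_victim(struct mm_struct *mm)
{
	spin_lock(&oom_reap_lock);
	if (list_empty(&mm->oom_reap_list)) {		/* not queued yet */
		atomic_inc(&mm->mm_count);		/* pin as in wake_oom_reaper */
		list_add_tail(&mm->oom_reap_list, &oom_reap_queue);
	}
	spin_unlock(&oom_reap_lock);
	wake_up(&oom_reaper_wait);
}

static struct mm_struct *dequeue_oom_victim(void)
{
	struct mm_struct *mm;

	spin_lock(&oom_reap_lock);
	mm = list_first_entry_or_null(&oom_reap_queue, struct mm_struct,
				      oom_reap_list);
	if (mm)
		list_del_init(&mm->oom_reap_list);
	spin_unlock(&oom_reap_lock);
	return mm;			/* reaper calls oom_reap_vmas() + mmdrop() */
}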

> +	if (!old_mm) {
> +		/*
> +		 * Pin the given mm. Use mm_count instead of mm_users because
> +		 * we do not want to delay the address space tear down.
> +		 */
> +		atomic_inc(&mm->mm_count);
> +		wake_up(&oom_reaper_wait);
> +	}
> +}
> +
>  /**
>   * mark_oom_victim - mark the given task as OOM victim
>   * @tsk: task to mark
> @@ -421,6 +528,11 @@ void mark_oom_victim(struct task_struct *tsk)
>  	/* OOM killer might race with memcg OOM */
>  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
>  		return;
> +
> +	/* Kick oom reaper to help us release some memory */
> +	if (tsk->mm)
> +		wake_oom_reaper(tsk->mm);
> +

We cannot wake the reaper up at this moment. It is too early because there
may be other threads sharing tsk->mm which have not yet received SIGKILL.
Also, there may be unkillable threads sharing tsk->mm. I think the
appropriate location to wake the reaper up is

                  do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
          }
          rcu_read_unlock();
          /* here */
          mmdrop(mm);
          put_task_struct(victim);
  }

in oom_kill_process().

>  	/*
>  	 * Make sure that the task is woken up from uninterruptible sleep
>  	 * if it is frozen because OOM killer wouldn't be able to free
> @@ -767,3 +879,22 @@ void pagefault_out_of_memory(void)
>  
>  	mutex_unlock(&oom_lock);
>  }
> +
> +static int __init oom_init(void)
> +{
> +	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> +	if (IS_ERR(oom_reaper_th)) {
> +		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
> +				PTR_ERR(oom_reaper_th));

BUG_ON(IS_ERR(oom_reaper_th)) or panic() should be OK.
Continuing with IS_ERR(oom_reaper_th) is not handled by wake_oom_reaper().

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH] mm, oom: introduce oom reaper
  2015-11-26 15:24       ` Tetsuo Handa
@ 2015-11-26 16:34         ` Michal Hocko
  -1 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2015-11-26 16:34 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, linux-mm, akpm, torvalds, mgorman, rientjes, riel, hughd,
	oleg, andrea, linux-kernel

On Fri 27-11-15 00:24:43, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 25-11-15 15:08:06, Johannes Weiner wrote:
> > > Hi Michal,
> > > 
> > > I think whatever we end up doing to smoothen things for the "common
> > > case" (as much as OOM kills can be considered common), we need a plan
> > > to resolve the memory deadlock situations in a finite amount of time.
> > > 
> > > Eventually we have to attempt killing another task. Or kill all of
> > > them to save the kernel.
> > > 
> > > It just strikes me as odd to start with smoothening the common case,
> > > rather than making it functionally correct first.
> > 
> 
> Me too.
> 
> > I believe there is not an universally correct solution for this
> > problem. OOM killer is a heuristic and a destructive one so I think we
> > should limit it as much as possible. I do agree that we should allow an
> > administrator to define a policy when things go terribly wrong - e.g.
> > panic/emerg. reboot after the system is trashing on OOM for more than
> > a defined amount of time. But I think that this is orthogonal to this
> > patch. This patch should remove one large class of potential deadlocks
> > and corner cases without too much cost or maintenance burden. It doesn't
> > remove a need for the last resort solution though.
> >  
> 
>  From the point of view of a technical staff at support center, offering
> the last resort solution has higher priority than smoothening the common
> case. We cannot test all memory pressure patterns before distributor's
> kernel is released. We are too late to workaround unhandled patterns
> after distributor's kernel is deployed to customer's systems.

This was posted months ago without gaining any additional traction.
 
> Yet another report was posted in a different thread
> "[PATCH] vmscan: do not throttle kthreads due to too_many_isolated".
> I think I showed you several times including "mm,oom: The reason why
> I continue proposing timeout based approach."
> (uptime > 100 of http://I-love.SAKURA.ne.jp/tmp/serial-20150920.txt.xz )
> that we are already seeing infinite loop at too_many_isolated() even
> without using out of tree drivers. I hope that my patch
> http://lkml.kernel.org/r/201505232339.DAB00557.VFFLHMSOJFOOtQ@I-love.SAKURA.ne.jp
> included all necessary changes for serving as a base of the last resort.
> Please don't loop forever when unexpected memory pressure is given.
> Please don't assume somebody else can make forward progress.
> 
> Please do consider offering the last resort solution first.
> That will help reducing unexplained hangup/reboot troubles.

And the answer has always been that the proposed last resort solutions
have their own issues. I am not going to repeat them here because I really
do not want to make this thread yet another mess of unrelated topics. I
am suggesting another reclaim technique, so please stick to the topic
here.  I do not see any reason why we should conflate this with last
resort solutions.
 
> Given that said, several comments on your patch.
> 
> > @@ -408,6 +413,108 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
> >  
> >  bool oom_killer_disabled __read_mostly;
> >  
> > +/*
> > + * OOM Reaper kernel thread which tries to reap the memory used by the OOM
> > + * victim (if that is possible) to help the OOM killer to move on.
> > + */
> > +static struct task_struct *oom_reaper_th;
> > +static struct mm_struct *mm_to_reap;
> > +static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
> > +
> > +static bool __oom_reap_vmas(struct mm_struct *mm)
> > +{
> > +	struct mmu_gather tlb;
> > +	struct vm_area_struct *vma;
> > +	struct zap_details details = {.check_swap_entries = true,
> > +				      .ignore_dirty = true};
> > +	bool ret = true;
> > +
> > +	/* We might have raced with exit path */
> > +	if (!atomic_inc_not_zero(&mm->mm_users))
> > +		return true;
> > +
> > +	if (!down_read_trylock(&mm->mmap_sem)) {
> > +		ret = false;
> > +		goto out;
> > +	}
> > +
> > +	tlb_gather_mmu(&tlb, mm, 0, -1);
> > +	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > +		if (is_vm_hugetlb_page(vma))
> > +			continue;
> > +
> > +		/*
> > +		 * Only anonymous pages have a good chance to be dropped
> > +		 * without additional steps which we cannot afford as we
> > +		 * are OOM already.
> > +		 */
> > +		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> > +			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
> > +					 &details);
> 
> How do you plan to make sure that reclaimed pages are used by
> fatal_signal_pending() tasks?
> http://lkml.kernel.org/r/201509242050.EHE95837.FVFOOtMQHLJOFS@I-love.SAKURA.ne.jp
> http://lkml.kernel.org/r/201510121543.EJF21858.LtJFHOOOSQVMFF@I-love.SAKURA.ne.jp

Well, wake_oom_reaper is responsible for handing over the mm of the OOM
victim, and as such it should belong to a killed process.  I guess you mean
that the mm might be shared with another process which is hidden from the
OOM killer, right? Well, I think this is not something to care about at this
layer. We shouldn't select a task which can lead to this situation in
the first place. Such an oom victim is basically selected incorrectly. I
think we can handle that with a flag in mm_struct.

I guess we have never cared too much about this case because it sounds
like a misconfiguration. If you want to shoot yourself in the head then do it.
It is true that this patch will make such a case more prominent because
we can cause a side effect now. I can cook up a patch to help sort
this out.

Thanks for pointing this out.

> 
> > +	}
> > +	tlb_finish_mmu(&tlb, 0, -1);
> > +	up_read(&mm->mmap_sem);
> > +out:
> > +	mmput(mm);
> > +	return ret;
> > +}
> 
> > +static int oom_reaper(void *unused)
> > +{
> > +	DEFINE_WAIT(wait);
> > +
> > +	while (!kthread_should_stop()) {
> 
> Is there a situation where this kernel thread should stop?

No, but this seems to be a generic kthread pattern so I've reused it.
I thought that every kernel thread was supposed to do that because
they get torn down on powerdown from somewhere, but I cannot seem to
find where right now. So for now I will go with while (true) here.

> I think this kernel thread should not stop because restarting
> this kernel thread might invoke the OOM killer.
> But if there is such situation, leaving this function with
> oom_reaper_th != NULL is not handled by wake_oom_reaper().
>
> 
> > +		struct mm_struct *mm;
> > +
> > +		prepare_to_wait(&oom_reaper_wait, &wait, TASK_UNINTERRUPTIBLE);
> > +		mm = READ_ONCE(mm_to_reap);
> 
> Why not simply mm = xchg(&mm_to_reap, NULL) and free the slot for
> next OOM victim (though xchg() may not be sufficient)?

Because it is not clear whether that is necessary. I wanted to keep the
effect of the external reaping to the bare minimum. oom_reap_vmas will
bail out early when it sees that it is racing with exit_mmap, so we
would release mm_to_reap early if we are not going to release anything
anyway.
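
(For context, the oom_reap_vmas() wrapper which is not visible in the
quoted hunk is just a bounded retry around the trylock - roughly like
this; the retry count and the sleep below are illustrative:)

	static void oom_reap_vmas(struct mm_struct *mm)
	{
		int attempts = 0;

		/* keep retrying while mmap_sem is contended, then give up */
		while (attempts++ < 10 && !__oom_reap_vmas(mm))
			msleep(100);
	}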

> 
> > +		if (!mm) {
> > +			freezable_schedule();
> > +			finish_wait(&oom_reaper_wait, &wait);
> > +		} else {
> > +			finish_wait(&oom_reaper_wait, &wait);
> > +			oom_reap_vmas(mm);
> > +			WRITE_ONCE(mm_to_reap, NULL);
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static void wake_oom_reaper(struct mm_struct *mm)
> > +{
> > +	struct mm_struct *old_mm;
> > +
> > +	if (!oom_reaper_th)
> > +		return;
> > +
> > +	/*
> > +	 * Make sure that only a single mm is ever queued for the reaper
> > +	 * because multiple are not necessary and the operation might be
> > +	 * disruptive so better reduce it to the bare minimum.
> > +	 */
> > +	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
> 
> I think we should not skip queuing next OOM victim, for it is possible
> that first OOM victim is chosen by one memory cgroup OOM, and next OOM
> victim is chosen by another memory cgroup OOM or system wide OOM before
> oom_reap_vmas() for first OOM victim completes.

Does that matter, though? Be it a memcg or a global OOM victim, we will
still release memory, which should help the global case that we care
about the most. The memcg OOM killer handling is easier because we do
not hold any locks while waiting for the OOM to be handled.

> To handle such case, we would need to do something like
> 
>  struct mm_struct {
>      (...snipped...)
> +    struct list_head *memdie; /* Set to non-NULL when chosen by OOM killer */
>  }
> 
> and add to a list of OOM victims.

I really wanted to avoid an additional memory footprint for a highly
unlikely case. Why should everybody pay for a case which is rarely hit?

Also, if this turns out to be a real problem then it can be added on top
of the existing code. I would really like this to be as simple as
possible.

[...]
> > @@ -421,6 +528,11 @@ void mark_oom_victim(struct task_struct *tsk)
> >  	/* OOM killer might race with memcg OOM */
> >  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> >  		return;
> > +
> > +	/* Kick oom reaper to help us release some memory */
> > +	if (tsk->mm)
> > +		wake_oom_reaper(tsk->mm);
> > +
> 
> We cannot wake up at this moment. It is too early because there may be
> other threads sharing tsk->mm but SIGKILL not yet received.

OK, I can see your point and I was considering that. I had settled on
the conclusion that we shouldn't care that much, because those threads
would page fault and get stuck in the allocator since we are OOM. There
might be some corner cases, but I guess you are right that this should
be as conservative as possible, so I will move the wake up to
oom_kill_process.

> Also, there
> may be unkillable threads sharing tsk->mm. I think appropriate location
> to wake the reaper up is

I have commented on this already.

> 
>                   do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
>           }
>           rcu_read_unlock();
>           /* here */
>           mmdrop(mm);
>           put_task_struct(victim);
>   }
> 
> in oom_kill_process().

OK, I will move it here. We will lose the other cases where nobody is
killed, because those would have to guarantee the same thing as
mentioned above, but that sounds like a good-for-now solution.

> 
> >  	/*
> >  	 * Make sure that the task is woken up from uninterruptible sleep
> >  	 * if it is frozen because OOM killer wouldn't be able to free
> > @@ -767,3 +879,22 @@ void pagefault_out_of_memory(void)
> >  
> >  	mutex_unlock(&oom_lock);
> >  }
> > +
> > +static int __init oom_init(void)
> > +{
> > +	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> > +	if (IS_ERR(oom_reaper_th)) {
> > +		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
> > +				PTR_ERR(oom_reaper_th));
> 
> BUG_ON(IS_ERR(oom_reaper_th)) or panic() should be OK.
> Continuing with IS_ERR(oom_reaper_th) is not handled by wake_oom_reaper().

Yes, but we can live without this kernel thread, right? I do not think
this will ever happen but why should we panic the system?

Thanks for the review.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH] mm, oom: introduce oom reaper
  2015-11-26 16:34         ` Michal Hocko
@ 2015-11-26 17:31           ` Michal Hocko
  -1 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2015-11-26 17:31 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, linux-mm, akpm, torvalds, mgorman, rientjes, riel, hughd,
	oleg, andrea, linux-kernel

On Thu 26-11-15 17:34:56, Michal Hocko wrote:
> On Fri 27-11-15 00:24:43, Tetsuo Handa wrote:
> > Michal Hocko wrote:
[...]
> > > +	tlb_gather_mmu(&tlb, mm, 0, -1);
> > > +	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > > +		if (is_vm_hugetlb_page(vma))
> > > +			continue;
> > > +
> > > +		/*
> > > +		 * Only anonymous pages have a good chance to be dropped
> > > +		 * without additional steps which we cannot afford as we
> > > +		 * are OOM already.
> > > +		 */
> > > +		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> > > +			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
> > > +					 &details);
> > 
> > How do you plan to make sure that reclaimed pages are used by
> > fatal_signal_pending() tasks?
> > http://lkml.kernel.org/r/201509242050.EHE95837.FVFOOtMQHLJOFS@I-love.SAKURA.ne.jp
> > http://lkml.kernel.org/r/201510121543.EJF21858.LtJFHOOOSQVMFF@I-love.SAKURA.ne.jp
> 
> Well the wake_oom_reaper is responsible to hand over mm of the OOM
> victim and as such it should be a killed process.  I guess you mean that
> the mm might be shared with another process which is hidden from the OOM
> killer, right? Well I think this is not something to care about at this
> layer. We shouldn't select a tasks which can lead to this situation in
> the first place. Such an oom victim is basically selected incorrectly. I
> think we can handle that by a flag in mm_struct.
> 
> I guess we have never cared too much about this case because it sounds
> like a misconfiguration. If you want to shoot your own head the do it.
> It is true that this patch will make such a case more prominent because
> we can cause a side effect now. I can cook up a patch to help to sort
> this out.
> 
> Thanks for pointing this out.

OK, so I was thinking about that some more and came to the conclusion
that we cannot use a per-mm_struct flag. This would basically be
equivalent to moving oom_score_adj to mm_struct, which has been shown to
be a problem in the past (especially for vfork(); set_oom_score;
execve() workloads). So I've ended up with the following. It means we
will not use the reaper in some cases, but those should be marginal and
some of them dubious at best.
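
(Just to illustrate the problematic window with made-up userspace code -
between vfork() and execve() the child still runs on the parent's mm, so
any OOM state stored in mm_struct during that window would hit the
parent as well and would be lost again once execve() builds a fresh mm:)

	#include <fcntl.h>
	#include <sys/types.h>
	#include <unistd.h>

	int main(void)
	{
		pid_t pid = vfork();

		if (pid == 0) {
			/* still running on the parent's mm here */
			int fd = open("/proc/self/oom_score_adj", O_WRONLY);

			if (fd >= 0) {
				write(fd, "-1000", 5);	/* meant for the child only */
				close(fd);
			}
			execl("/usr/bin/some-daemon", "some-daemon", (char *)NULL);
			_exit(1);
		}
		return 0;
	}
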
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 043c0fe146a5..bceeebe96a1b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -646,6 +646,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
+	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -736,15 +737,22 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD))
-			continue;
-		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		if (unlikely(p->flags & PF_KTHREAD) ||
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+			/*
+			 * We cannot use oom_reaper for the mm shared by this process
+			 * because it wouldn't get killed and so the memory might
+			 * still be used.
+			 */
+			can_oom_reap = false;
 			continue;
+		}
 
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
-	wake_oom_reaper(mm);
+	if (can_oom_reap)
+		wake_oom_reaper(mm);
 	mmdrop(mm);
 	put_task_struct(victim);
 }
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH] mm, oom: introduce oom reaper
  2015-11-26 16:34         ` Michal Hocko
@ 2015-11-27 11:29           ` Tetsuo Handa
  -1 siblings, 0 replies; 38+ messages in thread
From: Tetsuo Handa @ 2015-11-27 11:29 UTC (permalink / raw)
  To: mhocko
  Cc: hannes, linux-mm, akpm, torvalds, mgorman, rientjes, riel, hughd,
	oleg, andrea, linux-kernel

Michal Hocko wrote:
> > > +	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > > +		if (is_vm_hugetlb_page(vma))
> > > +			continue;
> > > +
> > > +		/*
> > > +		 * Only anonymous pages have a good chance to be dropped
> > > +		 * without additional steps which we cannot afford as we
> > > +		 * are OOM already.
> > > +		 */
> > > +		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> > > +			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
> > > +					 &details);
> > 
> > How do you plan to make sure that reclaimed pages are used by
> > fatal_signal_pending() tasks?
> > http://lkml.kernel.org/r/201509242050.EHE95837.FVFOOtMQHLJOFS@I-love.SAKURA.ne.jp
> > http://lkml.kernel.org/r/201510121543.EJF21858.LtJFHOOOSQVMFF@I-love.SAKURA.ne.jp
> 
> Well the wake_oom_reaper is responsible to hand over mm of the OOM
> victim and as such it should be a killed process.  I guess you mean that
> the mm might be shared with another process which is hidden from the OOM
> killer, right?

Right, but that is not what I wanted to say here. What I wanted to say
is that there can be other tasks looping inside
__alloc_pages_slowpath(). I'm worried that without a mechanism for
priority allocation (e.g. "[PATCH] mm/page_alloc: Favor kthread and dying
threads over normal threads" at
http://lkml.kernel.org/r/201509102318.GHG18789.OHMSLFJOQFOtFV@I-love.SAKURA.ne.jp ),
many !fatal_signal_pending() tasks could steal all the reclaimed pages
before the few fatal_signal_pending() tasks acquire them. Since
oom_kill_process() tries to kill a child task first, reaping that child
might reclaim only a few pages. And since the purpose of reclaiming the
OOM victim's mm pages implies helping the fatal_signal_pending() tasks
which are blocking TIF_MEMDIE tasks, so that the TIF_MEMDIE tasks can
also terminate more quickly, I think that priority allocation is needed.
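
The rough direction (this is only an illustration, not the patch linked
above) would be to let exiting tasks and kthreads dip below the
watermarks in the slowpath, e.g.:

	static inline bool alloc_has_priority(struct task_struct *tsk)
	{
		/*
		 * Illustrative policy: kthreads and tasks which already
		 * have SIGKILL pending are served first so they can finish.
		 */
		return (tsk->flags & PF_KTHREAD) || fatal_signal_pending(tsk);
	}

	/* ... and then somewhere in gfp_to_alloc_flags()/__alloc_pages_slowpath(): */
	if (alloc_has_priority(current))
		alloc_flags |= ALLOC_NO_WATERMARKS;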

>                Well I think this is not something to care about at this
> layer. We shouldn't select a tasks which can lead to this situation in
> the first place. Such an oom victim is basically selected incorrectly. I
> think we can handle that by a flag in mm_struct.

You mean updating select_bad_process() and oom_kill_process() not to
select an OOM victim whose mm is shared with unkillable tasks?

> > > +		if (!mm) {
> > > +			freezable_schedule();
> > > +			finish_wait(&oom_reaper_wait, &wait);
> > > +		} else {
> > > +			finish_wait(&oom_reaper_wait, &wait);
> > > +			oom_reap_vmas(mm);
> > > +			WRITE_ONCE(mm_to_reap, NULL);
> > > +		}
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static void wake_oom_reaper(struct mm_struct *mm)
> > > +{
> > > +	struct mm_struct *old_mm;
> > > +
> > > +	if (!oom_reaper_th)
> > > +		return;
> > > +
> > > +	/*
> > > +	 * Make sure that only a single mm is ever queued for the reaper
> > > +	 * because multiple are not necessary and the operation might be
> > > +	 * disruptive so better reduce it to the bare minimum.
> > > +	 */
> > > +	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
> > 
> > I think we should not skip queuing next OOM victim, for it is possible
> > that first OOM victim is chosen by one memory cgroup OOM, and next OOM
> > victim is chosen by another memory cgroup OOM or system wide OOM before
> > oom_reap_vmas() for first OOM victim completes.
> 

I forgot that currently a global OOM will not choose a next OOM victim
when a memcg OOM has already chosen the first victim. Thus, the
currently possible cases are limited to "the first OOM victim is chosen
by a memcg1 OOM, then the next OOM victim is chosen by a memcg2 OOM
before oom_reap_vmas() for the first OOM victim completes".

> Does that matter though. Be it a memcg OOM or a global OOM victim, we
> will still release a memory which should help the global case which we
> care about the most. Memcg OOM killer handling is easier because we do
> not hold any locks while waiting for the OOM to be handled.
> 

Why easier? A memcg OOM is triggered when the current thread triggers a
page fault, right? I don't know whether the current thread really holds
no locks when a page fault is triggered. But mem_cgroup_out_of_memory()
chooses an OOM victim which can be different from the current thread.
Therefore, I think problems caused by invisible dependencies exist for
memcg OOM as well.

> > To handle such case, we would need to do something like
> > 
> >  struct mm_struct {
> >      (...snipped...)
> > +    struct list_head *memdie; /* Set to non-NULL when chosen by OOM killer */
> >  }
> > 
> > and add to a list of OOM victims.
> 
> I really wanted to prevent from additional memory footprint for a highly
> unlikely case. Why should everybody pay for a case which is rarely hit?
> 
> Also if this turns out to be a real problem then it can be added on top
> of the existing code. I would really like this to be as easy as
> possible.

The sequence I think we could hit is:

  (1) a memcg1 OOM is invoked and a task in memcg1 is chosen as first OOM
      victim.

  (2) wake_oom_reaper() passes first victim's mm to oom_reaper().

  (3) oom_reaper() starts reclaiming first victim's mm.

  (4) a memcg2 OOM is invoked and a task in memcg2 is chosen as second
      OOM victim.

  (5) wake_oom_reaper() does not pass second victim's mm to oom_reaper()
      because mm_to_reap holds first victim's mm.

  (6) oom_reaper() finishes reclaiming first victim's mm, and sets
      mm_to_reap = NULL.

  (7) Second victim's mm (chosen at step (4)) is not passed to oom_reaper()
      because wake_oom_reaper() is not called again.

An invisible dependency problem in the first victim's mm can prevent
the second victim's mm from being reaped. This is unexpected from the
second victim's point of view. If oom_reaper() scanned the way
kmallocwd() does, any victim's mm would be reaped, and it could also do
other things like emitting warning messages, choosing the next OOM
victims and eventually calling panic(). That is why I wrote

  So, I thought that a dedicated kernel thread makes it easy to call memory
  unmapping code periodically again and again.

at http://lkml.kernel.org/r/201510022206.BHF13585.MSOHOFFLQtVOJF@I-love.SAKURA.ne.jp .
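
Just to sketch what I mean (the list member and the locking below are
made up for illustration), wake_oom_reaper() would append every victim's
mm to a list and oom_reaper() would drain that list instead of using the
single mm_to_reap slot:

	static LIST_HEAD(oom_reap_list);
	static DEFINE_SPINLOCK(oom_reap_lock);

	static void wake_oom_reaper(struct mm_struct *mm)
	{
		if (!oom_reaper_th)
			return;

		/* pin the mm until the reaper has processed it */
		atomic_inc(&mm->mm_count);
		spin_lock(&oom_reap_lock);
		/* ->oom_reap_entry would be a new list_head in mm_struct */
		list_add_tail(&mm->oom_reap_entry, &oom_reap_list);
		spin_unlock(&oom_reap_lock);
		wake_up(&oom_reaper_wait);
	}

oom_reaper() would then pop one entry at a time, reap it, and drop the
mm_count reference, so no victim would ever be skipped.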

> > 
> > >  	/*
> > >  	 * Make sure that the task is woken up from uninterruptible sleep
> > >  	 * if it is frozen because OOM killer wouldn't be able to free
> > > @@ -767,3 +879,22 @@ void pagefault_out_of_memory(void)
> > >  
> > >  	mutex_unlock(&oom_lock);
> > >  }
> > > +
> > > +static int __init oom_init(void)
> > > +{
> > > +	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> > > +	if (IS_ERR(oom_reaper_th)) {
> > > +		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
> > > +				PTR_ERR(oom_reaper_th));
> > 
> > BUG_ON(IS_ERR(oom_reaper_th)) or panic() should be OK.
> > Continuing with IS_ERR(oom_reaper_th) is not handled by wake_oom_reaper().
> 
> Yes, but we can live without this kernel thread, right? I do not think
> this will ever happen but why should we panic the system?

I do not think this will ever happen either; it is only for
documentation purposes.

oom_init() is called before the global init process in userspace is
started. Therefore, if kthread_run() fails, the global init process is
unlikely to be able to start anyway. Since you did not check for failure
in the

  vmstat_wq = alloc_workqueue("vmstat", WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);

case while you did check for failure in the

  oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");

case, I think that BUG_ON() or panic() would be OK if you check for
failure at all.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH] mm, oom: introduce oom reaper
  2015-11-27 11:29           ` Tetsuo Handa
@ 2015-11-27 12:35             ` Michal Hocko
  -1 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2015-11-27 12:35 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, linux-mm, akpm, torvalds, mgorman, rientjes, riel, hughd,
	oleg, andrea, linux-kernel

On Fri 27-11-15 20:29:39, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > > +	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > > > +		if (is_vm_hugetlb_page(vma))
> > > > +			continue;
> > > > +
> > > > +		/*
> > > > +		 * Only anonymous pages have a good chance to be dropped
> > > > +		 * without additional steps which we cannot afford as we
> > > > +		 * are OOM already.
> > > > +		 */
> > > > +		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> > > > +			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
> > > > +					 &details);
> > > 
> > > How do you plan to make sure that reclaimed pages are used by
> > > fatal_signal_pending() tasks?
> > > http://lkml.kernel.org/r/201509242050.EHE95837.FVFOOtMQHLJOFS@I-love.SAKURA.ne.jp
> > > http://lkml.kernel.org/r/201510121543.EJF21858.LtJFHOOOSQVMFF@I-love.SAKURA.ne.jp
> > 
> > Well the wake_oom_reaper is responsible to hand over mm of the OOM
> > victim and as such it should be a killed process.  I guess you mean that
> > the mm might be shared with another process which is hidden from the OOM
> > killer, right?
> 
> Right, but that is not what I wanted to say here. What I wanted to say
> here is, there can be other tasks that are looping inside
> __alloc_pages_slowpath(). I'm worrying that without a mechanism for
> priority allocating (e.g. "[PATCH] mm/page_alloc: Favor kthread and dying
> threads over normal threads" at
> http://lkml.kernel.org/r/201509102318.GHG18789.OHMSLFJOQFOtFV@I-love.SAKURA.ne.jp ),
> many !fatal_signal_pending() tasks could steal all reclaimed pages before
> a few fatal_signal_pending() tasks acquire them.

I fail to see how this is directly related to the patch, though. Yes,
this is something that might also happen during regular direct reclaim
without the OOM killer. Higher priority or just luckier tasks might
piggyback on reclaimers. This is not something this patch aims to
address.

> Since oom_kill_process()
> tries to kill child tasks, reaping the child task could reclaim only a few
> pages. Since I think that the purpose of reclaiming OOM victim's mm pages
> implies help fatal_signal_pending() tasks which are blocking TIF_MEMDIE
> tasks to terminate more quickly so that TIF_MEMDIE tasks can also terminate,
> I think that priority allocating is needed.

The purpose of this patch is to proactively free some memory. It is not
targeted at the OOM victims because that would require many more
changes, and I believe we should start somewhere and improve from there.
I can imagine that we could help OOM victims better - e.g. by creating
an OOM memory reserve - something that Johannes was proposing a few
months back.

> >                Well I think this is not something to care about at this
> > layer. We shouldn't select a tasks which can lead to this situation in
> > the first place. Such an oom victim is basically selected incorrectly. I
> > think we can handle that by a flag in mm_struct.
> 
> You mean updating select_bad_process() and oom_kill_process() not to select
> an OOM victim with mm shared by unkillable tasks?

See the follow up email.

[...]
> > Does that matter though. Be it a memcg OOM or a global OOM victim, we
> > will still release a memory which should help the global case which we
> > care about the most. Memcg OOM killer handling is easier because we do
> > not hold any locks while waiting for the OOM to be handled.
> > 
> 
> Why easier? Memcg OOM is triggered when current thread triggered a page
> fault, right? I don't know whether current thread really holds no locks
> when a page fault is triggered. But mem_cgroup_out_of_memory() choose
> an OOM victim which can be different from current thread. Therefore,
> I think problems caused by invisible dependency exist for memcg OOM.

The memcg charge doesn't loop endlessly like the page allocator does.
We simply fail with ENOMEM. It is true that the victim might be blocked
and not reacting to SIGKILL, but that blocking cannot happen on the
memcg oom path itself because of the above. Sure, the victim might
depend on a page allocation looping because of the global OOM, but then
this is reduced to the global case.

> > > To handle such case, we would need to do something like
> > > 
> > >  struct mm_struct {
> > >      (...snipped...)
> > > +    struct list_head *memdie; /* Set to non-NULL when chosen by OOM killer */
> > >  }
> > > 
> > > and add to a list of OOM victims.
> > 
> > I really wanted to prevent from additional memory footprint for a highly
> > unlikely case. Why should everybody pay for a case which is rarely hit?
> > 
> > Also if this turns out to be a real problem then it can be added on top
> > of the existing code. I would really like this to be as easy as
> > possible.
> 
> The sequences I think we could hit is,
> 
>   (1) a memcg1 OOM is invoked and a task in memcg1 is chosen as first OOM
>       victim.
> 
>   (2) wake_oom_reaper() passes first victim's mm to oom_reaper().
> 
>   (3) oom_reaper() starts reclaiming first victim's mm.
> 
>   (4) a memcg2 OOM is invoked and a task in memcg2 is chosen as second
>       OOM victim.
> 
>   (5) wake_oom_reaper() does not pass second victim's mm to oom_reaper()
>       because mm_to_reap holds first victim's mm.
> 
>   (6) oom_reaper() finishes reclaiming first victim's mm, and sets
>       mm_to_reap = NULL.
> 
>   (7) Second victim's mm (chosen at step (4)) is not passed to oom_reaper()
>       because wake_oom_reaper() is not called again.
> 
> Invisible dependency problem in first victim's mm can prevent second
> victim's mm from reaping. This is an unexpected thing for second victim.
> If oom_reaper() scans like kmallocwd() does, any victim's mm will be
> reaped, as well as doing different things like emitting warning messages,
> choosing next OOM victims and eventually calling panic(). Therefore, I wrote

As I've said, I am not really worried about the memcg case, at least
not now with this RFC. Extending the current mechanism to a queue seems
like an implementation detail to me. Is this really something that is so
important to have from the very beginning? I mean, don't get me wrong,
but I would really like to get the basic case - the global OOM killer -
right first and only build additional changes on top.

[...]
> > > > +static int __init oom_init(void)
> > > > +{
> > > > +	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
> > > > +	if (IS_ERR(oom_reaper_th)) {
> > > > +		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
> > > > +				PTR_ERR(oom_reaper_th));
> > > 
> > > BUG_ON(IS_ERR(oom_reaper_th)) or panic() should be OK.
> > > Continuing with IS_ERR(oom_reaper_th) is not handled by wake_oom_reaper().
> > 
> > Yes, but we can live without this kernel thread, right? I do not think
> > this will ever happen but why should we panic the system?
> 
> I do not think this will ever happen. Only for documentation purpose.

I do not think we add BUG_ONs for documentation purposes. The code and
the message sound like sufficient documentation that this is not really
a critical part of the system.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [RFC PATCH -v2] mm, oom: introduce oom reaper
  2015-11-25 15:56 ` Michal Hocko
@ 2015-11-27 16:12   ` Michal Hocko
  -1 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2015-11-27 16:12 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Linus Torvalds, Mel Gorman, David Rientjes,
	Tetsuo Handa, Rik van Riel, Hugh Dickins, Oleg Nesterov,
	Andrea Argangeli, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
independently brought up by Oleg Nesterov.

The OOM killer currently allows to kill only a single task in a good
hope that the task will terminate in a reasonable time and frees up its
memory.  Such a task (oom victim) will get an access to memory reserves
via mark_oom_victim to allow a forward progress should there be a need
for additional memory during exit path.

It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above and
the OOM victim might take unbounded amount of time to exit because it
might be blocked in the uninterruptible state waiting on an event
(e.g. lock) which is blocked by another task looping in the page
allocator.

This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped out memory
owned by the oom victim under an assumption that such a memory won't
be needed when its owner is killed and kicked from the userspace anyway.
There is one notable exception to this, though, if the OOM victim was
in the process of coredumping the result would be incomplete. This is
considered a reasonable constraint because the overall system health is
more important than debugability of a particular application.

A kernel thread has been chosen because we need a reliable way of
invocation, so a workqueue context is not appropriate because all the
workers might be busy (e.g. allocating memory). Kswapd, which sounds
like another good fit, is not appropriate either because it might get
blocked on locks during reclaim as well.

oom_reaper has to take mmap_sem on the target task for reading, so the
solution is not 100% because the semaphore might be held or blocked
for write while write but the probability is reduced considerably wrt.
basically any lock blocking forward progress as described above. In
order to prevent blocking on the lock without any forward progress we
use only a trylock and retry 10 times with a short sleep in between.
Users of mmap_sem which need it for write should be carefully reviewed
to use _killable waiting as much as possible and to reduce allocation
requests done with the lock held to the absolute minimum to lower the
risk even further.

The API between the oom killer and the oom reaper is quite trivial.
wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only the
NUll->mm transition and oom_reaper clears it atomically once it is done
with the work. This means that only a single mm_struct can be reaped at
a time. As the operation is potentially disruptive we try to limit it
to the necessary minimum and the reaper blocks any updates while it
operates on an mm. mm_struct is pinned by mm_count to allow a parallel
exit_mmap and a race is detected by atomic_inc_not_zero(mm_users).

Changes since v1
- fix the screwed up detail->check_swap_entries - Johannes
- do not use kthread_should_stop because that would need a cleanup
  and we do not have anybody to stop us - Tetsuo
- move wake_oom_reaper to oom_kill_process because we have to wait
  for all tasks sharing the same mm to get killed - Tetsuo
- do not reap mm structs which are shared with unkillable tasks - Tetsuo

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h |   2 +
 mm/internal.h      |   5 ++
 mm/memory.c        |  12 ++---
 mm/oom_kill.c      | 142 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 152 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 25cdec395f2c..d1ce03569942 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1061,6 +1061,8 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
+	bool ignore_dirty;			/* Ignore dirty pages */
+	bool check_swap_entries;		/* Check also swap entries */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/internal.h b/mm/internal.h
index 4ae7b7c7462b..9006ce1960ff 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -41,6 +41,11 @@ extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
+void unmap_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end,
+			     struct zap_details *details);
+
 static inline void set_page_count(struct page *page, int v)
 {
 	atomic_set(&page->_count, v);
diff --git a/mm/memory.c b/mm/memory.c
index f5b8e8c9f4c3..ccfd48c7286e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1104,6 +1104,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			if (!PageAnon(page)) {
 				if (pte_dirty(ptent)) {
+					/* oom_reaper cannot tear down dirty pages */
+					if (unlikely(details && details->ignore_dirty))
+						continue;
 					force_flush = 1;
 					set_page_dirty(page);
 				}
@@ -1122,8 +1125,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			}
 			continue;
 		}
-		/* If details->check_mapping, we leave swap entries. */
-		if (unlikely(details))
+		/* only check swap_entries if explicitly asked for in details */
+		if (unlikely(details && !details->check_swap_entries))
 			continue;
 
 		entry = pte_to_swp_entry(ptent);
@@ -1228,7 +1231,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 	return addr;
 }
 
-static void unmap_page_range(struct mmu_gather *tlb,
+void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
 			     struct zap_details *details)
@@ -1236,9 +1239,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
 	pgd_t *pgd;
 	unsigned long next;
 
-	if (details && !details->check_mapping)
-		details = NULL;
-
 	BUG_ON(addr >= end);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5314b206caa5..3f22efccc0a1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,6 +35,11 @@
 #include <linux/freezer.h>
 #include <linux/ftrace.h>
 #include <linux/ratelimit.h>
+#include <linux/kthread.h>
+#include <linux/module.h>
+
+#include <asm/tlb.h>
+#include "internal.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/oom.h>
@@ -408,6 +413,108 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
 bool oom_killer_disabled __read_mostly;
 
+/*
+ * OOM Reaper kernel thread which tries to reap the memory used by the OOM
+ * victim (if that is possible) to help the OOM killer to move on.
+ */
+static struct task_struct *oom_reaper_th;
+static struct mm_struct *mm_to_reap;
+static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
+
+static bool __oom_reap_vmas(struct mm_struct *mm)
+{
+	struct mmu_gather tlb;
+	struct vm_area_struct *vma;
+	struct zap_details details = {.check_swap_entries = true,
+				      .ignore_dirty = true};
+	bool ret = true;
+
+	/* We might have raced with exit path */
+	if (!atomic_inc_not_zero(&mm->mm_users))
+		return true;
+
+	if (!down_read_trylock(&mm->mmap_sem)) {
+		ret = false;
+		goto out;
+	}
+
+	tlb_gather_mmu(&tlb, mm, 0, -1);
+	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+		if (is_vm_hugetlb_page(vma))
+			continue;
+
+		/*
+		 * Only anonymous pages have a good chance to be dropped
+		 * without additional steps which we cannot afford as we
+		 * are OOM already.
+		 */
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
+					 &details);
+	}
+	tlb_finish_mmu(&tlb, 0, -1);
+	up_read(&mm->mmap_sem);
+out:
+	mmput(mm);
+	return ret;
+}
+
+static void oom_reap_vmas(struct mm_struct *mm)
+{
+	int attempts = 0;
+
+	while (attempts++ < 10 && !__oom_reap_vmas(mm))
+		schedule_timeout(HZ/10);
+
+	/* Drop a reference taken by wake_oom_reaper */
+	mmdrop(mm);
+}
+
+static int oom_reaper(void *unused)
+{
+	DEFINE_WAIT(wait);
+
+	while (true) {
+		struct mm_struct *mm;
+
+		prepare_to_wait(&oom_reaper_wait, &wait, TASK_UNINTERRUPTIBLE);
+		mm = READ_ONCE(mm_to_reap);
+		if (!mm) {
+			freezable_schedule();
+			finish_wait(&oom_reaper_wait, &wait);
+		} else {
+			finish_wait(&oom_reaper_wait, &wait);
+			oom_reap_vmas(mm);
+			WRITE_ONCE(mm_to_reap, NULL);
+		}
+	}
+
+	return 0;
+}
+
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+	struct mm_struct *old_mm;
+
+	if (!oom_reaper_th)
+		return;
+
+	/*
+	 * Make sure that only a single mm is ever queued for the reaper
+	 * because multiple are not necessary and the operation might be
+	 * disruptive so better reduce it to the bare minimum.
+	 */
+	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
+	if (!old_mm) {
+		/*
+		 * Pin the given mm. Use mm_count instead of mm_users because
+		 * we do not want to delay the address space tear down.
+		 */
+		atomic_inc(&mm->mm_count);
+		wake_up(&oom_reaper_wait);
+	}
+}
+
 /**
  * mark_oom_victim - mark the given task as OOM victim
  * @tsk: task to mark
@@ -421,6 +528,7 @@ void mark_oom_victim(struct task_struct *tsk)
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
 		return;
+
 	/*
 	 * Make sure that the task is woken up from uninterruptible sleep
 	 * if it is frozen because OOM killer wouldn't be able to free
@@ -517,6 +625,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
+	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -607,15 +716,23 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD))
-			continue;
-		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		if (unlikely(p->flags & PF_KTHREAD) ||
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+			/*
+			 * We cannot usee oom_reaper for the mm shared by this process
+			 * because it wouldn't get killed and so the memory might be
+			 * still used.
+			 */
+			can_oom_reap = false;
 			continue;
+		}
 
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
+	if (can_oom_reap)
+		wake_oom_reaper(mm);
 	mmdrop(mm);
 	put_task_struct(victim);
 }
@@ -767,3 +884,22 @@ void pagefault_out_of_memory(void)
 
 	mutex_unlock(&oom_lock);
 }
+
+static int __init oom_init(void)
+{
+	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+	if (IS_ERR(oom_reaper_th)) {
+		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
+				PTR_ERR(oom_reaper_th));
+	} else {
+		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+
+		/*
+		 * Make sure our oom reaper thread will get scheduled when
+		 * ASAP and that it won't get preempted by malicious userspace.
+		 */
+		sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
+	}
+	return 0;
+}
+module_init(oom_init)
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH -v2] mm, oom: introduce oom reaper
  2015-11-27 16:12   ` Michal Hocko
@ 2015-11-27 16:39     ` Mel Gorman
  -1 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2015-11-27 16:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Linus Torvalds, David Rientjes,
	Tetsuo Handa, Rik van Riel, Hugh Dickins, Oleg Nesterov,
	Andrea Argangeli, LKML, Michal Hocko

On Fri, Nov 27, 2015 at 05:12:52PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> independently brought up by Oleg Nesterov.
> 
> <SNIP>
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Other than a few small issues below, I didn't spot anything out of the
ordinary so

Acked-by: Mel Gorman <mgorman@suse.de>

> +	tlb_gather_mmu(&tlb, mm, 0, -1);
> +	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> +		if (is_vm_hugetlb_page(vma))
> +			continue;
> +
> +		/*
> +		 * Only anonymous pages have a good chance to be dropped
> +		 * without additional steps which we cannot afford as we
> +		 * are OOM already.
> +		 */
> +		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> +			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
> +					 &details);
> +	}

Care to add a comment why clean file pages should not be discarded? I'm
assuming it's because you assume they were discarded already by normal
reclaim before OOM. There is a slight possibility they are being kept
alive because the OOM victim is constantly referencing them so they get
activated, or that there might be additional work to discard buffers, but
I'm not 100% sure that's your logic.

> @@ -421,6 +528,7 @@ void mark_oom_victim(struct task_struct *tsk)
>  	/* OOM killer might race with memcg OOM */
>  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
>  		return;
> +
>  	/*
>  	 * Make sure that the task is woken up from uninterruptible sleep
>  	 * if it is frozen because OOM killer wouldn't be able to free

Unnecessary whitespace change.

> @@ -607,15 +716,23 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			continue;
>  		if (same_thread_group(p, victim))
>  			continue;
> -		if (unlikely(p->flags & PF_KTHREAD))
> -			continue;
> -		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> +		if (unlikely(p->flags & PF_KTHREAD) ||
> +		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> +			/*
> +			 * We cannot usee oom_reaper for the mm shared by this process
> +			 * because it wouldn't get killed and so the memory might be
> +			 * still used.
> +			 */
> +			can_oom_reap = false;
>  			continue;

s/usee/use/

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH -v2] mm, oom: introduce oom reaper
  2015-11-27 16:12   ` Michal Hocko
@ 2015-11-28  4:39     ` Tetsuo Handa
  -1 siblings, 0 replies; 38+ messages in thread
From: Tetsuo Handa @ 2015-11-28  4:39 UTC (permalink / raw)
  To: mhocko, linux-mm
  Cc: akpm, torvalds, mgorman, rientjes, riel, hughd, oleg, andrea,
	linux-kernel, mhocko

Michal Hocko wrote:
> for write while write but the probability is reduced considerably wrt.

Is this "while write" garbage?

> Users of mmap_sem which need it for write should be carefully reviewed
> to use _killable waiting as much as possible and reduce allocations
> requests done with the lock held to absolute minimum to reduce the risk
> even further.

It will be nice if we can have down_write_killable()/down_read_killable().
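
For illustration, a minimal sketch of how an mmap_sem writer could back
off, assuming a down_write_killable() primitive with the obvious
semantics (0 on success, -EINTR when a fatal signal arrives); no such
helper exists at this point:

	if (down_write_killable(&mm->mmap_sem))
		return -EINTR;	/* killed, give up instead of blocking */
	/* ... modify the address space ... */
	up_write(&mm->mmap_sem);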

> The API between oom killer and oom reaper is quite trivial. wake_oom_reaper
> updates mm_to_reap with cmpxchg to guarantee only NUll->mm transition

NULL->mm

> and oom_reaper clear this atomically once it is done with the work.

Can't oom_reaper() become as compact as below?

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3f22efc..c2ab7f9 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -472,21 +472,10 @@ static void oom_reap_vmas(struct mm_struct *mm)
 
 static int oom_reaper(void *unused)
 {
-	DEFINE_WAIT(wait);
-
 	while (true) {
-		struct mm_struct *mm;
-
-		prepare_to_wait(&oom_reaper_wait, &wait, TASK_UNINTERRUPTIBLE);
-		mm = READ_ONCE(mm_to_reap);
-		if (!mm) {
-			freezable_schedule();
-			finish_wait(&oom_reaper_wait, &wait);
-		} else {
-			finish_wait(&oom_reaper_wait, &wait);
-			oom_reap_vmas(mm);
-			WRITE_ONCE(mm_to_reap, NULL);
-		}
+		wait_event_freezable(oom_reaper_wait, mm_to_reap);
+		oom_reap_vmas(mm_to_reap);
+		mm_to_reap = NULL;
 	}
 
 	return 0;

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH -v2] mm, oom: introduce oom reaper
  2015-11-28  4:39     ` Tetsuo Handa
@ 2015-11-28 16:10       ` Tetsuo Handa
  -1 siblings, 0 replies; 38+ messages in thread
From: Tetsuo Handa @ 2015-11-28 16:10 UTC (permalink / raw)
  To: mhocko, linux-mm
  Cc: akpm, torvalds, mgorman, rientjes, riel, hughd, oleg, andrea,
	linux-kernel, mhocko

Tetsuo Handa wrote:
> > Users of mmap_sem which need it for write should be carefully reviewed
> > to use _killable waiting as much as possible and reduce allocations
> > requests done with the lock held to absolute minimum to reduce the risk
> > even further.
> 
> It will be nice if we can have down_write_killable()/down_read_killable().

It will be nice if we can also have __GFP_KILLABLE. Although currently it
can't be perfect because the reclaim functions called from
__alloc_pages_slowpath() use unkillable waits, even just bailing out, as
__GFP_NORETRY does, when fatal_signal_pending(current) is true would be
helpful.
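
As an illustration only, the bail-out described above could look roughly
like this; __GFP_KILLABLE is a hypothetical flag and the exact placement
inside __alloc_pages_slowpath() is just an assumption:

	/* hypothetical check in the allocator slowpath */
	if ((gfp_mask & __GFP_KILLABLE) && fatal_signal_pending(current))
		goto nopage;	/* fail the allocation as __GFP_NORETRY would */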

So far I'm hitting no problem with testers except the one using mmap()/munmap().

I think that cmpxchg() was not needed.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c2ab7f9..1a65739 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -483,8 +483,6 @@ static int oom_reaper(void *unused)
 
 static void wake_oom_reaper(struct mm_struct *mm)
 {
-	struct mm_struct *old_mm;
-
 	if (!oom_reaper_th)
 		return;
 
@@ -492,14 +490,15 @@ static void wake_oom_reaper(struct mm_struct *mm)
 	 * Make sure that only a single mm is ever queued for the reaper
 	 * because multiple are not necessary and the operation might be
 	 * disruptive so better reduce it to the bare minimum.
+	 * Caller is serialized by oom_lock mutex.
 	 */
-	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
-	if (!old_mm) {
+	if (!mm_to_reap) {
 		/*
 		 * Pin the given mm. Use mm_count instead of mm_users because
 		 * we do not want to delay the address space tear down.
 		 */
 		atomic_inc(&mm->mm_count);
+		mm_to_reap = mm;
 		wake_up(&oom_reaper_wait);
 	}
 }

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH -v2] mm, oom: introduce oom reaper
  2015-11-27 16:39     ` Mel Gorman
@ 2015-12-01 13:07       ` Michal Hocko
  -1 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2015-12-01 13:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Andrew Morton, Linus Torvalds, David Rientjes,
	Tetsuo Handa, Rik van Riel, Hugh Dickins, Oleg Nesterov,
	Andrea Argangeli, LKML

On Fri 27-11-15 16:39:00, Mel Gorman wrote:
> On Fri, Nov 27, 2015 at 05:12:52PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> > independently brought up by Oleg Nesterov.
> > 
> > <SNIP>
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Other than a few small issues below, I didn't spot anything out of the
> ordinary so
> 
> Acked-by: Mel Gorman <mgorman@suse.de>

Thanks!

> 
> > +	tlb_gather_mmu(&tlb, mm, 0, -1);
> > +	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> > +		if (is_vm_hugetlb_page(vma))
> > +			continue;
> > +
> > +		/*
> > +		 * Only anonymous pages have a good chance to be dropped
> > +		 * without additional steps which we cannot afford as we
> > +		 * are OOM already.
> > +		 */
> > +		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> > +			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
> > +					 &details);
> > +	}
> 
> Care to add a comment why clean file pages should not be discarded? I'm
> assuming it's because you assume they were discarded already by normal
> reclaim before OOM.

Yes that is exactly my thinking. We are OOM so all the reclaim attempts
have failed already. Clean page cache is highly improbable and we do not
want to waste cycles without a good reason. Even though the oom_reaper
thread doesn't care so much about a few cycles, it still has the mm ref
count elevated so it might block a real exit_mmap.

I will add a comment:
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3efd1efc8cd1..ece3eda4ee99 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -454,6 +454,11 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
 		 * Only anonymous pages have a good chance to be dropped
 		 * without additional steps which we cannot afford as we
 		 * are OOM already.
+		 *
+		 * We do not even care about fs backed pages because all
+		 * which are reclaimable have already been reclaimed and
+		 * we do not want to block exit_mmap by keeping mm ref
+		 * count elevated without a good reason.
 		 */
 		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
 			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,

> There is a slightly possibility they are been kept
> alive because the OOM victim is constantly referencing them so they get
> activated or that there might be additional work to discard buffers but
> I'm not 100% sure that's your logic.
> 
> > @@ -421,6 +528,7 @@ void mark_oom_victim(struct task_struct *tsk)
> >  	/* OOM killer might race with memcg OOM */
> >  	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> >  		return;
> > +
> >  	/*
> >  	 * Make sure that the task is woken up from uninterruptible sleep
> >  	 * if it is frozen because OOM killer wouldn't be able to free
> 
> Unnecessary whitespace change.

removed

> 
> > @@ -607,15 +716,23 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  			continue;
> >  		if (same_thread_group(p, victim))
> >  			continue;
> > -		if (unlikely(p->flags & PF_KTHREAD))
> > -			continue;
> > -		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> > +		if (unlikely(p->flags & PF_KTHREAD) ||
> > +		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> > +			/*
> > +			 * We cannot usee oom_reaper for the mm shared by this process
> > +			 * because it wouldn't get killed and so the memory might be
> > +			 * still used.
> > +			 */
> > +			can_oom_reap = false;
> >  			continue;
> 
> s/usee/use/

Fixed

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH -v2] mm, oom: introduce oom reaper
  2015-11-28  4:39     ` Tetsuo Handa
@ 2015-12-01 13:22       ` Michal Hocko
  -1 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2015-12-01 13:22 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, akpm, torvalds, mgorman, rientjes, riel, hughd, oleg,
	andrea, linux-kernel

On Sat 28-11-15 13:39:11, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > for write while write but the probability is reduced considerably wrt.
> 
> Is this "while write" garbage?

Fixed

> > Users of mmap_sem which need it for write should be carefully reviewed
> > to use _killable waiting as much as possible and reduce allocations
> > requests done with the lock held to absolute minimum to reduce the risk
> > even further.
> 
> It will be nice if we can have down_write_killable()/down_read_killable().

Yes that is an idea.

> > The API between oom killer and oom reaper is quite trivial. wake_oom_reaper
> > updates mm_to_reap with cmpxchg to guarantee only NUll->mm transition
> 
> NULL->mm

fixed

> > and oom_reaper clear this atomically once it is done with the work.
> 
> Can't oom_reaper() become as compact as below?

Good idea! I think we still need {READ,WRITE}_ONCE to prevent any
potential mis-optimizations, though.

Here is what I did:
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 333953bf4968..b50ce41194b3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -477,21 +477,11 @@ static void oom_reap_vmas(struct mm_struct *mm)
 
 static int oom_reaper(void *unused)
 {
-	DEFINE_WAIT(wait);
-
 	while (true) {
 		struct mm_struct *mm;
-
-		prepare_to_wait(&oom_reaper_wait, &wait, TASK_UNINTERRUPTIBLE);
-		mm = READ_ONCE(mm_to_reap);
-		if (!mm) {
-			freezable_schedule();
-			finish_wait(&oom_reaper_wait, &wait);
-		} else {
-			finish_wait(&oom_reaper_wait, &wait);
-			oom_reap_vmas(mm);
-			WRITE_ONCE(mm_to_reap, NULL);
-		}
+		wait_event_freezable(oom_reaper_wait, (mm = READ_ONCE(mm_to_reap)));
+		oom_reap_vmas(mm);
+		WRITE_ONCE(mm_to_reap, NULL);
 	}
 
 	return 0;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH -v2] mm, oom: introduce oom reaper
  2015-11-28 16:10       ` Tetsuo Handa
@ 2015-12-01 13:29         ` Michal Hocko
  -1 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2015-12-01 13:29 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, akpm, torvalds, mgorman, rientjes, riel, hughd, oleg,
	andrea, linux-kernel

On Sun 29-11-15 01:10:10, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > > Users of mmap_sem which need it for write should be carefully reviewed
> > > to use _killable waiting as much as possible and reduce allocations
> > > requests done with the lock held to absolute minimum to reduce the risk
> > > even further.
> > 
> > It will be nice if we can have down_write_killable()/down_read_killable().
> 
> It will be nice if we can also have __GFP_KILLABLE.

Well, we already do this implicitly because the OOM killer will
automatically do mark_oom_victim when the allocating task has
fatal_signal_pending, and then __alloc_pages_slowpath fails the
allocation if the memory reserves do not help to finish it.
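
Roughly, the allocator slowpath of this era already contains checks along
these lines (paraphrased for illustration, not an exact quote of any
particular kernel version):

	/* TIF_MEMDIE tasks are allowed to dip into the memory reserves */
	if (test_thread_flag(TIF_MEMDIE))
		alloc_flags |= ALLOC_NO_WATERMARKS;
	...
	/* and they do not loop forever if even the reserves do not help */
	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
		goto nopage;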

> Although currently it can't
> be perfect because reclaim functions called from __alloc_pages_slowpath() use
> unkillable waits, starting from just bail out as with __GFP_NORETRY when
> fatal_signal_pending(current) is true will be helpful.
> 
> So far I'm hitting no problem with testers except the one using mmap()/munmap().
> 
> I think that cmpxchg() was not needed.

It is not needed right now but I would rather not depend on the oom
mutex here. This is not a hot path where an atomic would add any
noticeable overhead.

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index c2ab7f9..1a65739 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -483,8 +483,6 @@ static int oom_reaper(void *unused)
>  
>  static void wake_oom_reaper(struct mm_struct *mm)
>  {
> -	struct mm_struct *old_mm;
> -
>  	if (!oom_reaper_th)
>  		return;
>  
> @@ -492,14 +490,15 @@ static void wake_oom_reaper(struct mm_struct *mm)
>  	 * Make sure that only a single mm is ever queued for the reaper
>  	 * because multiple are not necessary and the operation might be
>  	 * disruptive so better reduce it to the bare minimum.
> +	 * Caller is serialized by oom_lock mutex.
>  	 */
> -	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
> -	if (!old_mm) {
> +	if (!mm_to_reap) {
>  		/*
>  		 * Pin the given mm. Use mm_count instead of mm_users because
>  		 * we do not want to delay the address space tear down.
>  		 */
>  		atomic_inc(&mm->mm_count);
> +		mm_to_reap = mm;
>  		wake_up(&oom_reaper_wait);
>  	}
>  }

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH -v2] mm, oom: introduce oom reaper
  2015-12-01 13:29         ` Michal Hocko
@ 2015-12-05 12:33           ` Tetsuo Handa
  -1 siblings, 0 replies; 38+ messages in thread
From: Tetsuo Handa @ 2015-12-05 12:33 UTC (permalink / raw)
  To: mhocko
  Cc: linux-mm, akpm, torvalds, mgorman, rientjes, riel, hughd, oleg,
	andrea, linux-kernel

Michal Hocko wrote:
> On Sun 29-11-15 01:10:10, Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> > > > Users of mmap_sem which need it for write should be carefully reviewed
> > > > to use _killable waiting as much as possible and reduce allocations
> > > > requests done with the lock held to absolute minimum to reduce the risk
> > > > even further.
> > > 
> > > It will be nice if we can have down_write_killable()/down_read_killable().
> > 
> > It will be nice if we can also have __GFP_KILLABLE.
> 
> Well, we already do this implicitly because OOM killer will
> automatically do mark_oom_victim if it has fatal_signal_pending and then
> __alloc_pages_slowpath fails the allocation if the memory reserves do
> not help to finish the allocation.

I don't think so, because !__GFP_FS && !__GFP_NOFAIL allocations do not do
mark_oom_victim() even if fatal_signal_pending() is true, since
out_of_memory() is not called for them.

Also, __GFP_KILLABLE is helpful even before the kernel declares OOM because
users can give up earlier when memory allocation is stalling (i.e. it allows
users who have decided that memory allocation is too slow to be worth waiting
for to kill processes before the kernel declares OOM).
I'm willing to use __GFP_KILLABLE from the TOMOYO security module because we
are using GFP_NOFS allocations for checking permissions of access requests
from user space (because some LSM hooks are GFP_KERNEL unsafe), where failing
GFP_NOFS allocations without invoking the OOM killer can result in
unrecoverable failure (e.g. unexpected termination of critical processes).

Anyway, __GFP_KILLABLE is outside of this thread, so I stop here for now.



> > Although currently it can't
> > be perfect because reclaim functions called from __alloc_pages_slowpath() use
> > unkillable waits, starting from just bail out as with __GFP_NORETRY when
> > fatal_signal_pending(current) is true will be helpful.
> > 
> > So far I'm hitting no problem with testers except the one using mmap()/munmap().
> > 
> > I think that cmpxchg() was not needed.
> 
> It is not needed right now but I would rather not depend on the oom
> mutex here. This is not a hot path where an atomic would add an
> overhead.

The current patch can allow oom_reaper() to call mmdrop(mm) before
wake_oom_reaper() calls atomic_inc(&mm->mm_count), because a sequence like

  oom_reaper() (a realtime thread)         wake_oom_reaper() (current thread)         Current OOM victim

  oom_reap_vmas(mm); /* mm = Previous OOM victim */
  WRITE_ONCE(mm_to_reap, NULL);
                                           old_mm = cmpxchg(&mm_to_reap, NULL, mm); /* mm = Current OOM victim */
                                           if (!old_mm) {
  wait_event_freezable(oom_reaper_wait, (mm = READ_ONCE(mm_to_reap)));
  oom_reap_vmas(mm); /* mm = Current OOM victim, undo atomic_inc(&mm->mm_count) done by oom_kill_process() */
  WRITE_ONCE(mm_to_reap, NULL);
                                                                                      exit and release mm
                                           atomic_inc(&mm->mm_count); /* mm = Current OOM victim */
                                           wake_up(&oom_reaper_wait);

  wait_event_freezable(oom_reaper_wait, (mm = READ_ONCE(mm_to_reap))); /* mm = Next OOM victim */

is possible.

If you are serious about execution ordering, we should protect mm_to_reap
using smp_mb__after_atomic_inc(), rcu_assign_pointer()/rcu_dereference() etc.
in addition to my patch.



But what I don't like is that the current patch cannot handle the trap explained
below. What about marking the current OOM victim unkillable by updating
victim->signal->oom_score_adj to OOM_SCORE_ADJ_MIN and clearing the victim's
TIF_MEMDIE flag if the victim is still alive one second after
oom_reap_vmas() completed? In this way, my worry (2) at
http://lkml.kernel.org/r/201510121543.EJF21858.LtJFHOOOSQVMFF@I-love.SAKURA.ne.jp
(though this trap is not a mmap_sem livelock) will be gone. That is,
holding the victim's task_struct rather than the victim's mm will do better.
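
Roughly, the handling I have in mind would look like the sketch below (the
helper name and the one-second timer wiring are hypothetical, and a real
patch would also have to do the exit_oom_victim()-style bookkeeping):

	/* Hypothetical sketch: give up on a victim which is still alive one
	 * second after oom_reap_vmas() finished, so that the OOM killer can
	 * select another task, and exempt the stuck victim from further
	 * selection. */
	static void oom_give_up_victim(struct task_struct *victim)
	{
		victim->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
		clear_tsk_thread_flag(victim, TIF_MEMDIE);
	}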

---------- oom-write.c start ----------
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
	unsigned long size;
	char *buf = NULL;
	unsigned long i;
	/* spawn 10 children which exec ./write with stdout redirected to /tmp/file */
	for (i = 0; i < 10; i++) {
		if (fork() == 0) {
			close(1);
			open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);
			execl("./write", "./write", NULL);
			_exit(1);
		}
	}
	/* grow buf to the largest size realloc() will still grant */
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sleep(5);
	/* Will cause OOM due to overcommit */
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	pause();
	return 0;
}
---------- oom-write.c end ----------

----- write.asm start -----
; nasm -f elf write.asm && ld -s -m elf_i386 -o write write.o
section .text
    CPU 386
    global _start
_start:
; while (write(1, buf, 4096) == 4096);
    mov eax, 4 ; NR_write
    mov ebx, 1
    mov ecx, _start - 96
    mov edx, 4096
    int 0x80
    cmp eax, 4096
    je _start
; pause();
    mov eax, 29 ; NR_pause
    int 0x80
; _exit(0);
    mov eax, 1 ; NR_exit
    mov ebx, 0
    int 0x80
----- write.asm end -----

What is happening with this trap:

  (1) out_of_memory() chose oom-write(3805) which consumed most memory.
  (2) oom_kill_process() chose first write(3806) which is one of children
      of oom-write(3805).
  (3) oom_reaper() reclaimed write(3806)'s memory which consumed only
      a few pages.
  (4) out_of_memory() chose oom-write(3805) again.
  (5) oom_kill_process() chose second write(3807) which is one of children
      of oom-write(3805).
  (6) oom_reaper() reclaimed write(3807)'s memory which consumed only
      a few pages.
  (7) second write(3807) got blocked on an unkillable mutex held by first
      write(3806), while first write(3806) kept waiting for second write(3807)
      to release more memory even after oom_reaper() completed.
  (8) eventually first write(3806) terminated successfully, but
      second write(3807) remained stuck.
  (9) irqbalance(1710) got memory before second write(3807)
      could make forward progress.

----------
[   78.157198] oom-write invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO)
(...snipped...)
[   78.325409] [ 3805]  1000  3805   541715   357876     708       6        0             0 oom-write
[   78.327978] [ 3806]  1000  3806       39        1       3       2        0             0 write
[   78.330149] [ 3807]  1000  3807       39        1       3       2        0             0 write
[   78.332167] [ 3808]  1000  3808       39        1       3       2        0             0 write
[   78.334488] [ 3809]  1000  3809       39        1       3       2        0             0 write
[   78.336471] [ 3810]  1000  3810       39        1       3       2        0             0 write
[   78.338414] [ 3811]  1000  3811       39        1       3       2        0             0 write
[   78.340709] [ 3812]  1000  3812       39        1       3       2        0             0 write
[   78.342711] [ 3813]  1000  3813       39        1       3       2        0             0 write
[   78.344727] [ 3814]  1000  3814       39        1       3       2        0             0 write
[   78.346613] [ 3815]  1000  3815       39        1       3       2        0             0 write
[   78.348829] Out of memory: Kill process 3805 (oom-write) score 808 or sacrifice child
[   78.350818] Killed process 3806 (write) total-vm:156kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[   78.455314] oom-write invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO)
(...snipped...)
[   78.631333] [ 3805]  1000  3805   541715   361440     715       6        0             0 oom-write
[   78.633802] [ 3807]  1000  3807       39        1       3       2        0             0 write
[   78.635977] [ 3808]  1000  3808       39        1       3       2        0             0 write
[   78.638325] [ 3809]  1000  3809       39        1       3       2        0             0 write
[   78.640463] [ 3810]  1000  3810       39        1       3       2        0             0 write
[   78.642837] [ 3811]  1000  3811       39        1       3       2        0             0 write
[   78.644924] [ 3812]  1000  3812       39        1       3       2        0             0 write
[   78.646990] [ 3813]  1000  3813       39        1       3       2        0             0 write
[   78.649039] [ 3814]  1000  3814       39        1       3       2        0             0 write
[   78.651242] [ 3815]  1000  3815       39        1       3       2        0             0 write
[   78.653326] Out of memory: Kill process 3805 (oom-write) score 816 or sacrifice child
[   78.655235] Killed process 3807 (write) total-vm:156kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[   88.776446] MemAlloc-Info: 1 stalling task, 1 dying task, 1 victim task.
[   88.778228] MemAlloc: systemd-journal(481) seq=17 gfp=0x24280ca order=0 delay=10000
[   88.780158] MemAlloc: write(3807) uninterruptible dying victim
(...snipped...)
[   98.915687] MemAlloc-Info: 8 stalling task, 1 dying task, 1 victim task.
[   98.917888] MemAlloc: kthreadd(2) seq=12 gfp=0x27000c0 order=2 delay=14885 uninterruptible
[   98.920297] MemAlloc: systemd-journal(481) seq=17 gfp=0x24280ca order=0 delay=20139
[   98.922652] MemAlloc: irqbalance(1710) seq=3 gfp=0x24280ca order=0 delay=16231
[   98.924874] MemAlloc: vmtoolsd(1908) seq=1 gfp=0x2400240 order=0 delay=20044
[   98.927043] MemAlloc: pickup(3680) seq=1 gfp=0x2400240 order=0 delay=10230 uninterruptible
[   98.929405] MemAlloc: nmbd(3713) seq=1 gfp=0x2400240 order=0 delay=14716
[   98.931559] MemAlloc: oom-write(3805) seq=12718 gfp=0x24280ca order=0 delay=14887
[   98.933843] MemAlloc: write(3806) seq=29813 gfp=0x2400240 order=0 delay=14887 uninterruptible exiting
[   98.936460] MemAlloc: write(3807) uninterruptible dying victim
(...snipped...)
[  140.356230] MemAlloc-Info: 9 stalling task, 1 dying task, 1 victim task.
[  140.358448] MemAlloc: kthreadd(2) seq=12 gfp=0x27000c0 order=2 delay=56326 uninterruptible
[  140.360979] MemAlloc: systemd-journal(481) seq=17 gfp=0x24280ca order=0 delay=61580 uninterruptible
[  140.363716] MemAlloc: irqbalance(1710) seq=3 gfp=0x24280ca order=0 delay=57672
[  140.365983] MemAlloc: vmtoolsd(1908) seq=1 gfp=0x2400240 order=0 delay=61485 uninterruptible
[  140.368521] MemAlloc: pickup(3680) seq=1 gfp=0x2400240 order=0 delay=51671 uninterruptible
[  140.371128] MemAlloc: nmbd(3713) seq=1 gfp=0x2400240 order=0 delay=56157 uninterruptible
[  140.373548] MemAlloc: smbd(3734) seq=1 gfp=0x27000c0 order=2 delay=48147
[  140.375722] MemAlloc: oom-write(3805) seq=12718 gfp=0x24280ca order=0 delay=56328 uninterruptible
[  140.378647] MemAlloc: write(3806) seq=29813 gfp=0x2400240 order=0 delay=56328 exiting
[  140.381695] MemAlloc: write(3807) uninterruptible dying victim
(...snipped...)
[  150.493557] MemAlloc-Info: 7 stalling task, 1 dying task, 1 victim task.
[  150.495725] MemAlloc: kthreadd(2) seq=12 gfp=0x27000c0 order=2 delay=66463
[  150.497897] MemAlloc: systemd-journal(481) seq=17 gfp=0x24280ca order=0 delay=71717 uninterruptible
[  150.500490] MemAlloc: vmtoolsd(1908) seq=1 gfp=0x2400240 order=0 delay=71622 uninterruptible
[  150.502940] MemAlloc: pickup(3680) seq=1 gfp=0x2400240 order=0 delay=61808
[  150.505122] MemAlloc: nmbd(3713) seq=1 gfp=0x2400240 order=0 delay=66294 uninterruptible
[  150.507521] MemAlloc: smbd(3734) seq=1 gfp=0x27000c0 order=2 delay=58284
[  150.509678] MemAlloc: oom-write(3805) seq=12718 gfp=0x24280ca order=0 delay=66465 uninterruptible
[  150.512333] MemAlloc: write(3807) uninterruptible dying victim
----------
Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151205.txt.xz .

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH -v2] mm, oom: introduce oom reaper
  2015-12-05 12:33           ` Tetsuo Handa
@ 2015-12-07 16:07             ` Michal Hocko
  -1 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2015-12-07 16:07 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, akpm, torvalds, mgorman, rientjes, riel, hughd, oleg,
	andrea, linux-kernel

On Sat 05-12-15 21:33:47, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sun 29-11-15 01:10:10, Tetsuo Handa wrote:
> > > Tetsuo Handa wrote:
> > > > > Users of mmap_sem which need it for write should be carefully reviewed
> > > > > to use _killable waiting as much as possible and reduce allocations
> > > > > requests done with the lock held to absolute minimum to reduce the risk
> > > > > even further.
> > > > 
> > > > It will be nice if we can have down_write_killable()/down_read_killable().
> > > 
> > > It will be nice if we can also have __GFP_KILLABLE.
> > 
> > Well, we already do this implicitly because OOM killer will
> > automatically do mark_oom_victim if it has fatal_signal_pending and then
> > __alloc_pages_slowpath fails the allocation if the memory reserves do
> > not help to finish the allocation.
> 
> I don't think so because !__GFP_FS && !__GFP_NOFAIL allocations do not do
> mark_oom_victim() even if fatal_signal_pending() is true because
> out_of_memory() is not called.

OK you are right about GFP_NOFS allocations. I didn't consider them
because I still think that GFP_NOFS needs a separate solution. Ideally a
systematic one, but if this is really urgent and cannot wait for it then
we can at least check for fatal_signal_pending in __alloc_pages_may_oom.
Something like:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 728b7a129df3..42a78aee36f3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2754,9 +2754,11 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 			/*
 			 * XXX: Page reclaim didn't yield anything,
 			 * and the OOM killer can't be invoked, but
-			 * keep looping as per tradition.
+			 * keep looping as per tradition. Do not bother
+			 * if we are killed already though
 			 */
-			*did_some_progress = 1;
+			if (!fatal_signal_pending(current))
+				*did_some_progress = 1;
 			goto out;
 		}
 		if (pm_suspended_storage())

__GFP_KILLABLE sounds like adding more mess to the current situation
to me. A weak reclaim context which is unkillable is simply bad
design. Adding __GFP_KILLABLE doesn't make it work properly.

But this is unrelated I believe and should be discussed in a separate
email thread.

[...]

> > > Although currently it can't
> > > be perfect because reclaim functions called from __alloc_pages_slowpath() use
> > > unkillable waits, starting from just bail out as with __GFP_NORETRY when
> > > fatal_signal_pending(current) is true will be helpful.
> > > 
> > > So far I'm hitting no problem with testers except the one using mmap()/munmap().
> > > 
> > > I think that cmpxchg() was not needed.
> > 
> > It is not needed right now but I would rather not depend on the oom
> > mutex here. This is not a hot path where an atomic would add an
> > overhead.
> 
> Current patch can allow oom_reaper() to call mmdrop(mm) before
> wake_oom_reaper() calls atomic_inc(&mm->mm_count) because sequence like
> 
>   oom_reaper() (a realtime thread)         wake_oom_reaper() (current thread)         Current OOM victim
> 
>   oom_reap_vmas(mm); /* mm = Previous OOM victim */
>   WRITE_ONCE(mm_to_reap, NULL);
>                                            old_mm = cmpxchg(&mm_to_reap, NULL, mm); /* mm = Current OOM victim */
>                                            if (!old_mm) {
>   wait_event_freezable(oom_reaper_wait, (mm = READ_ONCE(mm_to_reap)));
>   oom_reap_vmas(mm); /* mm = Current OOM victim, undo atomic_inc(&mm->mm_count) done by oom_kill_process() */
>   WRITE_ONCE(mm_to_reap, NULL);
>                                                                                       exit and release mm
>                                            atomic_inc(&mm->mm_count); /* mm = Current OOM victim */
>                                            wake_up(&oom_reaper_wait);
> 
>   wait_event_freezable(oom_reaper_wait, (mm = READ_ONCE(mm_to_reap))); /* mm = Next OOM victim */
> 
> is possible.

Yes you are right! The reference count should be incremented before
publishing the new mm_to_reap. I thought that an elevated ref. count by
the caller would be enough but this was clearly wrong. Does the update
below look better?

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3431dcdb0a13..32ebf84795d8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -511,19 +511,21 @@ static void wake_oom_reaper(struct mm_struct *mm)
 		return;
 
 	/*
+	 * Pin the given mm. Use mm_count instead of mm_users because
+	 * we do not want to delay the address space tear down.
+	 */
+	atomic_inc(&mm->mm_count);
+
+	/*
 	 * Make sure that only a single mm is ever queued for the reaper
 	 * because multiple are not necessary and the operation might be
 	 * disruptive so better reduce it to the bare minimum.
 	 */
 	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
-	if (!old_mm) {
-		/*
-		 * Pin the given mm. Use mm_count instead of mm_users because
-		 * we do not want to delay the address space tear down.
-		 */
-		atomic_inc(&mm->mm_count);
+	if (!old_mm)
 		wake_up(&oom_reaper_wait);
-	}
+	else
+		mmdrop(mm);
 }
 
 /**


> If you are serious about execution ordering, we should protect mm_to_reap
> using smp_mb__after_atomic_inc(), rcu_assign_pointer()/rcu_dereference() etc.
> in addition to my patch.

cmpxchg should imply a full memory barrier on success AFAIR, so the
increment should be visible before the new mm is published. mmdrop could
simply be atomic_dec because our caller should already hold a reference,
but I guess we do not have to optimize this super slow path and should
rather be consistent with the rest of the code which decrements via mmdrop.
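
To spell the ordering out (comments added here only for illustration; the
code is the updated wake_oom_reaper() from the diff above):

	atomic_inc(&mm->mm_count);	/* A: pin the mm */
	/*
	 * cmpxchg implies a full barrier on success, so A is visible
	 * before the oom_reaper thread can observe the new mm_to_reap
	 * and eventually mmdrop() it.
	 */
	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
	if (!old_mm)
		wake_up(&oom_reaper_wait);
	else
		mmdrop(mm);	/* another mm is already queued, drop our pin */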

[...]

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH -v2] mm, oom: introduce oom reaper
  2015-12-07 16:07             ` Michal Hocko
@ 2015-12-07 22:19               ` Tetsuo Handa
  -1 siblings, 0 replies; 38+ messages in thread
From: Tetsuo Handa @ 2015-12-07 22:19 UTC (permalink / raw)
  To: mhocko
  Cc: linux-mm, akpm, torvalds, mgorman, rientjes, riel, hughd, oleg,
	andrea, linux-kernel

Michal Hocko wrote:
> Yes you are right! The reference count should be incremented before
> publishing the new mm_to_reap. I thought that an elevated ref. count by
> the caller would be enough but this was clearly wrong. Does the update
> below look better?

I think that moving mmdrop() from oom_kill_process() to
oom_reap_vmas() xor wake_oom_reaper() makes the patch simpler.

 	rcu_read_unlock();
 
+	if (can_oom_reap)
+		wake_oom_reaper(mm); /* will call mmdrop() */
+	else
+		mmdrop(mm);
-	mmdrop(mm);
 	put_task_struct(victim);
 }

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH -v2] mm, oom: introduce oom reaper
  2015-12-07 22:19               ` Tetsuo Handa
@ 2015-12-08 11:06                 ` Michal Hocko
  -1 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2015-12-08 11:06 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, akpm, torvalds, mgorman, rientjes, riel, hughd, oleg,
	andrea, linux-kernel

On Tue 08-12-15 07:19:42, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Yes you are right! The reference count should be incremented before
> > publishing the new mm_to_reap. I thought that an elevated ref. count by
> > the caller would be enough but this was clearly wrong. Does the update
> > below look better?
> 
> I think that moving mmdrop() from oom_kill_process() to
> oom_reap_vmas() xor wake_oom_reaper() makes the patch simpler.

It surely is fewer lines of code, but I am not sure it is simpler. I do
not think we should drop the reference in a different path than the one
where it is taken. Maybe we will grow more users of wake_oom_reaper in
the future, and this is quite subtle behavior.

> 
>  	rcu_read_unlock();
>  
> +	if (can_oom_reap)
> +		wake_oom_reaper(mm); /* will call mmdrop() */
> +	else
> +		mmdrop(mm);
> -	mmdrop(mm);
>  	put_task_struct(victim);
>  }

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2015-12-08 11:06 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-25 15:56 [RFC PATCH] mm, oom: introduce oom reaper Michal Hocko
2015-11-25 15:56 ` Michal Hocko
2015-11-25 20:08 ` Johannes Weiner
2015-11-25 20:08   ` Johannes Weiner
2015-11-26 11:08   ` Michal Hocko
2015-11-26 11:08     ` Michal Hocko
2015-11-26 15:24     ` Tetsuo Handa
2015-11-26 15:24       ` Tetsuo Handa
2015-11-26 16:34       ` Michal Hocko
2015-11-26 16:34         ` Michal Hocko
2015-11-26 17:31         ` Michal Hocko
2015-11-26 17:31           ` Michal Hocko
2015-11-27 11:29         ` Tetsuo Handa
2015-11-27 11:29           ` Tetsuo Handa
2015-11-27 12:35           ` Michal Hocko
2015-11-27 12:35             ` Michal Hocko
2015-11-27 16:12 ` [RFC PATCH -v2] " Michal Hocko
2015-11-27 16:12   ` Michal Hocko
2015-11-27 16:39   ` Mel Gorman
2015-11-27 16:39     ` Mel Gorman
2015-12-01 13:07     ` Michal Hocko
2015-12-01 13:07       ` Michal Hocko
2015-11-28  4:39   ` Tetsuo Handa
2015-11-28  4:39     ` Tetsuo Handa
2015-11-28 16:10     ` Tetsuo Handa
2015-11-28 16:10       ` Tetsuo Handa
2015-12-01 13:29       ` Michal Hocko
2015-12-01 13:29         ` Michal Hocko
2015-12-05 12:33         ` Tetsuo Handa
2015-12-05 12:33           ` Tetsuo Handa
2015-12-07 16:07           ` Michal Hocko
2015-12-07 16:07             ` Michal Hocko
2015-12-07 22:19             ` Tetsuo Handa
2015-12-07 22:19               ` Tetsuo Handa
2015-12-08 11:06               ` Michal Hocko
2015-12-08 11:06                 ` Michal Hocko
2015-12-01 13:22     ` Michal Hocko
2015-12-01 13:22       ` Michal Hocko
