* [RFC PATCH v2 0/3] oom: rework oom_reaper vs. exit_mmap handoff
@ 2018-10-25  8:24 Michal Hocko
  2018-10-25  8:24 ` [RFC PATCH v2 1/3] mm, oom: rework mmap_exit vs. oom_reaper synchronization Michal Hocko
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Michal Hocko @ 2018-10-25  8:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, Roman Gushchin, David Rientjes, Andrew Morton, LKML

The previous version of this RFC has been posted here [1]. I have fixed a
few issues spotted during the review and by the 0day bot. I have also
reworked patch 2 to be ratio based rather than based on an absolute number.

With this series applied the locking protocol between the oom_reaper and
the exit path is as follows.

All parts of the exit path that cannot safely race with the oom_reaper take
the exclusive lock. Once the exit path has passed the point where no blocking
locks are taken, it clears mm->mmap under the exclusive lock. The oom_reaper
checks for this and sets MMF_OOM_SKIP only if the exit path is not guaranteed
to finish the job. This is patch 3, so see its changelog for all the details.
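
Schematically, the resulting protocol looks as follows (a condensed sketch
only; the authoritative code is in the patches below):

	/* exit path (exit_mmap), oom victim, simplified: */
	unmap_vmas(&tlb, vma, 0, -1);
	down_write(&mm->mmap_sem);
	__unlink_vmas(vma);
	mm->mmap = NULL;			/* hand-over point */
	up_write(&mm->mmap_sem);
	__free_pgtables(&tlb, vma, ...);	/* no blocking locks needed from here on */
	[...]
	set_bit(MMF_OOM_SKIP, &mm->flags);

	/* oom_reaper side (oom_reap_task_mm), simplified: */
	if (!down_read_trylock(&mm->mmap_sem))
		return false;			/* retry later */
	if (!mm->mmap)				/* the exit path has taken over */
		goto out_unlock;		/* do not set MMF_OOM_SKIP */
	__oom_reap_task_mm(mm, original_badness);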

I would really appreciate it if David could give this a try and see how it
behaves in workloads where the oom_reaper currently falls flat. I have been
playing with sparsely allocated memory with a high pte/real memory ratio and
large mlocked processes, and it worked reasonably well.

There is still some room for tuning here, of course. We can change the
number of retries for the oom_reaper as well as the threshold above which we
keep retrying.

Michal Hocko (3):
      mm, oom: rework mmap_exit vs. oom_reaper synchronization
      mm, oom: keep retrying the oom_reap operation as long as there is substantial memory left
      mm, oom: hand over MMF_OOM_SKIP to exit path if it is guranteed to finish

Diffstat:
 include/linux/oom.h |  2 --
 mm/internal.h       |  3 +++
 mm/memory.c         | 28 ++++++++++++++--------
 mm/mmap.c           | 69 +++++++++++++++++++++++++++++++++--------------------
 mm/oom_kill.c       | 45 ++++++++++++++++++++++++----------
 5 files changed, 97 insertions(+), 50 deletions(-)

[1] http://lkml.kernel.org/r/20180910125513.311-1-mhocko@kernel.org




* [RFC PATCH v2 1/3] mm, oom: rework mmap_exit vs. oom_reaper synchronization
  2018-10-25  8:24 [RFC PATCH v2 0/3] oom: rework oom_reaper vs. exit_mmap handoff Michal Hocko
@ 2018-10-25  8:24 ` Michal Hocko
  2018-10-25  8:24 ` [RFC PATCH v2 2/3] mm, oom: keep retrying the oom_reap operation as long as there is substantial memory left Michal Hocko
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2018-10-25  8:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, Roman Gushchin, David Rientjes, Andrew Morton,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The oom_reaper cannot handle mlocked vmas right now, so exit_mmap currently
reaps the memory itself before it clears the mlock flags on the mappings.
This is all good, but we would like to have a better hand-over protocol
between the oom_reaper and exit_mmap paths.

Therefore, use the exclusive mmap_sem in exit_mmap whenever exit_mmap has to
synchronize with the oom_reaper. There are two notable places: mlocked vmas
(munlock_vma_pages_all) and the page table tear down path. Everything else is
fine to race with oom_reap_task_mm.

This is mostly a preparatory patch which shouldn't introduce functional
changes.

Changes since RFC
- move MMF_OOM_SKIP in exit_mmap to before we are going to free page
  tables.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h |  2 --
 mm/mmap.c           | 50 ++++++++++++++++++++++-----------------------
 mm/oom_kill.c       |  4 ++--
 3 files changed, 27 insertions(+), 29 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 69864a547663..11e26ca565a7 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -95,8 +95,6 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
 	return 0;
 }
 
-bool __oom_reap_task_mm(struct mm_struct *mm);
-
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
diff --git a/mm/mmap.c b/mm/mmap.c
index 5f2b2b184c60..a02b314c0546 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3042,39 +3042,29 @@ void exit_mmap(struct mm_struct *mm)
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
+	bool oom = mm_is_oom_victim(mm);
 
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
-	if (unlikely(mm_is_oom_victim(mm))) {
-		/*
-		 * Manually reap the mm to free as much memory as possible.
-		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
-		 * this mm from further consideration.  Taking mm->mmap_sem for
-		 * write after setting MMF_OOM_SKIP will guarantee that the oom
-		 * reaper will not run on this mm again after mmap_sem is
-		 * dropped.
-		 *
-		 * Nothing can be holding mm->mmap_sem here and the above call
-		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
-		 * __oom_reap_task_mm() will not block.
-		 *
-		 * This needs to be done before calling munlock_vma_pages_all(),
-		 * which clears VM_LOCKED, otherwise the oom reaper cannot
-		 * reliably test it.
-		 */
-		(void)__oom_reap_task_mm(mm);
-
-		set_bit(MMF_OOM_SKIP, &mm->flags);
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
-	}
-
 	if (mm->locked_vm) {
 		vma = mm->mmap;
 		while (vma) {
-			if (vma->vm_flags & VM_LOCKED)
+			if (vma->vm_flags & VM_LOCKED) {
+				/*
+				 * oom_reaper cannot handle mlocked vmas but we
+				 * need to serialize it with munlock_vma_pages_all
+				 * which clears VM_LOCKED, otherwise the oom reaper
+				 * cannot reliably test it.
+				 */
+				if (oom)
+					down_write(&mm->mmap_sem);
+
 				munlock_vma_pages_all(vma);
+
+				if (oom)
+					up_write(&mm->mmap_sem);
+			}
 			vma = vma->vm_next;
 		}
 	}
@@ -3091,6 +3081,13 @@ void exit_mmap(struct mm_struct *mm)
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
+
+	/* oom_reaper cannot race with the page tables teardown */
+	if (oom) {
+		down_write(&mm->mmap_sem);
+		set_bit(MMF_OOM_SKIP, &mm->flags);
+	}
+
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
 
@@ -3104,6 +3101,9 @@ void exit_mmap(struct mm_struct *mm)
 		vma = remove_vma(vma);
 	}
 	vm_unacct_memory(nr_accounted);
+
+	if (oom)
+		up_write(&mm->mmap_sem);
 }
 
 /* Insert vm structure into process list sorted by address
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f10aa5360616..b3b2c2bbd8ab 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -488,7 +488,7 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-bool __oom_reap_task_mm(struct mm_struct *mm)
+static bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 	bool ret = true;
@@ -554,7 +554,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
 	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
 	 * under mmap_sem for reading because it serializes against the
-	 * down_write();up_write() cycle in exit_mmap().
+	 * down_write() in exit_mmap().
 	 */
 	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
 		trace_skip_task_reaping(tsk->pid);
-- 
2.19.1



* [RFC PATCH v2 2/3] mm, oom: keep retrying the oom_reap operation as long as there is substantial memory left
  2018-10-25  8:24 [RFC PATCH v2 0/3] oom: rework oom_reaper vs. exit_mmap handoff Michal Hocko
  2018-10-25  8:24 ` [RFC PATCH v2 1/3] mm, oom: rework mmap_exit vs. oom_reaper synchronization Michal Hocko
@ 2018-10-25  8:24 ` Michal Hocko
  2018-10-25  8:24 ` [RFC PATCH v2 3/3] mm, oom: hand over MMF_OOM_SKIP to exit path if it is guranteed to finish Michal Hocko
  2018-11-08  9:32 ` [RFC PATCH v2 0/3] oom: rework oom_reaper vs. exit_mmap handoff Michal Hocko
  3 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2018-10-25  8:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, Roman Gushchin, David Rientjes, Andrew Morton,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The oom_reaper is not able to reap all types of memory, e.g. mlocked
mappings or page tables. In some cases this might be a lot of memory
and we do rely on exit_mmap to release that memory. Yet we cannot rely
on exit_mmap to set MMF_OOM_SKIP right now because there are several
places where sleeping locks are taken.

This patch adds a simple heuristic to check how much memory the mm is
still sitting on after the oom_reaper is done with it. If this is still
a considerable portion of the original memory (more than 1/4 of the
original badness) then simply keep retrying oom_reap_task_mm. The
ratio is quite arbitrary and a subject of future tuning based on real
life use cases, but it should catch outliers where the vast majority
of the memory is consumed by non-reapable memory.
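
As a concrete illustration of the ratio (numbers picked purely for
illustration): a victim whose original badness was 1,000,000 pages (roughly
4GB of rss, swap entries and page tables with 4kB pages) keeps the reaper
retrying as long as more than 250,000 pages (roughly 1GB) remain charged to
the mm:

	/* the check this patch adds to __oom_reap_task_mm() */
	if (oom_badness_pages(mm) > (original_badness >> 2))
		ret = false;	/* oom_reap_task() retries, up to MAX_OOM_REAP_RETRIES times */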

Changes since RFC
- do not use hardcoded threshold for retry and rather use a portion of
  the original badness instead

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b3b2c2bbd8ab..ab42717661dc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -189,6 +189,16 @@ static bool is_dump_unreclaim_slabs(void)
 	return (global_node_page_state(NR_SLAB_UNRECLAIMABLE) > nr_lru);
 }
 
+/*
+ * Rough memory consumption of the given mm which should be theoretically freed
+ * when the mm is removed.
+ */
+static unsigned long oom_badness_pages(struct mm_struct *mm)
+{
+	return get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS) +
+		mm_pgtables_bytes(mm) / PAGE_SIZE;
+}
+
 /**
  * oom_badness - heuristic function to determine which candidate task to kill
  * @p: task struct of which task we should calculate
@@ -230,8 +240,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
 	 * The baseline for the badness score is the proportion of RAM that each
 	 * task's rss, pagetable and swap space use.
 	 */
-	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
-		mm_pgtables_bytes(p->mm) / PAGE_SIZE;
+	points = oom_badness_pages(p->mm);
 	task_unlock(p);
 
 	/* Normalize to oom_score_adj units */
@@ -488,7 +497,7 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-static bool __oom_reap_task_mm(struct mm_struct *mm)
+static bool __oom_reap_task_mm(struct mm_struct *mm, unsigned long original_badness)
 {
 	struct vm_area_struct *vma;
 	bool ret = true;
@@ -532,6 +541,16 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 		}
 	}
 
+	/*
+	 * If we still sit on a noticeable amount of memory even after successfully
+	 * reaping the address space then keep retrying until exit_mmap makes some
+	 * further progress.
+	 * TODO: add a flag for a stage when the exit path doesn't block anymore
+	 * and hand over MMF_OOM_SKIP handling there in that case
+	 */
+	if (oom_badness_pages(mm) > (original_badness >> 2))
+		ret = false;
+
 	return ret;
 }
 
@@ -541,7 +560,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
  * Returns true on success and false if none or part of the address space
  * has been reclaimed and the caller should retry later.
  */
-static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm, unsigned long original_badness)
 {
 	bool ret = true;
 
@@ -564,7 +583,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	trace_start_task_reaping(tsk->pid);
 
 	/* failed to reap part of the address space. Try again later */
-	ret = __oom_reap_task_mm(mm);
+	ret = __oom_reap_task_mm(mm, original_badness);
 	if (!ret)
 		goto out_finish;
 
@@ -586,9 +605,10 @@ static void oom_reap_task(struct task_struct *tsk)
 {
 	int attempts = 0;
 	struct mm_struct *mm = tsk->signal->oom_mm;
+	unsigned long original_badness = oom_badness_pages(mm);
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
+	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm, original_badness))
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts <= MAX_OOM_REAP_RETRIES ||
-- 
2.19.1



* [RFC PATCH v2 3/3] mm, oom: hand over MMF_OOM_SKIP to exit path if it is guranteed to finish
  2018-10-25  8:24 [RFC PATCH v2 0/3] oom: rework oom_reaper vs. exit_mmap handoff Michal Hocko
  2018-10-25  8:24 ` [RFC PATCH v2 1/3] mm, oom: rework mmap_exit vs. oom_reaper synchronization Michal Hocko
  2018-10-25  8:24 ` [RFC PATCH v2 2/3] mm, oom: keep retrying the oom_reap operation as long as there is substantial memory left Michal Hocko
@ 2018-10-25  8:24 ` Michal Hocko
  2018-10-30  4:45   ` Tetsuo Handa
  2018-11-08  9:32 ` [RFC PATCH v2 0/3] oom: rework oom_reaper vs. exit_mmap handoff Michal Hocko
  3 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2018-10-25  8:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Tetsuo Handa, Roman Gushchin, David Rientjes, Andrew Morton,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

David Rientjes has noted that certain user space memory allocators leave
a lot of page tables behind and the current implementation of the oom_reaper
doesn't deal with those workloads very well. In order to improve such
workloads, define a point after which exit_mmap is guaranteed to finish the
tear down without any further blocking. This is right after we unlink the
vmas (those still depend on locks which are held while performing memory
allocations from other contexts) and before we start releasing the page
tables.

Open-code free_pgtables and explicitly unlink all vmas first. Then set
mm->mmap to NULL (there shouldn't be anybody looking at it at this
stage) and check for mm->mmap in the oom_reaper path. If mm->mmap
is NULL, we rely on the exit path to finish the job and won't set
MMF_OOM_SKIP from the reaper.

Changes since RFC
- the task is still visible to the OOM killer after exit_mmap terminates
  so we should set MMF_OOM_SKIP from that path to be sure the oom killer
  doesn't get stuck on this task (see d7a94e7e11badf84 for more context)
  - per Tetsuo
- split free_pgtables into an unlinking part and the actual freeing part. We
  cannot rely on free_pgd_range alone because hugetlb pages on ppc and sparc
  do their own tear down

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/internal.h |  3 +++
 mm/memory.c   | 28 ++++++++++++++++++----------
 mm/mmap.c     | 25 +++++++++++++++++++++----
 mm/oom_kill.c | 13 +++++++------
 4 files changed, 49 insertions(+), 20 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 87256ae1bef8..35adbfec4935 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -40,6 +40,9 @@ void page_writeback_init(void);
 
 vm_fault_t do_swap_page(struct vm_fault *vmf);
 
+void __unlink_vmas(struct vm_area_struct *vma);
+void __free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
+		unsigned long floor, unsigned long ceiling);
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
diff --git a/mm/memory.c b/mm/memory.c
index c467102a5cbc..cf910ed5f283 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -612,20 +612,23 @@ void free_pgd_range(struct mmu_gather *tlb,
 	} while (pgd++, addr = next, addr != end);
 }
 
-void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
+void __unlink_vmas(struct vm_area_struct *vma)
+{
+	while (vma) {
+		unlink_anon_vmas(vma);
+		unlink_file_vma(vma);
+		vma = vma->vm_next;
+	}
+}
+
+/* expects that __unlink_vmas has been called already */
+void __free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		unsigned long floor, unsigned long ceiling)
 {
 	while (vma) {
 		struct vm_area_struct *next = vma->vm_next;
 		unsigned long addr = vma->vm_start;
 
-		/*
-		 * Hide vma from rmap and truncate_pagecache before freeing
-		 * pgtables
-		 */
-		unlink_anon_vmas(vma);
-		unlink_file_vma(vma);
-
 		if (is_vm_hugetlb_page(vma)) {
 			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
 				floor, next ? next->vm_start : ceiling);
@@ -637,8 +640,6 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			       && !is_vm_hugetlb_page(next)) {
 				vma = next;
 				next = vma->vm_next;
-				unlink_anon_vmas(vma);
-				unlink_file_vma(vma);
 			}
 			free_pgd_range(tlb, addr, vma->vm_end,
 				floor, next ? next->vm_start : ceiling);
@@ -647,6 +648,13 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	}
 }
 
+void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		unsigned long floor, unsigned long ceiling)
+{
+	__unlink_vmas(vma);
+	__free_pgtables(tlb, vma, floor, ceiling);
+}
+
 int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
 {
 	spinlock_t *ptl;
diff --git a/mm/mmap.c b/mm/mmap.c
index a02b314c0546..f4b562e21764 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3082,13 +3082,26 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
 
-	/* oom_reaper cannot race with the page tables teardown */
+	/*
+	 * oom_reaper cannot race with the page tables teardown but we
+	 * want to make sure that the exit path can take over the full
+	 * tear down when it is safe to do so
+	 */
 	if (oom) {
 		down_write(&mm->mmap_sem);
-		set_bit(MMF_OOM_SKIP, &mm->flags);
+		__unlink_vmas(vma);
+		/*
+		 * the exit path is guaranteed to finish the memory tear down
+		 * without any unbound blocking at this stage so make it clear
+		 * to the oom_reaper
+		 */
+		mm->mmap = NULL;
+		up_write(&mm->mmap_sem);
+		__free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
+	} else {
+		free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	}
 
-	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
 
 	/*
@@ -3102,8 +3115,12 @@ void exit_mmap(struct mm_struct *mm)
 	}
 	vm_unacct_memory(nr_accounted);
 
+	/*
+	 * Now that the full address space is torn down, make sure the
+	 * OOM killer skips over this task
+	 */
 	if (oom)
-		up_write(&mm->mmap_sem);
+		set_bit(MMF_OOM_SKIP, &mm->flags);
 }
 
 /* Insert vm structure into process list sorted by address
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ab42717661dc..db1ebb45c66a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -570,12 +570,10 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm, unsi
 	}
 
 	/*
-	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
-	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
-	 * under mmap_sem for reading because it serializes against the
-	 * down_write() in exit_mmap().
+	 * If the exit path has cleared mm->mmap then we know it will finish
+	 * the tear down and we can bail out here.
 	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+	if (!mm->mmap) {
 		trace_skip_task_reaping(tsk->pid);
 		goto out_unlock;
 	}
@@ -625,8 +623,11 @@ static void oom_reap_task(struct task_struct *tsk)
 	/*
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
+	 * Leave the MMF_OOM_SKIP to the exit path if it managed to reach the
+	 * point where it is guaranteed to finish without any blocking
 	 */
-	set_bit(MMF_OOM_SKIP, &mm->flags);
+	if (mm->mmap)
+		set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
-- 
2.19.1



* Re: [RFC PATCH v2 3/3] mm, oom: hand over MMF_OOM_SKIP to exit path if it is guranteed to finish
  2018-10-25  8:24 ` [RFC PATCH v2 3/3] mm, oom: hand over MMF_OOM_SKIP to exit path if it is guranteed to finish Michal Hocko
@ 2018-10-30  4:45   ` Tetsuo Handa
  2018-10-30  6:31     ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2018-10-30  4:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Tetsuo Handa, Roman Gushchin, David Rientjes,
	Andrew Morton, LKML, Michal Hocko

Michal Hocko wrote:
> @@ -3156,6 +3166,13 @@ void exit_mmap(struct mm_struct *mm)
>                 vma = remove_vma(vma);
>         }
>         vm_unacct_memory(nr_accounted);
> +
> +       /*
> +        * Now that the full address space is torn down, make sure the
> +        * OOM killer skips over this task
> +        */
> +       if (oom)
> +               set_bit(MMF_OOM_SKIP, &mm->flags);
>  }
> 
>  /* Insert vm structure into process list sorted by address

I don't like setting MMF_OOM_SKIP after the remove_vma() loop. About 50 users
might call vma->vm_ops->close() from remove_vma(). Some of them are doing fs
writeback, and some of them might be doing a GFP_KERNEL allocation from
vma->vm_ops->open() with a lock also held by vma->vm_ops->close().
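
For illustration only, a hypothetical ->open()/->close() pair with such a
dependency (not taken from any of the drivers listed below):

	static DEFINE_MUTEX(foo_lock);				/* hypothetical */

	static void foo_vm_open(struct vm_area_struct *vma)
	{
		mutex_lock(&foo_lock);
		/* may wait for the OOM killer to make progress */
		vma->vm_private_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
		mutex_unlock(&foo_lock);
	}

	static void foo_vm_close(struct vm_area_struct *vma)
	{
		/* the OOM victim can block here behind the open() above */
		mutex_lock(&foo_lock);
		kfree(vma->vm_private_data);
		mutex_unlock(&foo_lock);
	}

If the OOM victim blocks in foo_vm_close() while another task's foo_vm_open()
allocation is waiting for the OOM killer to make progress, MMF_OOM_SKIP never
gets set and no further victim can be selected.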

I don't think that waiting for completion of the remove_vma() loop is safe.
My patch, on the other hand, is safe.

 drivers/android/binder.c                          |    2 +-
 drivers/gpu/drm/drm_gem_cma_helper.c              |    2 +-
 drivers/gpu/drm/drm_vm.c                          |    8 ++++----
 drivers/gpu/drm/gma500/framebuffer.c              |    2 +-
 drivers/gpu/drm/gma500/psb_drv.c                  |    2 +-
 drivers/gpu/drm/i915/i915_drv.c                   |    2 +-
 drivers/gpu/drm/ttm/ttm_bo_vm.c                   |    2 +-
 drivers/gpu/drm/udl/udl_drv.c                     |    2 +-
 drivers/gpu/drm/v3d/v3d_drv.c                     |    2 +-
 drivers/gpu/drm/vc4/vc4_drv.c                     |    2 +-
 drivers/gpu/drm/vgem/vgem_drv.c                   |    2 +-
 drivers/gpu/drm/vkms/vkms_drv.c                   |    2 +-
 drivers/gpu/drm/xen/xen_drm_front.c               |    2 +-
 drivers/hwtracing/intel_th/msu.c                  |    2 +-
 drivers/hwtracing/stm/core.c                      |    2 +-
 drivers/infiniband/core/uverbs_main.c             |    2 +-
 drivers/infiniband/sw/rdmavt/mmap.c               |    2 +-
 drivers/infiniband/sw/rxe/rxe_mmap.c              |    2 +-
 drivers/media/common/videobuf2/videobuf2-memops.c |    2 +-
 drivers/media/pci/meye/meye.c                     |    2 +-
 drivers/media/platform/omap/omap_vout.c           |    2 +-
 drivers/media/usb/stkwebcam/stk-webcam.c          |    2 +-
 drivers/media/v4l2-core/videobuf-dma-contig.c     |    2 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c         |    2 +-
 drivers/media/v4l2-core/videobuf-vmalloc.c        |    2 +-
 drivers/misc/genwqe/card_dev.c                    |    2 +-
 drivers/misc/mic/scif/scif_mmap.c                 |    2 +-
 drivers/misc/sgi-gru/grufile.c                    |    2 +-
 drivers/rapidio/devices/rio_mport_cdev.c          |    2 +-
 drivers/staging/comedi/comedi_fops.c              |    2 +-
 drivers/staging/media/zoran/zoran_driver.c        |    2 +-
 drivers/staging/vme/devices/vme_user.c            |    2 +-
 drivers/usb/core/devio.c                          |    2 +-
 drivers/usb/mon/mon_bin.c                         |    2 +-
 drivers/video/fbdev/omap2/omapfb/omapfb-main.c    |    2 +-
 drivers/xen/gntalloc.c                            |    2 +-
 drivers/xen/gntdev.c                              |    2 +-
 drivers/xen/privcmd-buf.c                         |    2 +-
 drivers/xen/privcmd.c                             |    2 +-
 fs/9p/vfs_file.c                                  |    2 +-
 fs/fuse/file.c                                    |    2 +-
 fs/kernfs/file.c                                  |    2 +-
 include/linux/mm.h                                |    2 +-
 ipc/shm.c                                         |    2 +-
 kernel/events/core.c                              |    2 +-
 kernel/relay.c                                    |    2 +-
 mm/hugetlb.c                                      |    2 +-
 mm/mmap.c                                         |   14 +++++++-------
 net/packet/af_packet.c                            |    2 +-
 sound/core/pcm_native.c                           |    4 ++--
 sound/usb/usx2y/us122l.c                          |    2 +-
 sound/usb/usx2y/usx2yhwdeppcm.c                   |    2 +-
 52 files changed, 62 insertions(+), 62 deletions(-)


* Re: [RFC PATCH v2 3/3] mm, oom: hand over MMF_OOM_SKIP to exit path if it is guranteed to finish
  2018-10-30  4:45   ` Tetsuo Handa
@ 2018-10-30  6:31     ` Michal Hocko
  2018-10-30  9:47       ` Tetsuo Handa
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2018-10-30  6:31 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, Roman Gushchin, David Rientjes, Andrew Morton, LKML

On Tue 30-10-18 13:45:22, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > @@ -3156,6 +3166,13 @@ void exit_mmap(struct mm_struct *mm)
> >                 vma = remove_vma(vma);
> >         }
> >         vm_unacct_memory(nr_accounted);
> > +
> > +       /*
> > +        * Now that the full address space is torn down, make sure the
> > +        * OOM killer skips over this task
> > +        */
> > +       if (oom)
> > +               set_bit(MMF_OOM_SKIP, &mm->flags);
> >  }
> > 
> >  /* Insert vm structure into process list sorted by address
> 
> I don't like setting MMF_OOF_SKIP after remove_vma() loop. 50 users might
> call vma->vm_ops->close() from remove_vma(). Some of them are doing fs
> writeback, some of them might be doing GFP_KERNEL allocation from
> vma->vm_ops->open() with a lock also held by vma->vm_ops->close().
> 
> I don't think that waiting for completion of remove_vma() loop is safe.

What do you mean by 'safe' here?
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH v2 3/3] mm, oom: hand over MMF_OOM_SKIP to exit path if it is guranteed to finish
  2018-10-30  6:31     ` Michal Hocko
@ 2018-10-30  9:47       ` Tetsuo Handa
  2018-10-30 11:39         ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2018-10-30  9:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Roman Gushchin, David Rientjes, Andrew Morton, LKML

On 2018/10/30 15:31, Michal Hocko wrote:
> On Tue 30-10-18 13:45:22, Tetsuo Handa wrote:
>> Michal Hocko wrote:
>>> @@ -3156,6 +3166,13 @@ void exit_mmap(struct mm_struct *mm)
>>>                 vma = remove_vma(vma);
>>>         }
>>>         vm_unacct_memory(nr_accounted);
>>> +
>>> +       /*
>>> +        * Now that the full address space is torn down, make sure the
>>> +        * OOM killer skips over this task
>>> +        */
>>> +       if (oom)
>>> +               set_bit(MMF_OOM_SKIP, &mm->flags);
>>>  }
>>>
>>>  /* Insert vm structure into process list sorted by address
>>
>> I don't like setting MMF_OOF_SKIP after remove_vma() loop. 50 users might
>> call vma->vm_ops->close() from remove_vma(). Some of them are doing fs
>> writeback, some of them might be doing GFP_KERNEL allocation from
>> vma->vm_ops->open() with a lock also held by vma->vm_ops->close().
>>
>> I don't think that waiting for completion of remove_vma() loop is safe.
> 
> What do you mean by 'safe' here?
> 

safe = "Does not cause OOM lockup."

remove_vma() is allowed to sleep, and some of its users might depend on a memory
allocation while the OOM killer is waiting for remove_vma() to complete.


* Re: [RFC PATCH v2 3/3] mm, oom: hand over MMF_OOM_SKIP to exit path if it is guranteed to finish
  2018-10-30  9:47       ` Tetsuo Handa
@ 2018-10-30 11:39         ` Michal Hocko
  2018-10-30 12:02           ` Tetsuo Handa
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2018-10-30 11:39 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, Roman Gushchin, David Rientjes, Andrew Morton, LKML

On Tue 30-10-18 18:47:43, Tetsuo Handa wrote:
> On 2018/10/30 15:31, Michal Hocko wrote:
> > On Tue 30-10-18 13:45:22, Tetsuo Handa wrote:
> >> Michal Hocko wrote:
> >>> @@ -3156,6 +3166,13 @@ void exit_mmap(struct mm_struct *mm)
> >>>                 vma = remove_vma(vma);
> >>>         }
> >>>         vm_unacct_memory(nr_accounted);
> >>> +
> >>> +       /*
> >>> +        * Now that the full address space is torn down, make sure the
> >>> +        * OOM killer skips over this task
> >>> +        */
> >>> +       if (oom)
> >>> +               set_bit(MMF_OOM_SKIP, &mm->flags);
> >>>  }
> >>>
> >>>  /* Insert vm structure into process list sorted by address
> >>
> >> I don't like setting MMF_OOF_SKIP after remove_vma() loop. 50 users might
> >> call vma->vm_ops->close() from remove_vma(). Some of them are doing fs
> >> writeback, some of them might be doing GFP_KERNEL allocation from
> >> vma->vm_ops->open() with a lock also held by vma->vm_ops->close().
> >>
> >> I don't think that waiting for completion of remove_vma() loop is safe.
> > 
> > What do you mean by 'safe' here?
> > 
> 
> safe = "Does not cause OOM lockup."
> 
> remove_vma() is allowed to sleep, and some users might depend on memory
> allocation when the OOM killer is waiting for remove_vma() to complete.

But MMF_OOM_SKIP is set after we are done with remove_vma. In fact it is
the very last thing in exit_mmap, so I do not follow what you mean.

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH v2 3/3] mm, oom: hand over MMF_OOM_SKIP to exit path if it is guranteed to finish
  2018-10-30 11:39         ` Michal Hocko
@ 2018-10-30 12:02           ` Tetsuo Handa
  2018-10-30 12:10             ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2018-10-30 12:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Roman Gushchin, David Rientjes, Andrew Morton, LKML

On 2018/10/30 20:39, Michal Hocko wrote:
> On Tue 30-10-18 18:47:43, Tetsuo Handa wrote:
>> On 2018/10/30 15:31, Michal Hocko wrote:
>>> On Tue 30-10-18 13:45:22, Tetsuo Handa wrote:
>>>> Michal Hocko wrote:
>>>>> @@ -3156,6 +3166,13 @@ void exit_mmap(struct mm_struct *mm)
>>>>>                 vma = remove_vma(vma);
>>>>>         }
>>>>>         vm_unacct_memory(nr_accounted);
>>>>> +
>>>>> +       /*
>>>>> +        * Now that the full address space is torn down, make sure the
>>>>> +        * OOM killer skips over this task
>>>>> +        */
>>>>> +       if (oom)
>>>>> +               set_bit(MMF_OOM_SKIP, &mm->flags);
>>>>>  }
>>>>>
>>>>>  /* Insert vm structure into process list sorted by address
>>>>
>>>> I don't like setting MMF_OOF_SKIP after remove_vma() loop. 50 users might
>>>> call vma->vm_ops->close() from remove_vma(). Some of them are doing fs
>>>> writeback, some of them might be doing GFP_KERNEL allocation from
>>>> vma->vm_ops->open() with a lock also held by vma->vm_ops->close().
>>>>
>>>> I don't think that waiting for completion of remove_vma() loop is safe.
>>>
>>> What do you mean by 'safe' here?
>>>
>>
>> safe = "Does not cause OOM lockup."
>>
>> remove_vma() is allowed to sleep, and some users might depend on memory
>> allocation when the OOM killer is waiting for remove_vma() to complete.
> 
> But MMF_OOF_SKIP is set after we are done with remove_vma. In fact it is
> the very last thing in exit_mmap. So I do not follow what you mean.
> 

So what? Think about the worst case. There is a quite obvious bug here.

What happens if the memory reclaimed up to __free_pgtables() is consumed by somebody
else, and then some vma->vm_ops->close() starts waiting for a memory allocation due to
such a dependency? It is called an "OOM lockup" because the OOM killer cannot make
progress: MMF_OOM_SKIP cannot be set because vma->vm_ops->close(), called from
remove_vma(), is itself waiting for the OOM killer due to the memory allocation
dependency in vma->vm_ops->close()...


* Re: [RFC PATCH v2 3/3] mm, oom: hand over MMF_OOM_SKIP to exit path if it is guranteed to finish
  2018-10-30 12:02           ` Tetsuo Handa
@ 2018-10-30 12:10             ` Michal Hocko
  2018-10-30 13:57               ` Tetsuo Handa
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2018-10-30 12:10 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, Roman Gushchin, David Rientjes, Andrew Morton, LKML

On Tue 30-10-18 21:02:40, Tetsuo Handa wrote:
> On 2018/10/30 20:39, Michal Hocko wrote:
> > On Tue 30-10-18 18:47:43, Tetsuo Handa wrote:
> >> On 2018/10/30 15:31, Michal Hocko wrote:
> >>> On Tue 30-10-18 13:45:22, Tetsuo Handa wrote:
> >>>> Michal Hocko wrote:
> >>>>> @@ -3156,6 +3166,13 @@ void exit_mmap(struct mm_struct *mm)
> >>>>>                 vma = remove_vma(vma);
> >>>>>         }
> >>>>>         vm_unacct_memory(nr_accounted);
> >>>>> +
> >>>>> +       /*
> >>>>> +        * Now that the full address space is torn down, make sure the
> >>>>> +        * OOM killer skips over this task
> >>>>> +        */
> >>>>> +       if (oom)
> >>>>> +               set_bit(MMF_OOM_SKIP, &mm->flags);
> >>>>>  }
> >>>>>
> >>>>>  /* Insert vm structure into process list sorted by address
> >>>>
> >>>> I don't like setting MMF_OOF_SKIP after remove_vma() loop. 50 users might
> >>>> call vma->vm_ops->close() from remove_vma(). Some of them are doing fs
> >>>> writeback, some of them might be doing GFP_KERNEL allocation from
> >>>> vma->vm_ops->open() with a lock also held by vma->vm_ops->close().
> >>>>
> >>>> I don't think that waiting for completion of remove_vma() loop is safe.
> >>>
> >>> What do you mean by 'safe' here?
> >>>
> >>
> >> safe = "Does not cause OOM lockup."
> >>
> >> remove_vma() is allowed to sleep, and some users might depend on memory
> >> allocation when the OOM killer is waiting for remove_vma() to complete.
> > 
> > But MMF_OOF_SKIP is set after we are done with remove_vma. In fact it is
> > the very last thing in exit_mmap. So I do not follow what you mean.
> > 
> 
> So what? Think the worst case. Quite obvious bug here.

I misunderstood your concern. The oom_reaper would back off without
MMF_OOM_SKIP as well. You are right that we cannot assume anything about
the close callbacks, so MMF_OOM_SKIP has to come before that. I will move it
to right after the page table freeing.
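
(For illustration only, presumably something along these lines in exit_mmap,
so that the flag is set before any ->close() callback can run:)

	__free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
	if (oom)
		set_bit(MMF_OOM_SKIP, &mm->flags);	/* before remove_vma() and its ->close() callbacks */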
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH v2 3/3] mm, oom: hand over MMF_OOM_SKIP to exit path if it is guranteed to finish
  2018-10-30 12:10             ` Michal Hocko
@ 2018-10-30 13:57               ` Tetsuo Handa
  2018-10-30 14:23                 ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2018-10-30 13:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Roman Gushchin, David Rientjes, Andrew Morton, LKML

On 2018/10/30 21:10, Michal Hocko wrote:
> I misunderstood your concern. oom_reaper would back off without
> MMF_OOF_SKIP as well. You are right we cannot assume anything about
> close callbacks so MMF_OOM_SKIP has to come before that. I will move it
> behind the pagetable freeing.
> 

And at that point, your patch can at best wait for only __free_pgtables(),
at the cost/risk of complicating exit_mmap() and arch-specific code. Also,
you are asking the wrong audience for comments. It is the arch maintainers who
need to precisely understand the OOM behavior / possibility of an OOM lockup,
and you must persuade them to accept restricting/complicating future changes in
their arch code due to your wish to allow the handover. Without "up-to-date
big fat comments on all relevant functions affected by your change" and
"acks from all arch maintainers", I'm sure that people will keep making
errors/mistakes/oversights.

My patch can wait for completion of (not only exit_mmap() but also) __mmput(),
by using a simple polling approach. My patch allows NOMMU kernels to avoid the
possibility of an OOM lockup by setting MMF_OOM_SKIP at __mmput() (and a future
patch will implement a timeout based back off for NOMMU kernels), and it allows
you to get rid of TIF_MEMDIE (which you recently added to your TODO list) by
getting rid of the conditional handling of oom_reserves_allowed() and ALLOC_OOM.

Your refusal of timeout based selection of the next OOM victim keeps everyone
unable to safely make forward progress. OOM handling is far too complicated, and
nobody can be free from errors/mistakes/oversights. Look at the reality!



* Re: [RFC PATCH v2 3/3] mm, oom: hand over MMF_OOM_SKIP to exit path if it is guranteed to finish
  2018-10-30 13:57               ` Tetsuo Handa
@ 2018-10-30 14:23                 ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2018-10-30 14:23 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, Roman Gushchin, David Rientjes, Andrew Morton, LKML

On Tue 30-10-18 22:57:37, Tetsuo Handa wrote:
> On 2018/10/30 21:10, Michal Hocko wrote:
> > I misunderstood your concern. oom_reaper would back off without
> > MMF_OOF_SKIP as well. You are right we cannot assume anything about
> > close callbacks so MMF_OOM_SKIP has to come before that. I will move it
> > behind the pagetable freeing.
> > 
> 
> And at that point, your patch can at best wait for only __free_pgtables(),

Yes, mostly on the grounds that oom victims are mostly sitting on mapped
memory and page tables. I can see how the last ->close() can release some
memory as well, but a) we do not consider that memory when selecting a
victim and b) it shouldn't be a large memory consumer on its own.

> at the cost/risk of complicating exit_mmap() and arch specific code. Also,
> you are asking for comments to wrong audiences. It is arch maintainers who
> need to precisely understand the OOM behavior / possibility of OOM lockup,
> and you must persuade them about restricting/complicating future changes in
> their arch code due to your wish to allow handover. Without "up-to-dated
> big fat comments to all relevant functions affected by your change" and
> "acks from all arch maintainers", I'm sure that people keep making
> errors/mistakes/overlooks.

Are you talking about arch_exit_mmap or which part of the arch code?

> My patch can wait for completion of (not only exit_mmap() but also) __mmput(),
> by using simple polling approach. My patch can allow NOMMU kernels to avoid
> possibility of OOM lockup by setting MMF_OOM_SKIP at __mmput() (and future
> patch will implement timeout based back off for NOMMU kernels), and allows you
> to get rid of TIF_MEMDIE (which you recently added to your TODO list) by getting
> rid of conditional handling of oom_reserves_allowed() and ALLOC_OOM.

OK, let's settle on a simple fact. I would like to discuss _this_
approach here. Bringing up _yours_ all the time is not very productive.
You might have noticed that I have posted this for discussion (hence the
RFC) and as such I would appreciate staying on the topic.

What the best approach is in the end is a matter of discussion, of
course. At this stage it is quite clear we can only agree to disagree on
which approach is better, and discussing the same set of points back and
forth is not going to get us anywhere. Therefore we will have to rely on the
maintainer to decide.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH v2 0/3] oom: rework oom_reaper vs. exit_mmap handoff
  2018-10-25  8:24 [RFC PATCH v2 0/3] oom: rework oom_reaper vs. exit_mmap handoff Michal Hocko
                   ` (2 preceding siblings ...)
  2018-10-25  8:24 ` [RFC PATCH v2 3/3] mm, oom: hand over MMF_OOM_SKIP to exit path if it is guranteed to finish Michal Hocko
@ 2018-11-08  9:32 ` Michal Hocko
  2018-11-14  9:46   ` Tetsuo Handa
  3 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2018-11-08  9:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: Tetsuo Handa, Roman Gushchin, linux-mm, Andrew Morton, LKML

On Thu 25-10-18 10:24:00, Michal Hocko wrote:
> The previous version of this RFC has been posted here [1]. I have fixed
> few issues spotted during the review and by 0day bot. I have also reworked
> patch 2 to be ratio rather than an absolute number based.
> 
> With this series applied the locking protocol between the oom_reaper and
> the exit path is as follows.
> 
> All parts which cannot race should use the exclusive lock on the exit
> path. Once the exit path has passed the moment when no blocking locks
> are taken then it clears mm->mmap under the exclusive lock. oom_reaper
> checks for this and sets MMF_OOM_SKIP only if the exit path is not guaranteed
> to finish the job. This is patch 3 so see the changelog for all the details.
> 
> I would really appreciate if David could give this a try and see how
> this behaves in workloads where the oom_reaper falls flat now. I have
> been playing with sparsely allocated memory with a high pte/real memory
> ratio and large mlocked processes and it worked reasonably well.

Does this help the workloads you were referring to earlier, David?

> There is still some room for tuning here of course. We can change the
> number of retries for the oom_reaper as well as the threshold when the
> keep retrying.
> 
> Michal Hocko (3):
>       mm, oom: rework mmap_exit vs. oom_reaper synchronization
>       mm, oom: keep retrying the oom_reap operation as long as there is substantial memory left
>       mm, oom: hand over MMF_OOM_SKIP to exit path if it is guranteed to finish
> 
> Diffstat:
>  include/linux/oom.h |  2 --
>  mm/internal.h       |  3 +++
>  mm/memory.c         | 28 ++++++++++++++--------
>  mm/mmap.c           | 69 +++++++++++++++++++++++++++++++++--------------------
>  mm/oom_kill.c       | 45 ++++++++++++++++++++++++----------
>  5 files changed, 97 insertions(+), 50 deletions(-)
> 
> [1] http://lkml.kernel.org/r/20180910125513.311-1-mhocko@kernel.org
> 

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH v2 0/3] oom: rework oom_reaper vs. exit_mmap handoff
  2018-11-08  9:32 ` [RFC PATCH v2 0/3] oom: rework oom_reaper vs. exit_mmap handoff Michal Hocko
@ 2018-11-14  9:46   ` Tetsuo Handa
  2018-11-14 10:16     ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2018-11-14  9:46 UTC (permalink / raw)
  To: Michal Hocko, David Rientjes
  Cc: Roman Gushchin, linux-mm, Andrew Morton, LKML

On 2018/11/08 18:32, Michal Hocko wrote:
> On Thu 25-10-18 10:24:00, Michal Hocko wrote:
>> The previous version of this RFC has been posted here [1]. I have fixed
>> few issues spotted during the review and by 0day bot. I have also reworked
>> patch 2 to be ratio rather than an absolute number based.
>>
>> With this series applied the locking protocol between the oom_reaper and
>> the exit path is as follows.
>>
>> All parts which cannot race should use the exclusive lock on the exit
>> path. Once the exit path has passed the moment when no blocking locks
>> are taken then it clears mm->mmap under the exclusive lock. oom_reaper
>> checks for this and sets MMF_OOM_SKIP only if the exit path is not guaranteed
>> to finish the job. This is patch 3 so see the changelog for all the details.
>>
>> I would really appreciate if David could give this a try and see how
>> this behaves in workloads where the oom_reaper falls flat now. I have
>> been playing with sparsely allocated memory with a high pte/real memory
>> ratio and large mlocked processes and it worked reasonably well.
> 
> Does this help workloads you were referring to earlier David?

I still refuse this patch, because it allows OOM lockups in the exit path
triggered by CPU starvation.



First, try the stressor shown below

----------------------------------------
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
#include <sched.h>
#include <sys/wait.h>

int main(int argc, char *argv[])
{
	cpu_set_t cpu = { { 1 } };	/* CPU 0 only */
	struct sched_param sp = { 99 };	/* highest SCHED_FIFO priority */
	FILE *fp;
	int i;
	const unsigned long size = 1048576 * 200;	/* 200MB */
	char *buf = malloc(size);
	/* create a memcg limited to half of the buffer size */
	mkdir("/sys/fs/cgroup/memory/test1", 0755);
	fp = fopen("/sys/fs/cgroup/memory/test1/memory.limit_in_bytes", "w");
	fprintf(fp, "%lu\n", size / 2);
	fclose(fp);
	fp = fopen("/sys/fs/cgroup/memory/test1/tasks", "w");
	fprintf(fp, "%u\n", getpid());
	fclose(fp);
	/* make this task (and its children) the preferred OOM victims */
	fp = fopen("/proc/self/oom_score_adj", "w");
	fprintf(fp, "1000\n");
	fclose(fp);
	/* pin everything to one CPU and run the parent as a high priority RT task */
	sched_setaffinity(0, sizeof(cpu), &cpu);
	sched_setscheduler(0, SCHED_FIFO, &sp);
	nice(-20);
	sp.sched_priority = 0;
	/* fork 64 children which fault in the buffer and hit the memcg limit;
	   the first 32 run at the lowest possible priority */
	for (i = 0; i < 64; i++)
		if (fork() == 0) {
			if (i < 32) {
				sched_setscheduler(0, SCHED_IDLE, &sp);
				nice(19);
			}
			memset(buf, 0, size);
			_exit(0);
		}
	while (wait(NULL) > 0);
	return 0;
}
----------------------------------------

together with the small printk() debugging patch shown below.

----------------------------------------
 kernel/locking/lockdep.c | 6 ------
 mm/oom_kill.c            | 9 +++++++--
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 1efada2..822b412 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -564,12 +564,6 @@ static void lockdep_print_held_locks(struct task_struct *p)
 	else
 		printk("%d lock%s held by %s/%d:\n", depth,
 		       depth > 1 ? "s" : "", p->comm, task_pid_nr(p));
-	/*
-	 * It's not reliable to print a task's held locks if it's not sleeping
-	 * and it's not the current task.
-	 */
-	if (p->state == TASK_RUNNING && p != current)
-		return;
 	for (i = 0; i < depth; i++) {
 		printk(" #%d: ", i);
 		print_lock(p->held_locks + i);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 503d24d..b9e5a61 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -613,8 +613,11 @@ static void oom_reap_task(struct task_struct *tsk)
 	    test_bit(MMF_OOM_SKIP, &mm->flags))
 		goto done;
 
-	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-		task_pid_nr(tsk), tsk->comm);
+	pr_info("oom_reaper: unable to reap pid:%d (%s), anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
+		task_pid_nr(tsk), tsk->comm,
+		K(get_mm_counter(mm, MM_ANONPAGES)),
+		K(get_mm_counter(mm, MM_FILEPAGES)),
+		K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	debug_show_all_locks();
 
 done:
@@ -628,6 +631,8 @@ static void oom_reap_task(struct task_struct *tsk)
 	 */
 	if (mm->mmap)
 		set_bit(MMF_OOM_SKIP, &mm->flags);
+	else if (!test_bit(MMF_OOM_SKIP, &mm->flags))
+		pr_info("Handed over %u to exit path.\n", task_pid_nr(tsk));
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
----------------------------------------

You will find that the "oom_reaper: unable to reap" message is trivially printed (and
MMF_OOM_SKIP is set) when the OOM victim is in the middle of a sched_setscheduler()
request in the stressor shown above.

  1 lock held by a.out/1331:
   #0: 00000000065062b5 (rcu_read_lock){....}, at: do_sched_setscheduler+0x54/0x190

This is because

	if (oom_badness_pages(mm) > (original_badness >> 2))
		ret = false;

made the OOM reaper print the "oom_reaper: unable to reap" message even after reaping
succeeded, because the OOM victim was consuming little memory to begin with. Yes, we
could suppress the "oom_reaper: unable to reap" message if reaping succeeded.



Next, let's consider the lucky cases where the "Handed over" message is printed.
Remove the

	if (i < 32) {
		sched_setscheduler(0, SCHED_IDLE, &sp);
		nice(19);
	}

lines from the stressor, and retry with the patch shown below, which does a
similar thing from the exit path.

----------------------------------------
 kernel/sys.c | 14 ++++++++++++++
 mm/mmap.c    |  2 ++
 2 files changed, 16 insertions(+)

diff --git a/kernel/sys.c b/kernel/sys.c
index 123bd73..a1798bb 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -190,6 +190,20 @@ static int set_one_prio(struct task_struct *p, int niceval, int error)
 	return error;
 }
 
+struct sched_param {
+	int sched_priority;
+};
+
+void my_setpriority(void)
+{
+	struct sched_param sp = { };
+
+	sched_setscheduler(current, SCHED_IDLE, &sp);
+	rcu_read_lock();
+	set_one_prio(current, MAX_NICE, -ESRCH);
+	rcu_read_unlock();
+}
+
 SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
 {
 	struct task_struct *g, *p;
diff --git a/mm/mmap.c b/mm/mmap.c
index 9063fdc..516cfc9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3140,6 +3140,7 @@ void exit_mmap(struct mm_struct *mm)
 	 * tear down when it is safe to do so
 	 */
 	if (oom) {
+		extern void my_setpriority(void);
 		down_write(&mm->mmap_sem);
 		__unlink_vmas(vma);
 		/*
@@ -3149,6 +3150,7 @@ void exit_mmap(struct mm_struct *mm)
 		 */
 		mm->mmap = NULL;
 		up_write(&mm->mmap_sem);
+		my_setpriority();
 		__free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	} else {
 		free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
----------------------------------------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20181114-1.txt.xz .
----------------------------------------
[   43.961735] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[   43.964369] kmem: usage 6528kB, limit 9007199254740988kB, failcnt 0
[   43.967410] Memory cgroup stats for /test1: cache:0KB rss:95832KB rss_huge:94208KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:95876KB inactive_file:0KB active_file:0KB unevictable:0KB
[   43.976150] Memory cgroup out of memory: Kill process 1079 (a.out) score 990 or sacrifice child
[   43.980391] Killed process 1083 (a.out) total-vm:209156kB, anon-rss:96kB, file-rss:0kB, shmem-rss:0kB
[   44.090703] Handed over 1083 to exit path.
[   76.209031] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 33s!
[   76.212351] Showing busy workqueues and worker pools:
[   76.215308] workqueue events: flags=0x0
[   76.217832]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
[   76.217860]     pending: vmstat_shepherd, vmpressure_work_fn, vmw_fb_dirty_flush [vmwgfx]
[   76.217878] workqueue events_highpri: flags=0x10
[   76.217882]   pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=1/256
[   76.217887]     pending: flush_backlog BAR(436)
[   76.217941] workqueue mm_percpu_wq: flags=0x8
[   76.218004]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[   76.218007]     pending: vmstat_update
[   76.218011] workqueue netns: flags=0xe000a
[   76.218039]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=1/1
[   76.218068]     in-flight: 436:cleanup_net
[   76.218081] workqueue memcg_kmem_cache: flags=0x0
[   76.218085]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
[   76.218088]     pending: memcg_kmem_cache_create_func
[   76.218091]     delayed: memcg_kmem_cache_create_func,(...snipped...), memcg_kmem_cache_create_func
[   77.511224] pool 16: cpus=0-7 flags=0x4 nice=0 hung=4s workers=32 idle: 435 437 434 433 432 431 430 429 428 427 426 425 424 423 422 421 420 419 417 418 416 414 413 412 143 411 75 7 145 439 438
[  141.216183] rcu: INFO: rcu_preempt self-detected stall on CPU
[  141.219379] rcu:     0-....: (66161 ticks this GP) idle=11e/1/0x4000000000000002 softirq=3290/3292 fqs=15874
[  141.223795] rcu:      (t=65000 jiffies g=10501 q=1115)
[  141.226815] NMI backtrace for cpu 0
[  141.229437] CPU: 0 PID: 1084 Comm: a.out Not tainted 4.20.0-rc2+ #778
[  141.232891] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  141.237363] Call Trace:
(...snipped...)
[  141.313476]  ? mem_cgroup_iter+0x156/0x7c0
[  141.315721]  mem_cgroup_iter+0x17e/0x7c0
[  141.318885]  ? mem_cgroup_iter+0x156/0x7c0
[  141.321488]  snapshot_refaults+0x5c/0x90
[  141.323609]  do_try_to_free_pages+0x2dd/0x3b0
[  141.326132]  try_to_free_mem_cgroup_pages+0x10d/0x390
[  141.329223]  try_charge+0x28b/0x830
[  141.331761]  memcg_kmem_charge_memcg+0x35/0x90
[  141.334568]  ? get_mem_cgroup_from_mm+0x239/0x2e0
[  141.336693]  memcg_kmem_charge+0x85/0x260
[  141.338605]  __alloc_pages_nodemask+0x218/0x360
[  141.340693]  pte_alloc_one+0x16/0xb0
[  141.342394]  __handle_mm_fault+0x873/0x1580
[  141.344355]  handle_mm_fault+0x1b2/0x3a0
[  141.346048]  ? handle_mm_fault+0x47/0x3a0
[  141.348501]  __do_page_fault+0x28c/0x530
[  141.350389]  do_page_fault+0x28/0x260
[  141.352133]  ? page_fault+0x8/0x30
[  141.353785]  page_fault+0x1e/0x30
[  141.355602] RIP: 0033:0x7f4bc7c92682
[  141.357336] Code: Bad RIP value.
[  141.358787] RSP: 002b:00007ffe682fc4a0 EFLAGS: 00010246
[  141.361110] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f4bc7c92682
[  141.363863] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[  141.366464] RBP: 00007ffe682fc4c0 R08: 00007f4bc81a2740 R09: 0000000000000000
[  141.369074] R10: 00007f4bc81a2a10 R11: 0000000000000246 R12: 0000000000000000
[  141.371704] R13: 00007ffe682fc650 R14: 0000000000000000 R15: 0000000000000000
[  248.241504] INFO: task kworker/u16:28:436 blocked for more than 120 seconds.
[  248.244263]       Not tainted 4.20.0-rc2+ #778
[  248.246033] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  248.248718] kworker/u16:28  D10872   436      2 0x80000000
[  248.250768] Workqueue: netns cleanup_net
(...snipped...)
[  290.559855] a.out           R  running task    12792  1083   1079 0x80100084
[  290.562480] Call Trace:
[  290.563633]  __schedule+0x246/0x990
[  290.565031]  ? _raw_spin_unlock_irqrestore+0x55/0x70
[  290.566873]  preempt_schedule_common+0x34/0x54
[  290.568782]  preempt_schedule+0x1f/0x30
[  290.570896]  ___preempt_schedule+0x16/0x18
[  290.573639]  __sched_setscheduler+0x722/0x7c0
[  290.576363]  _sched_setscheduler+0x70/0x90
[  290.578785]  sched_setscheduler+0xe/0x10
[  290.580403]  my_setpriority+0x35/0x130
[  290.582277]  exit_mmap+0x1b8/0x1e0
[  290.583990]  mmput+0x63/0x130
[  290.585624]  do_exit+0x29d/0xcf0
[  290.586963]  ? _raw_spin_unlock_irq+0x27/0x50
[  290.588673]  ? __this_cpu_preempt_check+0x13/0x20
[  290.590600]  do_group_exit+0x47/0xc0
[  290.592098]  get_signal+0x329/0x920
[  290.594028]  do_signal+0x32/0x6e0
[  290.595620]  ? exit_to_usermode_loop+0x26/0x95
[  290.597268]  ? prepare_exit_to_usermode+0xa8/0xd0
[  290.599154]  exit_to_usermode_loop+0x3e/0x95
[  290.600794]  prepare_exit_to_usermode+0xa8/0xd0
[  290.602524]  ? page_fault+0x8/0x30
[  290.603928]  retint_user+0x8/0x18
(...snipped...)
[  336.219361] rcu: INFO: rcu_preempt self-detected stall on CPU
[  336.221340] rcu:     0-....: (259718 ticks this GP) idle=11e/1/0x4000000000000002 softirq=3290/3292 fqs=63254
[  336.224387] rcu:      (t=260003 jiffies g=10501 q=26578)
[  336.226119] NMI backtrace for cpu 0
[  336.227544] CPU: 0 PID: 1084 Comm: a.out Not tainted 4.20.0-rc2+ #778
[  336.229729] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  336.233056] Call Trace:
[  336.234645]  <IRQ>
[  336.235696]  dump_stack+0x67/0x95
[  336.226119] NMI backtrace for cpu 0
[  336.227544] CPU: 0 PID: 1084 Comm: a.out Not tainted 4.20.0-rc2+ #778
[  336.229729] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  336.233056] Call Trace:
(...snipped...)
[  336.287781]  shrink_node_memcg+0xa6/0x430
[  336.289621]  ? mem_cgroup_iter+0x21d/0x7c0
[  336.291382]  ? mem_cgroup_iter+0x156/0x7c0
[  336.293093]  shrink_node+0xd1/0x450
[  336.294734]  do_try_to_free_pages+0x103/0x3b0
[  336.296729]  try_to_free_mem_cgroup_pages+0x10d/0x390
[  336.298955]  try_charge+0x28b/0x830
[  336.300523]  memcg_kmem_charge_memcg+0x35/0x90
[  336.302258]  ? get_mem_cgroup_from_mm+0x239/0x2e0
[  336.304054]  memcg_kmem_charge+0x85/0x260
[  336.305688]  __alloc_pages_nodemask+0x218/0x360
[  336.307404]  pte_alloc_one+0x16/0xb0
[  336.308900]  __handle_mm_fault+0x873/0x1580
[  336.310577]  handle_mm_fault+0x1b2/0x3a0
[  336.312123]  ? handle_mm_fault+0x47/0x3a0
[  336.313768]  __do_page_fault+0x28c/0x530
[  336.315351]  do_page_fault+0x28/0x260
[  336.316870]  ? page_fault+0x8/0x30
[  336.318251]  page_fault+0x1e/0x30
----------------------------------------

Since the OOM victim thread is still reachable from the task list at this point,
it is possible that somebody changes the scheduling priority of the OOM victim.



You might think that it is unfair to change the scheduling priority from the exit path.
Then, retry with the patch shown below, which does a similar thing from outside
of the exit path (from the OOM reaper kernel thread) instead of the patch shown above.

----------------------------------------
 kernel/sys.c  | 14 ++++++++++++++
 mm/oom_kill.c |  5 ++++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 123bd73..acf24f7 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -190,6 +190,20 @@ static int set_one_prio(struct task_struct *p, int niceval, int error)
 	return error;
 }
 
+struct sched_param {
+	int sched_priority;
+};
+
+void my_setpriority(struct task_struct *p)
+{
+	struct sched_param sp = { };
+
+	sched_setscheduler(p, SCHED_IDLE, &sp);
+	rcu_read_lock();
+	set_one_prio(p, MAX_NICE, -ESRCH);
+	rcu_read_unlock();
+}
+
 SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
 {
 	struct task_struct *g, *p;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b9e5a61..5841522 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -631,8 +631,11 @@ static void oom_reap_task(struct task_struct *tsk)
 	 */
 	if (mm->mmap)
 		set_bit(MMF_OOM_SKIP, &mm->flags);
-	else if (!test_bit(MMF_OOM_SKIP, &mm->flags))
+	else if (!test_bit(MMF_OOM_SKIP, &mm->flags)) {
+		extern void my_setpriority(struct task_struct *p);
+		my_setpriority(tsk);
 		pr_info("Handed over %u to exit path.\n", task_pid_nr(tsk));
+	}
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
----------------------------------------

The complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20181114-2.txt.xz .
----------------------------------------
[   63.713061] Memory cgroup out of memory: Kill process 1236 (a.out) score 994 or sacrifice child
[   63.716303] Killed process 1236 (a.out) total-vm:209156kB, anon-rss:1344kB, file-rss:440kB, shmem-rss:0kB
[   63.720719] Handed over 1236 to exit path.
[  121.588872] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 62s!
[  121.591898] Showing busy workqueues and worker pools:
[  121.594011] workqueue events: flags=0x0
[  121.596230]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=5/256
[  121.596258]     pending: vmstat_shepherd, vmpressure_work_fn, vmw_fb_dirty_flush [vmwgfx], free_work, check_corruption
[  121.596331] workqueue events_power_efficient: flags=0x80
[  121.596336]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[  121.596339]     pending: neigh_periodic_work, neigh_periodic_work
[  121.596347] workqueue mm_percpu_wq: flags=0x8
[  121.596350]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  121.596353]     pending: vmstat_update
[  186.594586] rcu: INFO: rcu_preempt self-detected stall on CPU
[  186.597329] rcu:     0-....: (67217 ticks this GP) idle=55e/1/0x4000000000000002 softirq=4231/4233 fqs=15984
[  186.601230] rcu:      (t=65000 jiffies g=9133 q=905)
[  186.603531] NMI backtrace for cpu 0
[  186.605611] CPU: 0 PID: 1237 Comm: a.out Not tainted 4.20.0-rc2+ #779
[  186.608291] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  186.612320] Call Trace:
(...snipped...)
[  186.674690]  mem_cgroup_iter+0x21d/0x7c0
[  186.676609]  ? mem_cgroup_iter+0x156/0x7c0
[  186.678443]  shrink_node+0xa9/0x450
[  186.680183]  do_try_to_free_pages+0x103/0x3b0
[  186.682154]  try_to_free_mem_cgroup_pages+0x10d/0x390
[  186.684358]  try_charge+0x28b/0x830
[  186.686018]  mem_cgroup_try_charge+0x42/0x1d0
[  186.687931]  mem_cgroup_try_charge_delay+0x11/0x30
[  186.689889]  __handle_mm_fault+0xa88/0x1580
[  186.691851]  handle_mm_fault+0x1b2/0x3a0
[  186.693686]  ? handle_mm_fault+0x47/0x3a0
[  186.695662]  __do_page_fault+0x28c/0x530
[  186.697859]  do_page_fault+0x28/0x260
[  186.699598]  ? page_fault+0x8/0x30
[  186.701223]  page_fault+0x1e/0x30
(...snipped...)
[  381.597229] rcu: INFO: rcu_preempt self-detected stall on CPU
[  381.599916] rcu:     0-....: (261318 ticks this GP) idle=55e/1/0x4000000000000002 softirq=4231/4233 fqs=63272
[  381.603762] rcu:      (t=260003 jiffies g=9133 q=4712)
[  381.605823] NMI backtrace for cpu 0
[  381.607551] CPU: 0 PID: 1237 Comm: a.out Not tainted 4.20.0-rc2+ #779
[  381.610201] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  381.614034] Call Trace:
(...snipped...)
[  381.672523]  ? mem_cgroup_iter+0x579/0x7c0
[  381.674461]  mem_cgroup_iter+0x5a6/0x7c0
[  381.676437]  ? mem_cgroup_iter+0x579/0x7c0
[  381.678375]  shrink_node+0x125/0x450
[  381.680013]  do_try_to_free_pages+0x103/0x3b0
[  381.682017]  try_to_free_mem_cgroup_pages+0x10d/0x390
[  381.684460]  try_charge+0x28b/0x830
[  381.686290]  mem_cgroup_try_charge+0x42/0x1d0
[  381.688454]  mem_cgroup_try_charge_delay+0x11/0x30
[  381.690929]  __handle_mm_fault+0xa88/0x1580
[  381.692860]  handle_mm_fault+0x1b2/0x3a0
[  381.694599]  ? handle_mm_fault+0x47/0x3a0
[  381.696269]  __do_page_fault+0x28c/0x530
[  381.697970]  do_page_fault+0x28/0x260
[  381.699650]  ? page_fault+0x8/0x30
[  381.701160]  page_fault+0x1e/0x30
(...snipped...)
[  446.166354] a.out           R  running task    13560  1236   1212 0x80100084
[  446.169179] Call Trace:
[  446.170469]  __schedule+0x246/0x990
[  446.172061]  preempt_schedule_common+0x34/0x54
[  446.173849]  preempt_schedule+0x1f/0x30
[  446.175514]  ___preempt_schedule+0x16/0x18
[  446.177187]  _raw_spin_unlock_irqrestore+0x5e/0x70
[  446.179277]  __debug_check_no_obj_freed+0x10b/0x1c0
[  446.181315]  ? kmem_cache_free+0x7f/0x2a0
[  446.183000]  ? vm_area_free+0x13/0x20
[  446.184646]  ? vm_area_free+0x13/0x20
[  446.186144]  debug_check_no_obj_freed+0x14/0x16
[  446.187950]  kmem_cache_free+0x15d/0x2a0
[  446.189556]  vm_area_free+0x13/0x20
[  446.190968]  remove_vma+0x54/0x60
[  446.192498]  exit_mmap+0x13d/0x1d0
[  446.193941]  mmput+0x63/0x130
[  446.195628]  do_exit+0x29d/0xcf0
[  446.197013]  ? _raw_spin_unlock_irq+0x27/0x50
[  446.198981]  ? __this_cpu_preempt_check+0x13/0x20
[  446.201033]  do_group_exit+0x47/0xc0
[  446.202715]  get_signal+0x329/0x920
[  446.204375]  do_signal+0x32/0x6e0
[  446.205731]  ? exit_to_usermode_loop+0x26/0x95
[  446.207487]  ? prepare_exit_to_usermode+0xa8/0xd0
[  446.209288]  exit_to_usermode_loop+0x3e/0x95
[  446.210924]  prepare_exit_to_usermode+0xa8/0xd0
[  446.212968]  ? page_fault+0x8/0x30
[  446.214627]  retint_user+0x8/0x18
(...snipped...)
[  446.234391] a.out           R  running task    13680  1237   1212 0x80000080
[  446.236896] Call Trace:
[  446.238079]  ? mem_cgroup_iter+0x156/0x7c0
[  446.239737]  ? find_held_lock+0x44/0xb0
[  446.241264]  ? debug_smp_processor_id+0x17/0x20
[  446.243088]  ? css_next_descendant_pre+0x45/0xb0
[  446.245025]  ? delayacct_end+0x1e/0x50
[  446.246640]  ? mem_cgroup_iter+0x2d2/0x7c0
[  446.248231]  ? shrink_node+0x125/0x450
[  446.249815]  ? do_try_to_free_pages+0x103/0x3b0
[  446.251724]  ? try_to_free_mem_cgroup_pages+0x10d/0x390
[  446.253703]  ? try_charge+0x28b/0x830
[  446.255270]  ? mem_cgroup_try_charge+0x42/0x1d0
[  446.257133]  ? mem_cgroup_try_charge_delay+0x11/0x30
[  446.259433]  ? __handle_mm_fault+0xa88/0x1580
[  446.261252]  ? handle_mm_fault+0x1b2/0x3a0
[  446.263071]  ? handle_mm_fault+0x47/0x3a0
[  446.264739]  ? __do_page_fault+0x28c/0x530
[  446.266376]  ? do_page_fault+0x28/0x260
[  446.267936]  ? page_fault+0x8/0x30
[  446.269478]  ? page_fault+0x1e/0x30
(...snipped...)
[  583.605227] a.out           R  running task    13560  1236   1212 0x80100084
[  583.608637] Call Trace:
[  583.609968]  __schedule+0x246/0x990
[  583.611682]  preempt_schedule_common+0x34/0x54
[  583.613472]  preempt_schedule+0x1f/0x30
[  583.615140]  ___preempt_schedule+0x16/0x18
[  583.616803]  _raw_spin_unlock_irqrestore+0x5e/0x70
[  583.618908]  __debug_check_no_obj_freed+0x10b/0x1c0
[  583.620844]  ? kmem_cache_free+0x7f/0x2a0
[  583.622691]  ? vm_area_free+0x13/0x20
[  583.624268]  ? vm_area_free+0x13/0x20
[  583.625765]  debug_check_no_obj_freed+0x14/0x16
[  583.627572]  kmem_cache_free+0x15d/0x2a0
[  583.629095]  vm_area_free+0x13/0x20
[  583.630631]  remove_vma+0x54/0x60
[  583.632533]  exit_mmap+0x13d/0x1d0
[  583.633993]  mmput+0x63/0x130
[  583.635500]  do_exit+0x29d/0xcf0
[  583.637010]  ? _raw_spin_unlock_irq+0x27/0x50
[  583.638776]  ? __this_cpu_preempt_check+0x13/0x20
[  583.640580]  do_group_exit+0x47/0xc0
[  583.642026]  get_signal+0x329/0x920
[  583.643908]  do_signal+0x32/0x6e0
[  583.645447]  ? exit_to_usermode_loop+0x26/0x95
[  583.647548]  ? prepare_exit_to_usermode+0xa8/0xd0
[  583.649457]  exit_to_usermode_loop+0x3e/0x95
[  583.651466]  prepare_exit_to_usermode+0xa8/0xd0
[  583.653244]  ? page_fault+0x8/0x30
[  583.654681]  retint_user+0x8/0x18
(...snipped...)
[  583.674642] a.out           R  running task    13680  1237   1212 0x80000080
[  583.677898] Call Trace:
[  583.679677]  ? find_held_lock+0x44/0xb0
[  583.681522]  ? lock_acquire+0xd3/0x210
[  583.685214]  ? rcu_is_watching+0x11/0x50
[  583.687452]  ? vmpressure+0x100/0x110
[  583.689069]  ? mem_cgroup_iter+0x5a6/0x7c0
[  583.690787]  ? lockdep_hardirqs_on+0xe5/0x1c0
[  583.692476]  ? shrink_node+0x125/0x450
[  583.694018]  ? css_task_iter_next+0x2b/0x70
[  583.695883]  ? do_try_to_free_pages+0x2fe/0x3b0
[  583.697702]  ? try_to_free_mem_cgroup_pages+0x10d/0x390
[  583.699611]  ? cgroup_file_notify+0x16/0x70
[  583.701252]  ? try_charge+0x28b/0x830
[  583.702954]  ? mem_cgroup_try_charge+0x42/0x1d0
[  583.704958]  ? mem_cgroup_try_charge_delay+0x11/0x30
[  583.706993]  ? __handle_mm_fault+0xa88/0x1580
[  583.708833]  ? handle_mm_fault+0x1b2/0x3a0
[  583.710502]  ? handle_mm_fault+0x47/0x3a0
[  583.712154]  ? __do_page_fault+0x28c/0x530
[  583.713806]  ? do_page_fault+0x28/0x260
[  583.715439]  ? page_fault+0x8/0x30
[  583.716973]  ? page_fault+0x1e/0x30
----------------------------------------

Since the allocating threads only call cond_resched(), and cond_resched() is
effectively a no-op for realtime-priority threads, allocating threads with realtime
priority will wait forever for the OOM victim running with idle priority.
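
Roughly speaking, on a kernel built without CONFIG_PREEMPT, cond_resched() boils
down to the sketch below (a simplified paraphrase, not the literal kernel code).
Nothing ever sets TIF_NEED_RESCHED on the spinning realtime thread just because
a SCHED_IDLE victim is runnable on the same CPU, so the branch that would yield
the CPU is never taken:

----------------------------------------
/* Simplified sketch of cond_resched() on a !CONFIG_PREEMPT kernel. */
int cond_resched_sketch(void)
{
	/*
	 * should_resched() is only true when TIF_NEED_RESCHED is set on the
	 * current task and we are not inside a preempt-disabled section.
	 * A runnable SCHED_IDLE victim never makes the scheduler set
	 * TIF_NEED_RESCHED on a realtime task, so for the realtime allocator
	 * this is always false and nothing yields the CPU.
	 */
	if (should_resched(0)) {
		preempt_schedule_common();
		return 1;
	}
	return 0;
}
----------------------------------------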



There is always an invisible lock called "scheduling priority". You can't
leave the MMF_OOM_SKIP to the exit path. Your approach is not ready for
handling the worst case.

Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v2 0/3] oom: rework oom_reaper vs. exit_mmap handoff
  2018-11-14  9:46   ` Tetsuo Handa
@ 2018-11-14 10:16     ` Michal Hocko
  2018-11-15  9:54       ` Tetsuo Handa
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2018-11-14 10:16 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: David Rientjes, Roman Gushchin, linux-mm, Andrew Morton, LKML

On Wed 14-11-18 18:46:13, Tetsuo Handa wrote:
[...]
> There is always an invisible lock called "scheduling priority". You can't
> leave the MMF_OOM_SKIP to the exit path. Your approach is not ready for
> handling the worst case.

And that problem is all over the memory reclaim. You can get starved
to death and block other resources. And the memory reclaim is not the
only one. This is a fundamental issue of the locking without priority
inheritance and other real time techniques.

> Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

OK, if your whole point is to block any changes into this area unless
it is your solution to be merged then I guess I will just push patches
through with your explicit Nack on them. Your feedback was far fetched
many times and has distracted the discussion way too often. This is
especially sad because your testing and review was really helpful at
times. I do not really have energy to argue the same set of arguments
over and over again.

You have expressed unwillingness to understand the overall
picture several times. You do not care about a long term maintenance
burden of this code which is quite tricky already and refuse to
understand the cost/benefit part.

If this series works for the workload reported by David I will simply
push it through and let Andrew decide. If there is a lack of feedback
I will just keep it around because it seems that most users do not care
about these corner cases anyway.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v2 0/3] oom: rework oom_reaper vs. exit_mmap handoff
  2018-11-14 10:16     ` Michal Hocko
@ 2018-11-15  9:54       ` Tetsuo Handa
  2018-11-15 11:36         ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2018-11-15  9:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, Roman Gushchin, linux-mm, Andrew Morton, LKML,
	Linus Torvalds

On 2018/11/14 19:16, Michal Hocko wrote:
> On Wed 14-11-18 18:46:13, Tetsuo Handa wrote:
> [...]
> > There is always an invisible lock called "scheduling priority". You can't
> > leave the MMF_OOM_SKIP to the exit path. Your approach is not ready for
> > handling the worst case.
> 
> And that problem is all over the memory reclaim. You can get starved
> to death and block other resources. And the memory reclaim is not the
> only one.

I think that it is a matter of good manners among kernel developers that no
thread keeps consuming CPU resources forever. In the kernel world, doing

  while (1);

is not permitted. Likewise, doing

  for (i = 0; i < very_large_value; i++)
      do_something_which_does_not_yield_CPU_to_others();

has to be avoided, in order to avoid lockup problems. We are required to
yield CPU to others when we are waiting for somebody else to make progress.
It is the page allocator who is refusing to yield CPU to those who need CPU.

Since the OOM reaper kernel thread "has normal priority" and "can run on any
CPU", the possibility of failing to run is lower than an OOM victim thread
which "has idle priority" and "can run on only limited CPU". You are trying
to add a dependency on such thread, and I'm saying that adding a dependency
on such thread increases possibility of lockup.

Yes, even the OOM reaper kernel thread might fail to run if all CPUs were
busy with realtime threads waiting for the OOM reaper kernel thread to make
progress. In that case, we had better stop relying on asynchronous memory
reclaim, and switch to direct OOM reaping by allocating threads.
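
A rough sketch of what I mean by direct OOM reaping is shown below. The
victim_mm lookup and the call site are hypothetical; only __oom_reap_task_mm()
and the trylock pattern exist in the current tree:

----------
/*
 * Hypothetical sketch: let an allocating thread reap the victim's mm itself
 * instead of sleeping until the oom_reaper kthread gets CPU time. victim_mm
 * is assumed to be a pinned reference to the current OOM victim's mm that
 * the allocation slowpath can look up; no such plumbing exists today.
 */
static void direct_oom_reap(struct mm_struct *victim_mm)
{
	if (test_bit(MMF_OOM_SKIP, &victim_mm->flags))
		return;
	/* the same trylock the oom_reaper kthread uses, from allocator context */
	if (!down_read_trylock(&victim_mm->mmap_sem))
		return;
	__oom_reap_task_mm(victim_mm);	/* unmap the reapable ranges */
	up_read(&victim_mm->mmap_sem);
}
----------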

But what I demonstrated is that

        /*
         * the exit path is guaranteed to finish the memory tear down
         * without any unbound blocking at this stage so make it clear
         * to the oom_reaper
         */

becomes a lie even when only one CPU was busy with realtime threads waiting
for an idle thread to make progress. If the page allocator stops telling a
lie that "an OOM victim is making progress on behalf of me", we can avoid
the lockup.

>           This is a fundamental issue of the locking without priority
> inheritance and other real time techniques.

That is nothing but evidence that you are refusing to address the possibility
of a lockup, and will keep piling up stupid lockup bugs. OOM handling does not
use locks when waiting for somebody else to make progress. Blaming "the locking
without priority inheritance" is wrong.

Each subsystem is responsible for avoiding lockups. If one subsystem is
triggering a lockup due to printk() flooding, that subsystem avoids the lockup
by reducing the printk() messages instead of abusing CPU resources. That's all
we can do for now. MM is not privileged enough to lock up the system.

> 
> > Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> 
> OK, if your whole point is to block any changes into this area unless
> it is your solution to be merged then I guess I will just push patches
> through with your explicit Nack on them. Your feedback was far fetched
> many times and has distracted the discussion way too often. This is
> especially sad because your testing and review was really helpful at
> times. I do not really have energy to argue the same set of arguments
> over and over again.
> 
> You have expressed unwillingness to understand the overall
> picture several times. You do not care about a long term maintenance
> burden of this code which is quite tricky already and refuse to
> understand the cost/benefit part.
> 
> If this series works for the workload reported by David I will simply
> push it through and let Andrew decide.

I'm skeptical about this approach. Given that exit_mmap() has to do more
things than what the OOM reaper kernel thread can do, how likely is it that
the OOM reaper kernel thread will find that exit_mmap() has completed both
the full unmap_vmas() and __unlink_vmas() within (at most) one second after
the OOM reaper completed only a subset of unmap_vmas()? If exit_mmap() was
preempted (due to scheduling priority), the OOM reaper will likely fail to
find it. Anyway, if you insist on this approach, I expect that exit_mmap()
asks the OOM reaper kernel thread to call __free_pgtables(), the OOM reaper
kernel thread sets MMF_OOM_SKIP, and exit_mmap() resumes remove_vma() etc.,
because the OOM reaper kernel thread ("has normal priority" and "can run
on any CPU") is more reliable than a random thread ("has idle priority" and
"can run on only limited CPU").

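A very rough sketch of that handoff, just to visualize the direction; the
helper and the completion are made-up names and nothing like this exists in
the series:

----------
	/*
	 * Hypothetical sketch: the normal-priority oom_reaper kthread, not
	 * the possibly idle-priority victim, runs __free_pgtables() and then
	 * sets MMF_OOM_SKIP. queue_pgtable_teardown_to_oom_reaper() is a
	 * made-up helper.
	 */
	if (oom) {
		DECLARE_COMPLETION_ONSTACK(done);

		/* oom_reaper side: __free_pgtables(); set_bit(MMF_OOM_SKIP,
		 * &mm->flags); complete(&done); */
		queue_pgtable_teardown_to_oom_reaper(mm, &tlb, vma, &done);
		wait_for_completion(&done);
	} else {
		free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
	}
	/* exit_mmap() then resumes remove_vma() etc. as before */
----------
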
Even supposing the OOM reaper can likely find it, there is a lack of review
from the arch maintainers (regarding whether this approach is really safe, and
whether this approach constrains future improvements).

>                                        If there is a lack of feedback
> I will just keep it around because it seems that most users do not care
> about these corner cases anyway.

No way. We are always failing to get attention regarding OOM handling.  ;-)

Even the discussion about how to avoid flooding by

    dump_header(oc, NULL);
    pr_warn("Out of memory and no killable processes...\n");

for memcg OOM is failing to make progress.

People make changes without thinking about OOM handling.
If you push this approach, first solve the lack of reviews.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v2 0/3] oom: rework oom_reaper vs. exit_mmap handoff
  2018-11-15  9:54       ` Tetsuo Handa
@ 2018-11-15 11:36         ` Michal Hocko
  2018-11-16 10:06           ` Tetsuo Handa
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2018-11-15 11:36 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: David Rientjes, Roman Gushchin, linux-mm, Andrew Morton, LKML,
	Linus Torvalds

On Thu 15-11-18 18:54:15, Tetsuo Handa wrote:
> On 2018/11/14 19:16, Michal Hocko wrote:
> > On Wed 14-11-18 18:46:13, Tetsuo Handa wrote:
> > [...]
> > > There is always an invisible lock called "scheduling priority". You can't
> > > leave the MMF_OOM_SKIP to the exit path. Your approach is not ready for
> > > handling the worst case.
> > 
> > And that problem is all over the memory reclaim. You can get starved
> > to death and block other resources. And the memory reclaim is not the
> > only one.
> 
> I think that it is a matter of good manners among kernel developers that no
> thread keeps consuming CPU resources forever. In the kernel world, doing
> 
>   while (1);
> 
> is not permitted. Likewise, doing
> 
>   for (i = 0; i < very_large_value; i++)
>       do_something_which_does_not_yield_CPU_to_others();

There is nothing like that proposed in this series.

> has to be avoided, in order to avoid lockup problems. We are required to
> yield CPU to others when we are waiting for somebody else to make progress.
> It is the page allocator who is refusing to yield CPU to those who need CPU.

And we do that in the reclaim path.

> Since the OOM reaper kernel thread "has normal priority" and "can run on any
> CPU", the possibility of failing to run is lower than an OOM victim thread
> which "has idle priority" and "can run on only limited CPU". You are trying
> to add a dependency on such thread, and I'm saying that adding a dependency
> on such thread increases possibility of lockup.

Sigh. No, this is not the case. All this patch series does is that we
hand over to the exiting task once it doesn't block on any locks
anymore. If the thread is low priority then it is quite likely that the
oom reaper is done by the time the victim even reaches the exit path.

> Yes, even the OOM reaper kernel thread might fail to run if all CPUs were
> busy with realtime threads waiting for the OOM reaper kernel thread to make
> progress. In that case, we had better stop relying on asynchronous memory
> reclaim, and switch to direct OOM reaping by allocating threads.
> 
> But what I demonstrated is that
> 
>         /*
>          * the exit path is guaranteed to finish the memory tear down
>          * without any unbound blocking at this stage so make it clear
>          * to the oom_reaper
>          */
> 
> becomes a lie even when only one CPU was busy with realtime threads waiting
> for an idle thread to make progress. If the page allocator stops telling a
> lie that "an OOM victim is making progress on behalf of me", we can avoid
> the lockup.

OK, I stopped reading right here. This discussion is pointless. Once you
busy loop all CPUs you are screwed. Are you going to blame a filesystem
that no progress can be made if a code path holding an important lock
is preempted by high priority stuff and no further progress can be
made? This is just ridiculous. What you are arguing here is not fixable
with the current upstream kernel. Even your so beloved timeout based
solution doesn't cope with that because oom reaper can be preempted for
an unbounded amount of time. Your argument just doesn't make much sense in
the context of the current kernel. Full stop.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH v2 0/3] oom: rework oom_reaper vs. exit_mmap handoff
  2018-11-15 11:36         ` Michal Hocko
@ 2018-11-16 10:06           ` Tetsuo Handa
  0 siblings, 0 replies; 18+ messages in thread
From: Tetsuo Handa @ 2018-11-16 10:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, Roman Gushchin, linux-mm, Andrew Morton, LKML,
	Linus Torvalds

On 2018/11/15 20:36, Michal Hocko wrote:
> On Thu 15-11-18 18:54:15, Tetsuo Handa wrote:
> > On 2018/11/14 19:16, Michal Hocko wrote:
> > > On Wed 14-11-18 18:46:13, Tetsuo Handa wrote:
> > > [...]
> > > > There is always an invisible lock called "scheduling priority". You can't
> > > > leave the MMF_OOM_SKIP to the exit path. Your approach is not ready for
> > > > handling the worst case.
> > > 
> > > And that problem is all over the memory reclaim. You can get starved
> > > to death and block other resources. And the memory reclaim is not the
> > > only one.
> > 
> > I think that it is a matter of good manners among kernel developers that no
> > thread keeps consuming CPU resources forever. In the kernel world, doing
> > 
> >   while (1);
> > 
> > is not permitted. Likewise, doing
> > 
> >   for (i = 0; i < very_large_value; i++)
> >       do_something_which_does_not_yield_CPU_to_others();
> 
> There is nothing like that proposed in this series.
> 
> > has to be avoided, in order to avoid lockup problems. We are required to
> > yield CPU to others when we are waiting for somebody else to make progress.
> > It is the page allocator who is refusing to yield CPU to those who need CPU.
> 
> And we do that in the reclaim path.
> 

No, we don't. Please explain why holding oom_lock versus not holding oom_lock
in the patch shown below makes a difference (i.e. holding oom_lock can avoid
lockups while not holding oom_lock fails to avoid them).

----------
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3137,50 +3137,53 @@ void exit_mmap(struct mm_struct *mm)
 	/*
 	 * oom_reaper cannot race with the page tables teardown but we
 	 * want to make sure that the exit path can take over the full
 	 * tear down when it is safe to do so
 	 */
 	if (oom) {
 		extern void my_setpriority(void);
 		down_write(&mm->mmap_sem);
 		__unlink_vmas(vma);
+		mutex_lock(&oom_lock);
 		/*
 		 * the exit path is guaranteed to finish the memory tear down
 		 * without any unbound blocking at this stage so make it clear
 		 * to the oom_reaper
 		 */
 		mm->mmap = NULL;
 		up_write(&mm->mmap_sem);
 		my_setpriority();
 		__free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	} else {
 		free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	}
 
 	tlb_finish_mmu(&tlb, 0, -1);
 
 	/*
+	 * Now that the full address space is torn down, make sure the
+	 * OOM killer skips over this task
+	 */
+	if (oom) {
+		set_bit(MMF_OOM_SKIP, &mm->flags);
+		mutex_unlock(&oom_lock);
+	}
+
+	/*
 	 * Walk the list again, actually closing and freeing it,
 	 * with preemption enabled, without holding any MM locks.
 	 */
 	while (vma) {
 		if (vma->vm_flags & VM_ACCOUNT)
 			nr_accounted += vma_pages(vma);
 		vma = remove_vma(vma);
 	}
 	vm_unacct_memory(nr_accounted);
-
-	/*
-	 * Now that the full address space is torn down, make sure the
-	 * OOM killer skips over this task
-	 */
-	if (oom)
-		set_bit(MMF_OOM_SKIP, &mm->flags);
 }
 
 /* Insert vm structure into process list sorted by address
  * and into the inode's i_mmap tree.  If vm_file is non-NULL
  * then i_mmap_rwsem is taken here.
  */
 int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 {
 	struct vm_area_struct *prev;

----------

> > Since the OOM reaper kernel thread "has normal priority" and "can run on any
> > CPU", the possibility of failing to run is lower than an OOM victim thread
> > which "has idle priority" and "can run on only limited CPU". You are trying
> > to add a dependency on such thread, and I'm saying that adding a dependency
> > on such thread increases possibility of lockup.
> 
> Sigh. No, this is not the case. All this patch series does is that we
> hand over to the exiting task once it doesn't block on any locks
> anymore. If the thread is low priority then it is quite likely that the
> oom reaper is done by the time the victim even reaches the exit path.

Not true. A thread executing exit_mmap() can change its priority at any moment.

Also, the OOM reaper kernel thread can fail to complete OOM reaping before
exit_mmap() does mm->mmap = NULL, due to e.g. a down_read_trylock(&mm->mmap_sem)
failure, OOM reaping of other OOM victims in different OOM domains, or being
preempted due to scheduling priority. You will notice it by trying both

----------
 	 */
 	if (mm->mmap)
 		set_bit(MMF_OOM_SKIP, &mm->flags);
+	else
+		pr_info("Handed over %u to exit path.\n", task_pid_nr(tsk));
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
----------

and

----------
 	 */
 	if (mm->mmap)
 		set_bit(MMF_OOM_SKIP, &mm->flags);
+	else if (!test_bit(MMF_OOM_SKIP, &mm->flags))
+		pr_info("Handed over %u to exit path.\n", task_pid_nr(tsk));
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
----------

and checking the frequency of printing "Handed over " messages.

> 
> > Yes, even the OOM reaper kernel thread might fail to run if all CPUs were
> > busy with realtime threads waiting for the OOM reaper kernel thread to make
> > progress. In that case, we had better stop relying on asynchronous memory
> > reclaim, and switch to direct OOM reaping by allocating threads.
> > 
> > But what I demonstrated is that
> > 
> >         /*
> >          * the exit path is guaranteed to finish the memory tear down
> >          * without any unbound blocking at this stage so make it clear
> >          * to the oom_reaper
> >          */
> > 
> > becomes a lie even when only one CPU was busy with realtime threads waiting
> > for an idle thread to make progress. If the page allocator stops telling a
> > lie that "an OOM victim is making progress on behalf of me", we can avoid
> > the lockup.
> 
> OK, I stopped reading right here. This discussion is pointless. Once you
> busy loop all CPUs you are screwed.

Look at the log files. I made only 1 CPU (out of 8 CPUs) busy.

>                                     Are you going to blame a filesystem
> that no progress can be made if a code path holding an important lock
> is preempted by high priority stuff and no further progress can be
> made?

I don't blame that case because what it is doing is not a kernel bug.

>       This is just ridiculous. What you are arguing here is not fixable
> with the current upstream kernel.

I do blame the memory allocation case because what it is doing is a kernel
bug, one which can be avoided if we stop telling a lie. No future kernel
is fixable as long as we keep telling that lie.

>                                   Even your so beloved timeout based
> solution doesn't cope with that because oom reaper can be preempted for
> an unbounded amount of time.

Yes, that's the reason I suggest direct OOM reaping.

>                         Your argument just doesn't make much sense in
> the context of the current kernel. Full stop.

This won't be a full stop at all. What I'm saying is "don't rely on a busy
polling loop". The reclaim path becomes a no-op for a thread executing
__free_pgtables() when memcg OOM is waiting for that thread to complete
__free_pgtables() and set MMF_OOM_SKIP.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2018-11-16 10:06 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-25  8:24 [RFC PATCH v2 0/3] oom: rework oom_reaper vs. exit_mmap handoff Michal Hocko
2018-10-25  8:24 ` [RFC PATCH v2 1/3] mm, oom: rework mmap_exit vs. oom_reaper synchronization Michal Hocko
2018-10-25  8:24 ` [RFC PATCH v2 2/3] mm, oom: keep retrying the oom_reap operation as long as there is substantial memory left Michal Hocko
2018-10-25  8:24 ` [RFC PATCH v2 3/3] mm, oom: hand over MMF_OOM_SKIP to exit path if it is guranteed to finish Michal Hocko
2018-10-30  4:45   ` Tetsuo Handa
2018-10-30  6:31     ` Michal Hocko
2018-10-30  9:47       ` Tetsuo Handa
2018-10-30 11:39         ` Michal Hocko
2018-10-30 12:02           ` Tetsuo Handa
2018-10-30 12:10             ` Michal Hocko
2018-10-30 13:57               ` Tetsuo Handa
2018-10-30 14:23                 ` Michal Hocko
2018-11-08  9:32 ` [RFC PATCH v2 0/3] oom: rework oom_reaper vs. exit_mmap handoff Michal Hocko
2018-11-14  9:46   ` Tetsuo Handa
2018-11-14 10:16     ` Michal Hocko
2018-11-15  9:54       ` Tetsuo Handa
2018-11-15 11:36         ` Michal Hocko
2018-11-16 10:06           ` Tetsuo Handa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).