linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 2/2] mm,oom: Try last second allocation after selecting an OOM victim.
       [not found] <1503577106-9196-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
@ 2017-08-24 12:18 ` Tetsuo Handa
  2017-08-24 13:18   ` Michal Hocko
  2017-08-24 13:03 ` [PATCH 1/2] mm,page_alloc: Don't call __node_reclaim() with oom_lock held Michal Hocko
  2017-08-25 20:47 ` Andrew Morton
  2 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2017-08-24 12:18 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, Tetsuo Handa, David Rientjes, Manish Jaggi, Mel Gorman,
	Michal Hocko, Oleg Nesterov, Vladimir Davydov, Vlastimil Babka

Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
count causes random kernel panics when an OOM victim which consumed memory
in a way the OOM reaper does not help was selected by the OOM killer [1].

Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
victim's mm were not able to try allocation from memory reserves after the
OOM reaper gave up reclaiming memory.

I proposed a patch which alllows task_will_free_mem(current) in
out_of_memory() to ignore MMF_OOM_SKIP for once so that all OOM victim
threads are guaranteed to have tried ALLOC_OOM allocation attempt before
start selecting next OOM victims [2], for Michal Hocko did not like
calling get_page_from_freelist() from the OOM killer which is a layer
violation [3]. But now, Michal thinks that calling get_page_from_freelist()
after task_will_free_mem(current) test is better than allowing
task_will_free_mem(current) to ignore MMF_OOM_SKIP for once [4], for
this would help other cases when we race with an exiting tasks or somebody
managed to free memory while we were selecting an OOM victim which can take
quite some time.

Thus, this patch brings "struct alloc_context" into the OOM killer layer
and does really last second get_page_from_freelist() attempt inside
oom_kill_process(). This patch calls whole __alloc_pages_slowpath() than
cherry-picks get_page_from_freelist() call, for we need to try ALLOC_OOM
allocation if task_is_oom_victim(current) == true (because
task_will_free_mem(current) not to ignore MMF_OOM_SKIP might have prevented
current thread from trying ALLOC_OOM allocation).

Although this patch tries to close all possible races which lead to
premature OOM killer invocation, compared to approaches which preserves
the layer and retry __alloc_pages_slowpath() without oom_lock held (e.g.
[2]), there are two races which cannot be closed by this patch.

  (1) Since we cannot use direct reclaim for this allocation attempt due to
      oom_lock already held, an OOM victim will be prematurely killed which
      could have been avoided if direct reclaim with oom_lock released was
      used.

  (2) Since we call panic() before calling oom_kill_process() when there is
      no killable process, panic() will be prematurely called which could
      have been avoided if [2] is used. For example, if a multithreaded
      application running with a dedicated CPUs/memory was OOM-killed, we
      can wait until ALLOC_OOM allocation fails to solve OOM situation.

Which approach should we take (this patch and/or [2]) ?

[1] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com
[2] http://lkml.kernel.org/r/201708191523.BJH90621.MHOOFFQSOLJFtV@I-love.SAKURA.ne.jp
[3] http://lkml.kernel.org/r/20160129152307.GF32174@dhcp22.suse.cz
[4] http://lkml.kernel.org/r/20170821131851.GJ25956@dhcp22.suse.cz

Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
Reported-by: Manish Jaggi <mjaggi@caviumnetworks.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/oom.h | 13 +++++++++++++
 mm/oom_kill.c       | 13 +++++++++++++
 mm/page_alloc.c     | 27 +++++++++++++++++++++------
 3 files changed, 47 insertions(+), 6 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 76aac4c..eb92aa8 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -13,6 +13,8 @@
 struct notifier_block;
 struct mem_cgroup;
 struct task_struct;
+struct alloc_context;
+struct page;
 
 /*
  * Details of the page allocation that triggered the oom killer that are used to
@@ -37,6 +39,15 @@ struct oom_control {
 	 */
 	const int order;
 
+	/* Context for really last second allocation attempt. */
+	struct alloc_context *ac;
+	/*
+	 * Set by the OOM killer if ac != NULL and last second allocation
+	 * attempt succeeded. If ac != NULL, the caller must check for
+	 * page != NULL.
+	 */
+	struct page *page;
+
 	/* Used by oom implementation, do not set */
 	unsigned long totalpages;
 	struct task_struct *chosen;
@@ -101,6 +112,8 @@ extern unsigned long oom_badness(struct task_struct *p,
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
+extern struct page *alloc_pages_before_oomkill(struct oom_control *oc);
+
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 99736e0..fe1aa30 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -832,6 +832,19 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	}
 	task_unlock(p);
 
+	/*
+	 * Try really last second allocation attempt after we selected an OOM
+	 * victim, for somebody might have managed to free memory while we were
+	 * selecting an OOM victim which can take quite some time.
+	 */
+	if (oc->ac) {
+		oc->page = alloc_pages_before_oomkill(oc);
+		if (oc->page) {
+			put_task_struct(p);
+			return;
+		}
+	}
+
 	if (__ratelimit(&oom_rs))
 		dump_header(oc, p);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 788318f..90b2de9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3277,7 +3277,8 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 
 static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
-	const struct alloc_context *ac, unsigned long *did_some_progress)
+		      struct alloc_context *ac,
+		      unsigned long *did_some_progress)
 {
 	struct oom_control oc = {
 		.zonelist = ac->zonelist,
@@ -3285,6 +3286,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 		.memcg = NULL,
 		.gfp_mask = gfp_mask,
 		.order = order,
+		.ac = ac,
 	};
 	struct page *page;
 
@@ -3347,16 +3349,17 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 		goto out;
 
 	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
+	if (out_of_memory(&oc)) {
+		*did_some_progress = 1;
+		page = oc.page;
+	} else if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
 		*did_some_progress = 1;
-
 		/*
 		 * Help non-failing allocations by giving them access to memory
 		 * reserves
 		 */
-		if (gfp_mask & __GFP_NOFAIL)
-			page = __alloc_pages_cpuset_fallback(gfp_mask, order,
-					ALLOC_NO_WATERMARKS, ac);
+		page = __alloc_pages_cpuset_fallback(gfp_mask, order,
+						     ALLOC_NO_WATERMARKS, ac);
 	}
 out:
 	mutex_unlock(&oom_lock);
@@ -4126,6 +4129,18 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return page;
 }
 
+struct page *alloc_pages_before_oomkill(struct oom_control *oc)
+{
+	/*
+	 * Make sure that this allocation attempt shall not depend on
+	 * __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocation, for the caller is
+	 * already holding oom_lock.
+	 */
+	return __alloc_pages_slowpath((oc->gfp_mask | __GFP_NOWARN) &
+				      ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL),
+				      oc->order, oc->ac);
+}
+
 static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 		int preferred_nid, nodemask_t *nodemask,
 		struct alloc_context *ac, gfp_t *alloc_mask,
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm,page_alloc: Don't call __node_reclaim() with oom_lock held.
       [not found] <1503577106-9196-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
  2017-08-24 12:18 ` [RFC PATCH 2/2] mm,oom: Try last second allocation after selecting an OOM victim Tetsuo Handa
@ 2017-08-24 13:03 ` Michal Hocko
  2017-08-25 20:47 ` Andrew Morton
  2 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2017-08-24 13:03 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, linux-mm, Mel Gorman, Vlastimil Babka

On Thu 24-08-17 21:18:25, Tetsuo Handa wrote:
> We are doing last second memory allocation attempt before calling
> out_of_memory(). But since slab shrinker functions might indirectly
> wait for other thread's __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory
> allocations via sleeping locks, calling slab shrinker functions from
> node_reclaim() from get_page_from_freelist() with oom_lock held has
> possibility of deadlock. Therefore, make sure that last second memory
> allocation attempt does not call slab shrinker functions.

OK, I have previously missed that node_reclaim does
gfpflags_allow_blocking

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5b983ee..788318f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3303,10 +3303,13 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>  	/*
>  	 * Go through the zonelist yet one more time, keep very high watermark
>  	 * here, this is only to catch a parallel oom killing, we must fail if
> -	 * we're still under heavy pressure.
> +	 * we're still under heavy pressure. But make sure that this reclaim
> +	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> +	 * allocation which will never fail due to oom_lock already held.
>  	 */
> -	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
> -					ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
> +	page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
> +				      ~__GFP_DIRECT_RECLAIM, order,
> +				      ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
>  	if (page)
>  		goto out;
>  
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] mm,oom: Try last second allocation after selecting an OOM victim.
  2017-08-24 12:18 ` [RFC PATCH 2/2] mm,oom: Try last second allocation after selecting an OOM victim Tetsuo Handa
@ 2017-08-24 13:18   ` Michal Hocko
  2017-08-24 14:40     ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-08-24 13:18 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, David Rientjes, Manish Jaggi, Mel Gorman,
	Oleg Nesterov, Vladimir Davydov, Vlastimil Babka

On Thu 24-08-17 21:18:26, Tetsuo Handa wrote:
> Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
> count causes random kernel panics when an OOM victim which consumed memory
> in a way the OOM reaper does not help was selected by the OOM killer [1].
> 
> Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
> to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
> victim's mm were not able to try allocation from memory reserves after the
> OOM reaper gave up reclaiming memory.
> 
> I proposed a patch which alllows task_will_free_mem(current) in
> out_of_memory() to ignore MMF_OOM_SKIP for once so that all OOM victim
> threads are guaranteed to have tried ALLOC_OOM allocation attempt before
> start selecting next OOM victims [2], for Michal Hocko did not like
> calling get_page_from_freelist() from the OOM killer which is a layer
> violation [3]. But now, Michal thinks that calling get_page_from_freelist()
> after task_will_free_mem(current) test is better than allowing
> task_will_free_mem(current) to ignore MMF_OOM_SKIP for once [4], for
> this would help other cases when we race with an exiting tasks or somebody
> managed to free memory while we were selecting an OOM victim which can take
> quite some time.

This a lot of text which can be more confusing than helpful. Could you
state the problem clearly without detours? Yes, the oom killer selection
can race with those freeing memory. And it has been like that since
basically ever. Doing a last minute allocation attempt might help. Now
there are more important questions. How likely is that. Do people have
to care? __alloc_pages_may_oom already does a almost-the-last moment
allocation. Do we still need it? It also does ALLOC_WMARK_HIGH
allocation which your path doesn't do. I wanted to remove this some time
ago but it has been pointed out that this was really needed
https://patchwork.kernel.org/patch/8153841/ Maybe things have changed
and if so please explain.

[...]
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] mm,oom: Try last second allocation after selecting an OOM victim.
  2017-08-24 13:18   ` Michal Hocko
@ 2017-08-24 14:40     ` Tetsuo Handa
  2017-08-25  8:00       ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2017-08-24 14:40 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, linux-mm, rientjes, mjaggi, mgorman, oleg, vdavydov.dev, vbabka

Michal Hocko wrote:
> On Thu 24-08-17 21:18:26, Tetsuo Handa wrote:
> > Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
> > count causes random kernel panics when an OOM victim which consumed memory
> > in a way the OOM reaper does not help was selected by the OOM killer [1].
> > 
> > Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> > oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
> > to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
> > victim's mm were not able to try allocation from memory reserves after the
> > OOM reaper gave up reclaiming memory.
> > 
> > I proposed a patch which alllows task_will_free_mem(current) in
> > out_of_memory() to ignore MMF_OOM_SKIP for once so that all OOM victim
> > threads are guaranteed to have tried ALLOC_OOM allocation attempt before
> > start selecting next OOM victims [2], for Michal Hocko did not like
> > calling get_page_from_freelist() from the OOM killer which is a layer
> > violation [3]. But now, Michal thinks that calling get_page_from_freelist()
> > after task_will_free_mem(current) test is better than allowing
> > task_will_free_mem(current) to ignore MMF_OOM_SKIP for once [4], for
> > this would help other cases when we race with an exiting tasks or somebody
> > managed to free memory while we were selecting an OOM victim which can take
> > quite some time.
> 
> This a lot of text which can be more confusing than helpful. Could you
> state the problem clearly without detours? Yes, the oom killer selection
> can race with those freeing memory. And it has been like that since
> basically ever.

The problem which Manish Jaggi reported (and I can still reproduce) is that
the OOM killer ignores MMF_OOM_SKIP mm too early. And the problem became real
in 4.8 due to commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
oom_reaped tasks"). Thus, it has _not_ been like that since basically ever.

>                 Doing a last minute allocation attempt might help. Now
> there are more important questions. How likely is that. Do people have
> to care? __alloc_pages_may_oom already does a almost-the-last moment
> allocation. Do we still need it?

get_page_from_freelist() in __alloc_pages_may_oom() would help only if
MMF_OOM_SKIP is set after some memory is reclaimed. But the problem is
that MMF_OOM_SKIP is set without reclaiming any memory.

>                                  It also does ALLOC_WMARK_HIGH
> allocation which your path doesn't do.

The intent of this patch is to replace "[PATCH v2] mm, oom:
task_will_free_mem(current) should ignore MMF_OOM_SKIP for once."
which you have nacked 3 days ago.

>                                        I wanted to remove this some time
> ago but it has been pointed out that this was really needed
> https://patchwork.kernel.org/patch/8153841/ Maybe things have changed
> and if so please explain.

get_page_from_freelist() in __alloc_pages_may_oom() will remain needed
because it can help allocations which do not call oom_kill_process() (i.e.
allocations which do "goto out;" in __alloc_pages_may_oom() without calling
out_of_memory(), and allocations which do "return;" in out_of_memory()
without calling oom_kill_process() (e.g. !__GFP_FS)) to succeed.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] mm,oom: Try last second allocation after selecting an OOM victim.
  2017-08-24 14:40     ` Tetsuo Handa
@ 2017-08-25  8:00       ` Michal Hocko
  2017-09-09  0:55         ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-08-25  8:00 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, linux-mm, rientjes, mjaggi, mgorman, oleg, vdavydov.dev, vbabka

On Thu 24-08-17 23:40:36, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 24-08-17 21:18:26, Tetsuo Handa wrote:
> > > Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
> > > count causes random kernel panics when an OOM victim which consumed memory
> > > in a way the OOM reaper does not help was selected by the OOM killer [1].
> > > 
> > > Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> > > oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
> > > to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
> > > victim's mm were not able to try allocation from memory reserves after the
> > > OOM reaper gave up reclaiming memory.
> > > 
> > > I proposed a patch which alllows task_will_free_mem(current) in
> > > out_of_memory() to ignore MMF_OOM_SKIP for once so that all OOM victim
> > > threads are guaranteed to have tried ALLOC_OOM allocation attempt before
> > > start selecting next OOM victims [2], for Michal Hocko did not like
> > > calling get_page_from_freelist() from the OOM killer which is a layer
> > > violation [3]. But now, Michal thinks that calling get_page_from_freelist()
> > > after task_will_free_mem(current) test is better than allowing
> > > task_will_free_mem(current) to ignore MMF_OOM_SKIP for once [4], for
> > > this would help other cases when we race with an exiting tasks or somebody
> > > managed to free memory while we were selecting an OOM victim which can take
> > > quite some time.
> > 
> > This a lot of text which can be more confusing than helpful. Could you
> > state the problem clearly without detours? Yes, the oom killer selection
> > can race with those freeing memory. And it has been like that since
> > basically ever.
> 
> The problem which Manish Jaggi reported (and I can still reproduce) is that
> the OOM killer ignores MMF_OOM_SKIP mm too early. And the problem became real
> in 4.8 due to commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> oom_reaped tasks"). Thus, it has _not_ been like that since basically ever.

Again, you are mixing more things together. Manish usecase triggers a
pathological case where the oom reaper is not able to reclaim basically
any memory and so we unnecessarily kill another victim if the original
one doesn't finish quick enough.

This patch and your former attempts will only help (for that particular
case) if the victim itself wanted to allocate and didn't manage to pass
through the ALLOC_OOM attempt since it was killed. This yet again a
corner case and something this patch won't plug in general (it only
takes another task to go that path). That's why I consider that
confusing to mention in the changelog.

What I am trying to say is that time-to-check vs time-to-kill has
been a race window since ever and a large amount of memory can be
released during that time. This patch definitely reduces that time
window _considerably_. There is still a race window left but this is
inherently racy so you could argue that the remaining window is small to
lose sleep over. After all this is a corner case again. From my years of
experience with OOM reports I haven't met many (if any) cases like that.
So the primary question is whether we do care about this race window
enough to even try to fix it. Considering an absolute lack of reports
I would tend to say we don't but if the fix can be made non-intrusive
which seems likely then we actually can try it out at least.

> >                                        I wanted to remove this some time
> > ago but it has been pointed out that this was really needed
> > https://patchwork.kernel.org/patch/8153841/ Maybe things have changed
> > and if so please explain.
> 
> get_page_from_freelist() in __alloc_pages_may_oom() will remain needed
> because it can help allocations which do not call oom_kill_process() (i.e.
> allocations which do "goto out;" in __alloc_pages_may_oom() without calling
> out_of_memory(), and allocations which do "return;" in out_of_memory()
> without calling oom_kill_process() (e.g. !__GFP_FS)) to succeed.

I do not understand. Those request will simply back off and retry the
allocation or bail out and fail the allocation. My primary question was

: that the above link contains an explanation from Andrea that the reason
: for the high wmark is to reduce the likelihood of livelocks and be sure
: to invoke the OOM killer,

I am not sure how much that reason applies to the current code but if it
still applies then we should do the same for later
last-minute-allocation as well. Having both and disagreeing is just a
mess.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm,page_alloc: Don't call __node_reclaim() with oom_lock held.
       [not found] <1503577106-9196-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
  2017-08-24 12:18 ` [RFC PATCH 2/2] mm,oom: Try last second allocation after selecting an OOM victim Tetsuo Handa
  2017-08-24 13:03 ` [PATCH 1/2] mm,page_alloc: Don't call __node_reclaim() with oom_lock held Michal Hocko
@ 2017-08-25 20:47 ` Andrew Morton
  2017-08-26  1:28   ` Tetsuo Handa
  2 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2017-08-25 20:47 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, Mel Gorman, Michal Hocko, Vlastimil Babka

On Thu, 24 Aug 2017 21:18:25 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:

> We are doing last second memory allocation attempt before calling
> out_of_memory(). But since slab shrinker functions might indirectly
> wait for other thread's __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory
> allocations via sleeping locks, calling slab shrinker functions from
> node_reclaim() from get_page_from_freelist() with oom_lock held has
> possibility of deadlock. Therefore, make sure that last second memory
> allocation attempt does not call slab shrinker functions.

I wonder if there's any way we could gert lockdep to detect this sort
of thing.

Has the deadlock been observed in testing?  Do we think this fix
should be backported into -stable?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm,page_alloc: Don't call __node_reclaim() with oom_lock held.
  2017-08-25 20:47 ` Andrew Morton
@ 2017-08-26  1:28   ` Tetsuo Handa
  2017-08-27  4:17     ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2017-08-26  1:28 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, mgorman, mhocko, vbabka

Andrew Morton wrote:
> On Thu, 24 Aug 2017 21:18:25 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> 
> > We are doing last second memory allocation attempt before calling
> > out_of_memory(). But since slab shrinker functions might indirectly
> > wait for other thread's __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory
> > allocations via sleeping locks, calling slab shrinker functions from
> > node_reclaim() from get_page_from_freelist() with oom_lock held has
> > possibility of deadlock. Therefore, make sure that last second memory
> > allocation attempt does not call slab shrinker functions.
> 
> I wonder if there's any way we could gert lockdep to detect this sort
> of thing.

That is hopeless regarding MM subsystem.

The root problem is that MM subsystem assumes that somebody else shall make
progress for me. And direct reclaim does not check for other thread's progress
(e.g. too_many_isolated() looping forever waiting for kswapd) and continue
consuming CPU resource (e.g. deprive a thread doing schedule_timeout_killable()
with oom_lock held of all CPU time for doing pointless get_page_from_freelist()
etc.).

Since the page allocator chooses retry the attempt rather than wait for locks,
lockdep won't help. The dependency is spreaded to all threads with timing and
threshold checks, preventing threads from calling operations which lockdep
will detect.

I do wish we can get rid of __GFP_DIRECT_RECLAIM and offload memory reclaim
operation to some kswapd-like kernel threads. Then, we would be able to check
progress of relevant threads and invoke the OOM killer as needed (rather than
doing __GFP_FS check in out_of_memory()), as well as implementing __GFP_KILLABLE.

> 
> Has the deadlock been observed in testing?  Do we think this fix
> should be backported into -stable?

I have never observed this deadlock, but it is hard for everybody to know
if he/she hit this deadlock. The only clue which is available since 4.9+
(though still unreliable) is warn_alloc() complaining memory allocation is
stalling for some reason. For users using 2.6.18/2.6.32/3.10 kernels, they
have absolutely no clue to know it (other than using SysRq-t etc. which is
generating too much messages and asking for too much efforts).

Judging from my experience at a support center, it is too difficult for users
to report memory allocation hangs. It requires users to stand by in front of
the console twenty-four seven so that we get SysRq-t etc. whenever a memory
allocation related problem is suspected. We can't ask users for such effort.
There is no report does not mean memory allocation hang is not occurring in
the real life. But nobody (other than me) is interested in adding asynchronous
watchdog like kmallocwd. Thus, I'm spending much effort for finding potential
lockup bugs using stress tests, and Michal do not care bugs which are found by
stress tests, and nobody else are responding, and users do not have a reliable
mean to report lockup bugs caused by memory allocation (e.g. kmallocwd).

Sigh.....

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] mm,page_alloc: Don't call __node_reclaim() with oom_lock held.
  2017-08-26  1:28   ` Tetsuo Handa
@ 2017-08-27  4:17     ` Tetsuo Handa
  0 siblings, 0 replies; 21+ messages in thread
From: Tetsuo Handa @ 2017-08-27  4:17 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, mgorman, mhocko, vbabka

Tetsuo Handa wrote:
> Andrew Morton wrote:
> > On Thu, 24 Aug 2017 21:18:25 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> > 
> > > We are doing last second memory allocation attempt before calling
> > > out_of_memory(). But since slab shrinker functions might indirectly
> > > wait for other thread's __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory
> > > allocations via sleeping locks, calling slab shrinker functions from
> > > node_reclaim() from get_page_from_freelist() with oom_lock held has
> > > possibility of deadlock. Therefore, make sure that last second memory
> > > allocation attempt does not call slab shrinker functions.
> > 
> > I wonder if there's any way we could gert lockdep to detect this sort
> > of thing.
> 
> That is hopeless regarding MM subsystem.
> 
> The root problem is that MM subsystem assumes that somebody else shall make
> progress for me. And direct reclaim does not check for other thread's progress
> (e.g. too_many_isolated() looping forever waiting for kswapd) and continue
> consuming CPU resource (e.g. deprive a thread doing schedule_timeout_killable()
> with oom_lock held of all CPU time for doing pointless get_page_from_freelist()
> etc.).
> 
> Since the page allocator chooses retry the attempt rather than wait for locks,
> lockdep won't help. The dependency is spreaded to all threads with timing and
> threshold checks, preventing threads from calling operations which lockdep
> will detect.

Here is an example. If we use trylock, lockdep cannot detect the dependency.

----------
#include <linux/module.h>

static int __init test_init(void)
{
        static DEFINE_MUTEX(my_oom_lock);
        static DEFINE_MUTEX(some_shrinker_lock);

        if (!mutex_trylock(&my_oom_lock))
                return -EINVAL;
        mutex_lock(&some_shrinker_lock);
        mutex_unlock(&some_shrinker_lock);
        mutex_unlock(&my_oom_lock);

        mutex_lock(&some_shrinker_lock);
        if (!mutex_trylock(&my_oom_lock))
                goto unlock;
        mutex_unlock(&my_oom_lock);
 unlock:
        mutex_unlock(&some_shrinker_lock);
        return -EINVAL;
}

module_init(test_init);
MODULE_LICENSE("GPL");
----------

If we don't use trylock, lockdep can detect the dependency.

----------
#include <linux/module.h>

static int __init test_init(void)
{
        static DEFINE_MUTEX(my_oom_lock);
        static DEFINE_MUTEX(some_shrinker_lock);

        if (!mutex_trylock(&my_oom_lock))
                return -EINVAL;
        mutex_lock(&some_shrinker_lock);
        mutex_unlock(&some_shrinker_lock);
        mutex_unlock(&my_oom_lock);

        mutex_lock(&some_shrinker_lock);
        mutex_lock(&my_oom_lock);
        mutex_unlock(&my_oom_lock);
        mutex_unlock(&some_shrinker_lock);
        return -EINVAL;
}

module_init(test_init);
MODULE_LICENSE("GPL");
----------

----------
[  276.343968] ======================================================
[  276.345523] WARNING: possible circular locking dependency detected
[  276.347101] 4.13.0-rc5-next-20170817 #666 Tainted: G           O
[  276.349421] ------------------------------------------------------
[  276.351506] insmod/9628 is trying to acquire lock:
[  276.354961]  (my_oom_lock){+.+.}, at: [<ffffffffa015f056>] test_init+0x56/0x1000 [test]
[  276.357598]
but task is already holding lock:
[  276.359453]  (some_shrinker_lock){+.+.}, at: [<ffffffffa015f048>] test_init+0x48/0x1000 [test]
[  276.361693]
which lock already depends on the new lock.

[  276.364402]
the existing dependency chain (in reverse order) is:
[  276.367241]
-> #1 (some_shrinker_lock){+.+.}:
[  276.369246]        __lock_acquire+0x1292/0x15f0
[  276.370752]        lock_acquire+0x82/0xd0
[  276.372156]        __mutex_lock+0x83/0x950
[  276.373533]        mutex_lock_nested+0x16/0x20
[  276.374894]        test_init+0x22/0x1000 [test]
[  276.376396]        do_one_initcall+0x4c/0x1a0
[  276.377752]        do_init_module+0x56/0x1ea
[  276.379110]        load_module+0x20d1/0x2600
[  276.380409]        SyS_finit_module+0xc5/0xd0
[  276.381892]        do_syscall_64+0x61/0x1d0
[  276.383391]        return_from_SYSCALL_64+0x0/0x7a
[  276.385004]
-> #0 (my_oom_lock){+.+.}:
[  276.386921]        check_prev_add+0x832/0x840
[  276.388210]        __lock_acquire+0x1292/0x15f0
[  276.389526]        lock_acquire+0x82/0xd0
[  276.390670]        __mutex_lock+0x83/0x950
[  276.391994]        mutex_lock_nested+0x16/0x20
[  276.393793]        test_init+0x56/0x1000 [test]
[  276.395169]        do_one_initcall+0x4c/0x1a0
[  276.396883]        do_init_module+0x56/0x1ea
[  276.399988]        load_module+0x20d1/0x2600
[  276.401669]        SyS_finit_module+0xc5/0xd0
[  276.403050]        do_syscall_64+0x61/0x1d0
[  276.404407]        return_from_SYSCALL_64+0x0/0x7a
[  276.405771]
other info that might help us debug this:

[  276.408627]  Possible unsafe locking scenario:

[  276.410712]        CPU0                    CPU1
[  276.412094]        ----                    ----
[  276.413373]   lock(some_shrinker_lock);
[  276.414489]                                lock(my_oom_lock);
[  276.416584]                                lock(some_shrinker_lock);
[  276.418423]   lock(my_oom_lock);
[  276.419639]
 *** DEADLOCK ***

[  276.421869] 1 lock held by insmod/9628:
[  276.422945]  #0:  (some_shrinker_lock){+.+.}, at: [<ffffffffa015f048>] test_init+0x48/0x1000 [test]
[  276.425026]
stack backtrace:
[  276.426627] CPU: 3 PID: 9628 Comm: insmod Tainted: G           O    4.13.0-rc5-next-20170817 #666
[  276.428668] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  276.431176] Call Trace:
[  276.432108]  dump_stack+0x67/0x9e
[  276.433320]  print_circular_bug+0x390/0x39e
[  276.434728]  check_prev_add+0x832/0x840
[  276.436114]  ? add_lock_to_list.isra.25+0xe0/0xe0
[  276.437465]  ? native_sched_clock+0x36/0xa0
[  276.438755]  ? add_lock_to_list.isra.25+0xe0/0xe0
[  276.440107]  __lock_acquire+0x1292/0x15f0
[  276.441366]  ? test_init+0x56/0x1000 [test]
[  276.442683]  lock_acquire+0x82/0xd0
[  276.443799]  ? test_init+0x56/0x1000 [test]
[  276.445127]  __mutex_lock+0x83/0x950
[  276.446297]  ? test_init+0x56/0x1000 [test]
[  276.447542]  ? test_init+0x56/0x1000 [test]
[  276.448896]  ? 0xffffffffa015f000
[  276.450321]  ? __mutex_unlock_slowpath+0x4d/0x2a0
[  276.451822]  ? 0xffffffffa015f000
[  276.452979]  mutex_lock_nested+0x16/0x20
[  276.454125]  test_init+0x56/0x1000 [test]
[  276.455296]  do_one_initcall+0x4c/0x1a0
[  276.456752]  ? do_init_module+0x1d/0x1ea
[  276.457932]  ? kmem_cache_alloc+0x19e/0x1e0
[  276.459135]  do_init_module+0x56/0x1ea
[  276.460336]  load_module+0x20d1/0x2600
[  276.461534]  ? __symbol_get+0x90/0x90
[  276.462681]  SyS_finit_module+0xc5/0xd0
[  276.463873]  do_syscall_64+0x61/0x1d0
[  276.464968]  entry_SYSCALL64_slow_path+0x25/0x25
[  276.466432] RIP: 0033:0x7fca5de2f7f9
[  276.467839] RSP: 002b:00007ffe0cc50238 EFLAGS: 00000202 ORIG_RAX: 0000000000000139
[  276.469852] RAX: ffffffffffffffda RBX: 00000000010b6220 RCX: 00007fca5de2f7f9
[  276.471918] RDX: 0000000000000000 RSI: 000000000041a2d8 RDI: 0000000000000003
[  276.471919] RBP: 000000000041a2d8 R08: 0000000000000000 R09: 00007ffe0cc503d8
[  276.471919] R10: 0000000000000003 R11: 0000000000000202 R12: 0000000000000000
[  276.471920] R13: 00000000010b61f0 R14: 0000000000000000 R15: 0000000000000000
----------

This is because the page allocator pretends that progress is made
without waiting for lock, but actually the page allocator cannot
make progress without waiting for lock. As long as we use trylock,
lockdep won't detect "this sort of thing". (Or we want to jump into
lockdep annotation hell?)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] mm,oom: Try last second allocation after selecting an OOM victim.
  2017-08-25  8:00       ` Michal Hocko
@ 2017-09-09  0:55         ` Tetsuo Handa
       [not found]           ` <201710172204.AGG30740.tVHJFFOQLMSFOO@I-love.SAKURA.ne.jp>
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2017-09-09  0:55 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, linux-mm, rientjes, mjaggi, mgorman, oleg, vdavydov.dev, vbabka

There is no response to your suggestion. Can we agree with going to this direction?
If no response, for now I push ignore MMF_OOM_SKIP for once approach.

Michal Hocko wrote:
> On Thu 24-08-17 23:40:36, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Thu 24-08-17 21:18:26, Tetsuo Handa wrote:
> > > > Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
> > > > count causes random kernel panics when an OOM victim which consumed memory
> > > > in a way the OOM reaper does not help was selected by the OOM killer [1].
> > > > 
> > > > Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> > > > oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
> > > > to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
> > > > victim's mm were not able to try allocation from memory reserves after the
> > > > OOM reaper gave up reclaiming memory.
> > > > 
> > > > I proposed a patch which alllows task_will_free_mem(current) in
> > > > out_of_memory() to ignore MMF_OOM_SKIP for once so that all OOM victim
> > > > threads are guaranteed to have tried ALLOC_OOM allocation attempt before
> > > > start selecting next OOM victims [2], for Michal Hocko did not like
> > > > calling get_page_from_freelist() from the OOM killer which is a layer
> > > > violation [3]. But now, Michal thinks that calling get_page_from_freelist()
> > > > after task_will_free_mem(current) test is better than allowing
> > > > task_will_free_mem(current) to ignore MMF_OOM_SKIP for once [4], for
> > > > this would help other cases when we race with an exiting tasks or somebody
> > > > managed to free memory while we were selecting an OOM victim which can take
> > > > quite some time.
> > > 
> > > This a lot of text which can be more confusing than helpful. Could you
> > > state the problem clearly without detours? Yes, the oom killer selection
> > > can race with those freeing memory. And it has been like that since
> > > basically ever.
> > 
> > The problem which Manish Jaggi reported (and I can still reproduce) is that
> > the OOM killer ignores MMF_OOM_SKIP mm too early. And the problem became real
> > in 4.8 due to commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> > oom_reaped tasks"). Thus, it has _not_ been like that since basically ever.
> 
> Again, you are mixing more things together. Manish usecase triggers a
> pathological case where the oom reaper is not able to reclaim basically
> any memory and so we unnecessarily kill another victim if the original
> one doesn't finish quick enough.
> 
> This patch and your former attempts will only help (for that particular
> case) if the victim itself wanted to allocate and didn't manage to pass
> through the ALLOC_OOM attempt since it was killed. This yet again a
> corner case and something this patch won't plug in general (it only
> takes another task to go that path). That's why I consider that
> confusing to mention in the changelog.
> 
> What I am trying to say is that time-to-check vs time-to-kill has
> been a race window since ever and a large amount of memory can be
> released during that time. This patch definitely reduces that time
> window _considerably_. There is still a race window left but this is
> inherently racy so you could argue that the remaining window is small to
> lose sleep over. After all this is a corner case again. From my years of
> experience with OOM reports I haven't met many (if any) cases like that.
> So the primary question is whether we do care about this race window
> enough to even try to fix it. Considering an absolute lack of reports
> I would tend to say we don't but if the fix can be made non-intrusive
> which seems likely then we actually can try it out at least.
> 
> > >                                        I wanted to remove this some time
> > > ago but it has been pointed out that this was really needed
> > > https://patchwork.kernel.org/patch/8153841/ Maybe things have changed
> > > and if so please explain.
> > 
> > get_page_from_freelist() in __alloc_pages_may_oom() will remain needed
> > because it can help allocations which do not call oom_kill_process() (i.e.
> > allocations which do "goto out;" in __alloc_pages_may_oom() without calling
> > out_of_memory(), and allocations which do "return;" in out_of_memory()
> > without calling oom_kill_process() (e.g. !__GFP_FS)) to succeed.
> 
> I do not understand. Those request will simply back off and retry the
> allocation or bail out and fail the allocation. My primary question was
> 
> : that the above link contains an explanation from Andrea that the reason
> : for the high wmark is to reduce the likelihood of livelocks and be sure
> : to invoke the OOM killer,
> 
> I am not sure how much that reason applies to the current code but if it
> still applies then we should do the same for later
> last-minute-allocation as well. Having both and disagreeing is just a
> mess.
> -- 
> Michal Hocko
> SUSE Labs
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] mm,oom: Try last second allocation after selecting an OOM victim.
       [not found]           ` <201710172204.AGG30740.tVHJFFOQLMSFOO@I-love.SAKURA.ne.jp>
@ 2017-10-20 12:40             ` Michal Hocko
  2017-10-20 14:18               ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-10-20 12:40 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: aarcange, hannes, akpm, linux-mm, rientjes, mjaggi, mgorman,
	oleg, vdavydov.dev, vbabka

On Tue 17-10-17 22:04:59, Tetsuo Handa wrote:
[...]
> I checked http://lkml.kernel.org/r/20160128163802.GA15953@dhcp22.suse.cz but
> I didn't find reason to use high watermark for the last second allocation
> attempt. The only thing required for avoiding livelock will be "do not
> depend on __GFP_DIRECT_RECLAIM allocation while oom_lock is held".

Andrea tried to explain it http://lkml.kernel.org/r/20160128190204.GJ12228@redhat.com
"
: Elaborating the comment: the reason for the high wmark is to reduce
: the likelihood of livelocks and be sure to invoke the OOM killer, if
: we're still under pressure and reclaim just failed. The high wmark is
: used to be sure the failure of reclaim isn't going to be ignored. If
: using the min wmark like you propose there's risk of livelock or
: anyway of delayed OOM killer invocation.
: 
: The reason for doing one last wmark check (regardless of the wmark
: used) before invoking the oom killer, was just to be sure another OOM
: killer invocation hasn't already freed a ton of memory while we were
: stuck in reclaim. A lot of free memory generated by the OOM killer,
: won't make a parallel reclaim more likely to succeed, it just creates
: free memory, but reclaim only succeeds when it finds "freeable" memory
: and it makes progress in converting it to free memory. So for the
: purpose of this last check, the high wmark would work fine as lots of
: free memory would have been generated in such case.
"

I've had some problems with this reasoning for the current OOM killer
logic but I haven't been convincing enough. Maybe you will have a better
luck.

> Below is updated patch. The motivation of this patch is to guarantee that
> the thread (it can be SCHED_IDLE priority) calling out_of_memory() can use
> enough CPU resource by saving CPU resource wasted by threads (they can be
> !SCHED_IDLE priority) waiting for out_of_memory(). Thus, replace
> mutex_trylock() with mutex_lock_killable().

So what exactly guanratees SCHED_IDLE running while other high priority
processes keep preempting it while it holds the oom lock? Not everybody
is inside the allocation path to get out of the way.
> 
> By replacing mutex_trylock() with mutex_lock_killable(), it might prevent
> the OOM reaper from start reaping immediately. Thus, remove mutex_lock() from
> the OOM reaper.

oom_lock shouldn't be necessary in oom_reaper anymore and that is worth
a separate patch.
 
> By removing mutex_lock() from the OOM reaper, the race window of needlessly
> selecting next OOM victim becomes wider, for the last second allocation
> attempt no longer waits for the OOM reaper. Thus, do the really last
> allocation attempt after selecting an OOM victim using the same watermark.
> 
> Can we go with this direction?

The patch is just too cluttered. You do not want to use
__alloc_pages_slowpath. get_page_from_freelist would be more
appropriate. Also doing alloc_pages_before_oomkill two times seems to be
excessive.

That being said, make sure you adrress all the concerns brought up by
Andrea and Johannes in the above email thread first.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] mm,oom: Try last second allocation after selecting an OOM victim.
  2017-10-20 12:40             ` Michal Hocko
@ 2017-10-20 14:18               ` Tetsuo Handa
  2017-10-23 11:30                 ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2017-10-20 14:18 UTC (permalink / raw)
  To: mhocko
  Cc: aarcange, hannes, akpm, linux-mm, rientjes, mjaggi, mgorman,
	oleg, vdavydov.dev, vbabka

Michal Hocko wrote:
> On Tue 17-10-17 22:04:59, Tetsuo Handa wrote:
> > Below is updated patch. The motivation of this patch is to guarantee that
> > the thread (it can be SCHED_IDLE priority) calling out_of_memory() can use
> > enough CPU resource by saving CPU resource wasted by threads (they can be
> > !SCHED_IDLE priority) waiting for out_of_memory(). Thus, replace
> > mutex_trylock() with mutex_lock_killable().
> 
> So what exactly guanratees SCHED_IDLE running while other high priority
> processes keep preempting it while it holds the oom lock? Not everybody
> is inside the allocation path to get out of the way.

I think that that is a too much worry. If you worry such possibility,
current assumption

	/*
	 * Acquire the oom lock.  If that fails, somebody else is
	 * making progress for us.
	 */

is horribly broken. Also, high priority threads keep preempting will
prevent low priority threads from reaching __alloc_pages_may_oom()
because preemption will occur not only during a low priority thread is
holding oom_lock but also while oom_lock is not held. We can try to
reduce preemption while oom_lock is held by scattering around
preempt_disable()/preempt_enable(). But you said you don't want to
disable preemption during OOM kill operation when I proposed scattering
patch, didn't you?

So, I think that worrying about high priority threads preventing the low
priority thread with oom_lock held is too much. Preventing high priority
threads waiting for oom_lock from disturbing the low priority thread with
oom_lock held by wasting CPU resource will be sufficient.

If you don't like it, the only way will be to offload to a dedicated
kernel thread (like the OOM reaper) so that allocating threads are
no longer blocked by oom_lock. That's a big change.

> > 
> > By replacing mutex_trylock() with mutex_lock_killable(), it might prevent
> > the OOM reaper from start reaping immediately. Thus, remove mutex_lock() from
> > the OOM reaper.
> 
> oom_lock shouldn't be necessary in oom_reaper anymore and that is worth
> a separate patch.

I'll propose as a separate patch after we apply "mm, oom:
task_will_free_mem(current) should ignore MMF_OOM_SKIP for once." or
we call __alloc_pages_slowpath() with oom_lock held.

>  
> > By removing mutex_lock() from the OOM reaper, the race window of needlessly
> > selecting next OOM victim becomes wider, for the last second allocation
> > attempt no longer waits for the OOM reaper. Thus, do the really last
> > allocation attempt after selecting an OOM victim using the same watermark.
> > 
> > Can we go with this direction?
> 
> The patch is just too cluttered. You do not want to use
> __alloc_pages_slowpath. get_page_from_freelist would be more
> appropriate. Also doing alloc_pages_before_oomkill two times seems to be
> excessive.

This patch is intentionally calling __alloc_pages_slowpath() because
it handles ALLOC_OOM by calling __gfp_pfmemalloc_flags(). If this patch
calls only get_page_from_freelist(), we will fail to try ALLOC_OOM before
calling out_of_memory() (when current thread is selected as OOM victim
while waiting for oom_lock) and just before sending SIGKILL (when
task_will_free_mem(current) in out_of_memory() returned false because
MMF_OOM_SKIP was set before ALLOC_OOM allocation is attempted) unless
we apply "mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP
for once.".

> 
> That being said, make sure you adrress all the concerns brought up by
> Andrea and Johannes in the above email thread first.

I don't think there are concerns if we wait for oom_lock.
The only concern will be do not depend on __GFP_DIRECT_RECLAIM allocation
while oom_lock is held. Andrea and Johannes, what are your concerns?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] mm,oom: Try last second allocation after selecting an OOM victim.
  2017-10-20 14:18               ` Tetsuo Handa
@ 2017-10-23 11:30                 ` Michal Hocko
  2017-10-24 11:24                   ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-10-23 11:30 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: aarcange, hannes, akpm, linux-mm, rientjes, mjaggi, mgorman,
	oleg, vdavydov.dev, vbabka

On Fri 20-10-17 23:18:19, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 17-10-17 22:04:59, Tetsuo Handa wrote:
> > > Below is updated patch. The motivation of this patch is to guarantee that
> > > the thread (it can be SCHED_IDLE priority) calling out_of_memory() can use
> > > enough CPU resource by saving CPU resource wasted by threads (they can be
> > > !SCHED_IDLE priority) waiting for out_of_memory(). Thus, replace
> > > mutex_trylock() with mutex_lock_killable().
> > 
> > So what exactly guanratees SCHED_IDLE running while other high priority
> > processes keep preempting it while it holds the oom lock? Not everybody
> > is inside the allocation path to get out of the way.
> 
> I think that that is a too much worry. If you worry such possibility,
> current assumption
> 
> 	/*
> 	 * Acquire the oom lock.  If that fails, somebody else is
> 	 * making progress for us.
> 	 */
> 
> is horribly broken. Also, high priority threads keep preempting will
> prevent low priority threads from reaching __alloc_pages_may_oom()
> because preemption will occur not only during a low priority thread is
> holding oom_lock but also while oom_lock is not held. We can try to
> reduce preemption while oom_lock is held by scattering around
> preempt_disable()/preempt_enable(). But you said you don't want to
> disable preemption during OOM kill operation when I proposed scattering
> patch, didn't you?
> 
> So, I think that worrying about high priority threads preventing the low
> priority thread with oom_lock held is too much. Preventing high priority
> threads waiting for oom_lock from disturbing the low priority thread with
> oom_lock held by wasting CPU resource will be sufficient.

In other words this is just to paper over an overloaded allocation path
close to OOM. Your changelog is really misleading in that direction
IMHO. I have to think some more about using the full lock rather than
the trylock, because taking the try lock is somehow easier.

> If you don't like it, the only way will be to offload to a dedicated
> kernel thread (like the OOM reaper) so that allocating threads are
> no longer blocked by oom_lock. That's a big change.

This doesn't solve anything as all the tasks would have to somehow wait
for the kernel thread to do its stuff.
 
> > > By replacing mutex_trylock() with mutex_lock_killable(), it might prevent
> > > the OOM reaper from start reaping immediately. Thus, remove mutex_lock() from
> > > the OOM reaper.
> > 
> > oom_lock shouldn't be necessary in oom_reaper anymore and that is worth
> > a separate patch.
> 
> I'll propose as a separate patch after we apply "mm, oom:
> task_will_free_mem(current) should ignore MMF_OOM_SKIP for once." or
> we call __alloc_pages_slowpath() with oom_lock held.
> 
> >  
> > > By removing mutex_lock() from the OOM reaper, the race window of needlessly
> > > selecting next OOM victim becomes wider, for the last second allocation
> > > attempt no longer waits for the OOM reaper. Thus, do the really last
> > > allocation attempt after selecting an OOM victim using the same watermark.
> > > 
> > > Can we go with this direction?
> > 
> > The patch is just too cluttered. You do not want to use
> > __alloc_pages_slowpath. get_page_from_freelist would be more
> > appropriate. Also doing alloc_pages_before_oomkill two times seems to be
> > excessive.
> 
> This patch is intentionally calling __alloc_pages_slowpath() because
> it handles ALLOC_OOM by calling __gfp_pfmemalloc_flags(). If this patch
> calls only get_page_from_freelist(), we will fail to try ALLOC_OOM before
> calling out_of_memory() (when current thread is selected as OOM victim
> while waiting for oom_lock) and just before sending SIGKILL (when
> task_will_free_mem(current) in out_of_memory() returned false because
> MMF_OOM_SKIP was set before ALLOC_OOM allocation is attempted) unless
> we apply "mm, oom: task_will_free_mem(current) should ignore MMF_OOM_SKIP
> for once.".

Nothing really prevents you from re-evaluating alloc_flags which is
definitely preferred to maintaining an already complext allocator slow
path reentrant safe.
 
> > That being said, make sure you adrress all the concerns brought up by
> > Andrea and Johannes in the above email thread first.
> 
> I don't think there are concerns if we wait for oom_lock.
> The only concern will be do not depend on __GFP_DIRECT_RECLAIM allocation
> while oom_lock is held. Andrea and Johannes, what are your concerns?

Read, what they wrote carefully, take their concerns and argument with
each of them. You cannot simply hand wave them like that.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] mm,oom: Try last second allocation after selecting an OOM victim.
  2017-10-23 11:30                 ` Michal Hocko
@ 2017-10-24 11:24                   ` Tetsuo Handa
  2017-10-24 11:41                     ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2017-10-24 11:24 UTC (permalink / raw)
  To: mhocko
  Cc: aarcange, hannes, akpm, linux-mm, rientjes, mjaggi, mgorman,
	oleg, vdavydov.dev, vbabka

Michal Hocko wrote:
> > > That being said, make sure you adrress all the concerns brought up by
> > > Andrea and Johannes in the above email thread first.
> > 
> > I don't think there are concerns if we wait for oom_lock.
> > The only concern will be do not depend on __GFP_DIRECT_RECLAIM allocation
> > while oom_lock is held. Andrea and Johannes, what are your concerns?
> 
> Read, what they wrote carefully, take their concerns and argument with
> each of them. You cannot simply hand wave them like that.

I'm not hand waving at all. I really can't figure out what are valid concerns.
Below are summary of changes regarding OOM killer serialization.

  As of linux-2.6.11, nothing prevented from concurrently calling out_of_memory().
  TIF_MEMDIE test in select_bad_process() tried to avoid needless OOM killing.
  Thus, it was safe to do __GFP_DIRECT_RECLAIM allocation (apart from which watermark
  should be used) just before calling out_of_memory().

  As of linux-2.6.14, nothing prevented from concurrently calling out_of_memory().
  But unsafe cpuset_lock() call was added to out_of_memory() by ef08e3b4981aebf2
  ("[PATCH] cpusets: confine oom_killer to mem_exclusive cpuset").

  As of linux-2.6.16, cpuset_lock() effectively started acting as today's
  mutex_lock(&oom_lock) by 505970b96e3b7d22 ("[PATCH] cpuset oom lock fix").

  As of linux-2.6.24, cpuset_lock() was removed from out_of_memory() by 3ff566963ce80480
  ("oom: do not take callback_mutex"), and try_set_zone_oom() was added to
  __alloc_pages_may_oom() by ff0ceb9deb6eb017 ("oom: serialize out of memory calls")
  which effectively started acting as a kind of today's mutex_trylock(&oom_lock).

  As of linux-3.19, the ordering of __GFP_FS versus try_set_zone_oom() test was inverted by
  9879de7373fcfb46 ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
  and was repaired by cc87317726f85153 ("mm: page_alloc: revert inadvertent !__GFP_FS retry
  behavior change").

  As of linux-4.2, try_set_zone_oom() was replaced with oom_lock by dc56401fc9f25e8f
  ("mm: oom_kill: simplify OOM killer locking"). At least by this time, it became
  no longer safe to do __GFP_DIRECT_RECLAIM allocation with oom_lock held, didn't it?

  As of linux-4.9, warn_alloc() for reporting allocation stalls was introduced by
  63f53dea0c9866e9 ("mm: warn about allocations which stall for too long"), and we have
  been discussing how to avoid printk() versus oom_lock deadlock since last December (
  http://lkml.kernel.org/r/1481020439-5867-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp )
  by making sure that the thread holding oom_lock can use enough CPU resource and
  the threads not holding oom_lock can refrain from appending to printk() buffer
  and yield enough CPU resource to the thread holding oom_lock.

So, I really cannot figure out why we can compare the concerns of 2.6.11 and 4.14.
Andrea and Johannes, please do come out and explain what your concerns are.

> > > > Below is updated patch. The motivation of this patch is to guarantee that
> > > > the thread (it can be SCHED_IDLE priority) calling out_of_memory() can use
> > > > enough CPU resource by saving CPU resource wasted by threads (they can be
> > > > !SCHED_IDLE priority) waiting for out_of_memory(). Thus, replace
> > > > mutex_trylock() with mutex_lock_killable().
> > > 
> > > So what exactly guanratees SCHED_IDLE running while other high priority
> > > processes keep preempting it while it holds the oom lock? Not everybody
> > > is inside the allocation path to get out of the way.
> > 
> > I think that that is a too much worry. If you worry such possibility,
> > current assumption
> > 
> > 	/*
> > 	 * Acquire the oom lock.  If that fails, somebody else is
> > 	 * making progress for us.
> > 	 */
> > 
> > is horribly broken. Also, high priority threads keep preempting will
> > prevent low priority threads from reaching __alloc_pages_may_oom()
> > because preemption will occur not only during a low priority thread is
> > holding oom_lock but also while oom_lock is not held. We can try to
> > reduce preemption while oom_lock is held by scattering around
> > preempt_disable()/preempt_enable(). But you said you don't want to
> > disable preemption during OOM kill operation when I proposed scattering
> > patch, didn't you?
> > 
> > So, I think that worrying about high priority threads preventing the low
> > priority thread with oom_lock held is too much. Preventing high priority
> > threads waiting for oom_lock from disturbing the low priority thread with
> > oom_lock held by wasting CPU resource will be sufficient.
> 
> In other words this is just to paper over an overloaded allocation path
> close to OOM. Your changelog is really misleading in that direction
> IMHO. I have to think some more about using the full lock rather than
> the trylock, because taking the try lock is somehow easier.

Somehow easier to what? Please don't omit.

I consider that the OOM killer is a safety mechanism in case a system got
overloaded. Therefore, I really hate your comments like "Your system is already
DOSed". It is stupid thing that safety mechanism drives the overloaded system
worse and defunctional when it should rescue.

Current code is somehow easier to OOM lockup due to printk() versus oom_lock
dependency, and I'm proposing a patch for mitigating printk() versus oom_lock
dependency using oom_printk_lock because I can hardly examine OOM related
problems since linux-4.9, and your response was "Hell no!".

> > If you don't like it, the only way will be to offload to a dedicated
> > kernel thread (like the OOM reaper) so that allocating threads are
> > no longer blocked by oom_lock. That's a big change.
> 
> This doesn't solve anything as all the tasks would have to somehow wait
> for the kernel thread to do its stuff.

Which direction are you looking at?

You said "Hell no!" to an attempt which preserves mutex_trylock(&oom_lock).

You said "violent NAK!" to an attempt which effectively replaces
mutex_trylock(&oom_lock) with mutex_lock_killable(&oom_lock) while allow
retrying __GFP_DIRECT_RECLAIM allocation attempt which would more likely
succeed than alloc_pages_before_oomkill(). If their concerns are that we
should retry __GFP_DIRECT_RECLAIM allocation before calling out_of_memory(),
this is a preferred choice (because we can't try __GFP_DIRECT_RECLAIM allocation
with oom_lock held). If their concerns are that we should not delay calling
out_of_memory() too much, alloc_pages_before_oomkill() is a preferred choice.
If their concerns are neither one, please do come out and explain.

You admit that all the threads will have to wait for a thread which does the
OOM-kill operation, and you have even suggested replacing mutex_trylock(&oom_lock)
with mutex_lock_killable(&oom_lock) in the thread above. Despite we have used
mutex_lock(&callback_mutex) for linux-2.6.16 to linux-2.6.23, you don't want to
use mutex_lock_killable(&oom_lock) !?

Are we on the same page that we are seeking for a solution which can allow
a thread doing the OOM-kill operation to use enough CPU resources?

Instead of postponing with comments like "Hell no!" and "violent NAK!", please
explain us your approach and show us your patches which we can apply _now_.
Are you planning to remove oom_lock and go back to linux-2.6.11 so that we
don't need to worry about printk() versus oom_lock deadlock? Are you planning
to completely rewrite the page allocator so that __GFP_DIRECT_RECLAIM
allocating threads are throttled before calling __alloc_pages_may_oom() ?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] mm,oom: Try last second allocation after selecting an OOM victim.
  2017-10-24 11:24                   ` Tetsuo Handa
@ 2017-10-24 11:41                     ` Michal Hocko
  2017-10-25 10:48                       ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-10-24 11:41 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: aarcange, hannes, akpm, linux-mm, rientjes, mjaggi, mgorman,
	oleg, vdavydov.dev, vbabka

On Tue 24-10-17 20:24:46, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > > So, I think that worrying about high priority threads preventing the low
> > > priority thread with oom_lock held is too much. Preventing high priority
> > > threads waiting for oom_lock from disturbing the low priority thread with
> > > oom_lock held by wasting CPU resource will be sufficient.
> > 
> > In other words this is just to paper over an overloaded allocation path
> > close to OOM. Your changelog is really misleading in that direction
> > IMHO. I have to think some more about using the full lock rather than
> > the trylock, because taking the try lock is somehow easier.
> 
> Somehow easier to what? Please don't omit.

To back off on the oom races.

> I consider that the OOM killer is a safety mechanism in case a system got
> overloaded. Therefore, I really hate your comments like "Your system is already
> DOSed". It is stupid thing that safety mechanism drives the overloaded system
> worse and defunctional when it should rescue.

The OOM killer is the last hand break. At the time you hit the OOM
condition your system is usually hard to use anyway. And that is why I
do care to make this path deadlock free. I have mentioned multiple times
that I find real life triggers much more important than artificial DoS
like workloads which make your system unsuable long before you hit OOM
killer.

> Current code is somehow easier to OOM lockup due to printk() versus oom_lock
> dependency, and I'm proposing a patch for mitigating printk() versus oom_lock
> dependency using oom_printk_lock because I can hardly examine OOM related
> problems since linux-4.9, and your response was "Hell no!".

Because you are repeatedly proposing a paper over rather than to attempt
something resembling a solution. And this is highly annoying. I've
already said that I am willing to sacrifice the stall warning rather
than fiddle with random locks put here and there.

> > > If you don't like it, the only way will be to offload to a dedicated
> > > kernel thread (like the OOM reaper) so that allocating threads are
> > > no longer blocked by oom_lock. That's a big change.
> > 
> > This doesn't solve anything as all the tasks would have to somehow wait
> > for the kernel thread to do its stuff.
> 
> Which direction are you looking at?

I am sorry but I would only repeat something that has been said many
times already. You keep hammering this particular issue like it was the
number one problem in the MM code. We have many much more important
issues to deal with. While it is interesting to make kernel more robust
under OOM conditions it doesn't make much sense to overcomplicate the
code for unrealistic workloads. If we have problems with real life
scenarios then let's fix them, by all means.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] mm,oom: Try last second allocation after selecting an OOM victim.
  2017-10-24 11:41                     ` Michal Hocko
@ 2017-10-25 10:48                       ` Tetsuo Handa
  2017-10-25 11:09                         ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2017-10-25 10:48 UTC (permalink / raw)
  To: mhocko
  Cc: aarcange, hannes, akpm, linux-mm, rientjes, mjaggi, mgorman,
	oleg, vdavydov.dev, vbabka

Michal Hocko wrote:
> On Tue 24-10-17 20:24:46, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > > So, I think that worrying about high priority threads preventing the low
> > > > priority thread with oom_lock held is too much. Preventing high priority
> > > > threads waiting for oom_lock from disturbing the low priority thread with
> > > > oom_lock held by wasting CPU resource will be sufficient.
> > > 
> > > In other words this is just to paper over an overloaded allocation path
> > > close to OOM. Your changelog is really misleading in that direction
> > > IMHO. I have to think some more about using the full lock rather than
> > > the trylock, because taking the try lock is somehow easier.
> > 
> > Somehow easier to what? Please don't omit.
> 
> To back off on the oom races.
> 

But that choice is breaking the

        /*
         * Acquire the oom lock.  If that fails, somebody else is
         * making progress for us.
         */

assumption due to warn_alloc() since Linux 4.9 unless at least
below patch is applied.

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4002,7 +4002,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 		goto nopage;
 
 	/* Make sure we know about allocations which stall for too long */
-	if (time_after(jiffies, alloc_start + stall_timeout)) {
+	if (time_after(jiffies, alloc_start + stall_timeout) &&
+	    !mutex_is_locked(&oom_lock)) {
 		warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
 			"page allocation stalls for %ums, order:%u",
 			jiffies_to_msecs(jiffies-alloc_start), order);
--

> > I consider that the OOM killer is a safety mechanism in case a system got
> > overloaded. Therefore, I really hate your comments like "Your system is already
> > DOSed". It is stupid thing that safety mechanism drives the overloaded system
> > worse and defunctional when it should rescue.
> 
> The OOM killer is the last hand break. At the time you hit the OOM
> condition your system is usually hard to use anyway. And that is why I
> do care to make this path deadlock free. I have mentioned multiple times
> that I find real life triggers much more important than artificial DoS
> like workloads which make your system unsuable long before you hit OOM
> killer.

Unable to invoke the OOM killer (i.e. OOM lockup) is worse than hand break injury.

If you do care to make this path deadlock free, you had better stop depending on
mutex_trylock(&oom_lock). Not only printk() from oom_kill_process() can trigger
deadlock due to console_sem versus oom_lock dependency but also
schedule_timeout_killable(1) from out_of_memory() can also trigger deadlock
due to SCHED_IDLE versus !SCHED_IDLE dependency (like I suggested at 
http://lkml.kernel.org/r/201603031941.CBC81272.OtLMSFVOFJHOFQ@I-love.SAKURA.ne.jp ).

> 
> > Current code is somehow easier to OOM lockup due to printk() versus oom_lock
> > dependency, and I'm proposing a patch for mitigating printk() versus oom_lock
> > dependency using oom_printk_lock because I can hardly examine OOM related
> > problems since linux-4.9, and your response was "Hell no!".
> 
> Because you are repeatedly proposing a paper over rather than to attempt
> something resembling a solution. And this is highly annoying. I've
> already said that I am willing to sacrifice the stall warning rather
> than fiddle with random locks put here and there.

I've already said that I do welcome removing the stall warning if it is
replaced with a better approach. If there is no acceptable alternative now,
I do want to avoid "warn_alloc() without oom_lock held" versus
"oom_kill_process() with oom_lock held" dependency. And I'm waiting for your
answer in that thread.

> 
> > > > If you don't like it, the only way will be to offload to a dedicated
> > > > kernel thread (like the OOM reaper) so that allocating threads are
> > > > no longer blocked by oom_lock. That's a big change.
> > > 
> > > This doesn't solve anything as all the tasks would have to somehow wait
> > > for the kernel thread to do its stuff.
> > 
> > Which direction are you looking at?
> 
> I am sorry but I would only repeat something that has been said many
> times already. You keep hammering this particular issue like it was the
> number one problem in the MM code. We have many much more important
> issues to deal with. While it is interesting to make kernel more robust
> under OOM conditions it doesn't make much sense to overcomplicate the
> code for unrealistic workloads. If we have problems with real life
> scenarios then let's fix them, by all means.

You are misunderstanding again. My goal is not to put systems under unrealistic
workloads. My goal is to provide better debugging information when a memory
allocation related problem occurred (such as unexpected infinite loop or too
long waiting). Since there is no asynchronous watchdog, we have to test whether
it is possible to hit corner cases (e.g. infinite too_many_isolated() loop in
shrink_inactive_list()) using a kind of fuzzing approach. Creating hundreds of
processes and let each child to free memory bit by bit is needed for testing
almost OOM situations. What I'm doing is handmade OOM fuzzing.

The warn_alloc() for reporting allocation stalls (which is synchronous mechanism)
is failing to provide information other than "something is going wrong". And so far
nobody (except I) is interested in asynchronous mechanism which would allow
triggering more actions such as turning on vmscan tracepoints at
http://lkml.kernel.org/r/20171024200639.2pyxkw2cucwxrtlb@dhcp22.suse.cz .

Since that system has 32GB memory and 64GB swap, and chrome browser which
might eat a lot of memory is running, and 3.5GB of swap is in use, I suspect
the possibility that disk I/O was not done smoothly because any allocation
which allocates from Node 0 Normal was already hitting min watermark.
But I'm not sure whether vmscan tracepoints can provide enough information, nor
whether it is possible to save the information to files under OOM situation. If
printk() were used for dumping the information using serial console or netconsole,
we would want to avoid mixing the output with the stall warning and/or the OOM
killer messages.

Despite you have said

  So let's agree to disagree about importance of the reliability
  warn_alloc. I see it as an improvement which doesn't really have to be
  perfect.

at https://patchwork.kernel.org/patch/9381891/ , can we agree with killing
the synchronous allocation stall warning messages and start seeking for
asynchronous approach?

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3868,8 +3868,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	enum compact_result compact_result;
 	int compaction_retries;
 	int no_progress_loops;
-	unsigned long alloc_start = jiffies;
-	unsigned int stall_timeout = 10 * HZ;
 	unsigned int cpuset_mems_cookie;
 	int reserve_flags;
 
@@ -4001,14 +3999,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	if (!can_direct_reclaim)
 		goto nopage;
 
-	/* Make sure we know about allocations which stall for too long */
-	if (time_after(jiffies, alloc_start + stall_timeout)) {
-		warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
-			"page allocation stalls for %ums, order:%u",
-			jiffies_to_msecs(jiffies-alloc_start), order);
-		stall_timeout += 10 * HZ;
-	}
-
 	/* Avoid recursion of direct reclaim */
 	if (current->flags & PF_MEMALLOC)
 		goto nopage;
-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] mm,oom: Try last second allocation after selecting an OOM victim.
  2017-10-25 10:48                       ` Tetsuo Handa
@ 2017-10-25 11:09                         ` Michal Hocko
  2017-10-25 12:15                           ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-10-25 11:09 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: aarcange, hannes, akpm, linux-mm, rientjes, mjaggi, mgorman,
	oleg, vdavydov.dev, vbabka

On Wed 25-10-17 19:48:09, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > The OOM killer is the last hand break. At the time you hit the OOM
> > condition your system is usually hard to use anyway. And that is why I
> > do care to make this path deadlock free. I have mentioned multiple times
> > that I find real life triggers much more important than artificial DoS
> > like workloads which make your system unsuable long before you hit OOM
> > killer.
> 
> Unable to invoke the OOM killer (i.e. OOM lockup) is worse than hand break injury.
> 
> If you do care to make this path deadlock free, you had better stop depending on
> mutex_trylock(&oom_lock). Not only printk() from oom_kill_process() can trigger
> deadlock due to console_sem versus oom_lock dependency but also

And this means that we have to fix printk. Completely silent oom path is
out of question IMHO

> schedule_timeout_killable(1) from out_of_memory() can also trigger deadlock
> due to SCHED_IDLE versus !SCHED_IDLE dependency (like I suggested at 
> http://lkml.kernel.org/r/201603031941.CBC81272.OtLMSFVOFJHOFQ@I-love.SAKURA.ne.jp ).

You are still missing the point here. You do not really have to sleep to
get preempted by high priority task here. Moreover sleep is done after
we have killed the victim and the reaper can already start tearing down
the memory. If you oversubscribe your system by high priority tasks you
are screwed no matter what.
 
> > > Current code is somehow easier to OOM lockup due to printk() versus oom_lock
> > > dependency, and I'm proposing a patch for mitigating printk() versus oom_lock
> > > dependency using oom_printk_lock because I can hardly examine OOM related
> > > problems since linux-4.9, and your response was "Hell no!".
> > 
> > Because you are repeatedly proposing a paper over rather than to attempt
> > something resembling a solution. And this is highly annoying. I've
> > already said that I am willing to sacrifice the stall warning rather
> > than fiddle with random locks put here and there.
> 
> I've already said that I do welcome removing the stall warning if it is
> replaced with a better approach. If there is no acceptable alternative now,
> I do want to avoid "warn_alloc() without oom_lock held" versus
> "oom_kill_process() with oom_lock held" dependency. And I'm waiting for your
> answer in that thread.

I have already responded. Nagging me further doesn't help.

[...]
> Despite you have said
> 
>   So let's agree to disagree about importance of the reliability
>   warn_alloc. I see it as an improvement which doesn't really have to be
>   perfect.

And I stand by this statement.

> at https://patchwork.kernel.org/patch/9381891/ , can we agree with killing
> the synchronous allocation stall warning messages and start seeking for
> asynchronous approach?

I've already said that I will not oppose removing it if regular
workloads are tripping over it. Johannes had some real world examples
AFAIR but didn't provide any details which we could use for the
changelog. I wouldn't be entirely happy about that but the reality says
that the printk infrastructure is not really prepared for extreme loads.
 
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3868,8 +3868,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	enum compact_result compact_result;
>  	int compaction_retries;
>  	int no_progress_loops;
> -	unsigned long alloc_start = jiffies;
> -	unsigned int stall_timeout = 10 * HZ;
>  	unsigned int cpuset_mems_cookie;
>  	int reserve_flags;
>  
> @@ -4001,14 +3999,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	if (!can_direct_reclaim)
>  		goto nopage;
>  
> -	/* Make sure we know about allocations which stall for too long */
> -	if (time_after(jiffies, alloc_start + stall_timeout)) {
> -		warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
> -			"page allocation stalls for %ums, order:%u",
> -			jiffies_to_msecs(jiffies-alloc_start), order);
> -		stall_timeout += 10 * HZ;
> -	}
> -
>  	/* Avoid recursion of direct reclaim */
>  	if (current->flags & PF_MEMALLOC)
>  		goto nopage;
> -- 

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] mm,oom: Try last second allocation after selecting an OOM victim.
  2017-10-25 11:09                         ` Michal Hocko
@ 2017-10-25 12:15                           ` Tetsuo Handa
  2017-10-25 12:41                             ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2017-10-25 12:15 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: aarcange, akpm, linux-mm, rientjes, mjaggi, mgorman, oleg,
	vdavydov.dev, vbabka

Michal Hocko wrote:
> On Wed 25-10-17 19:48:09, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > The OOM killer is the last hand break. At the time you hit the OOM
> > > condition your system is usually hard to use anyway. And that is why I
> > > do care to make this path deadlock free. I have mentioned multiple times
> > > that I find real life triggers much more important than artificial DoS
> > > like workloads which make your system unsuable long before you hit OOM
> > > killer.
> > 
> > Unable to invoke the OOM killer (i.e. OOM lockup) is worse than hand break injury.
> > 
> > If you do care to make this path deadlock free, you had better stop depending on
> > mutex_trylock(&oom_lock). Not only printk() from oom_kill_process() can trigger
> > deadlock due to console_sem versus oom_lock dependency but also
> 
> And this means that we have to fix printk. Completely silent oom path is
> out of question IMHO

We cannot fix printk() without giving enough CPU resource to printk().

I don't think "Completely silent oom path" can happen, for warn_alloc() is called
again when it is retried. But anyway, let's remove warn_alloc().

> 
> > schedule_timeout_killable(1) from out_of_memory() can also trigger deadlock
> > due to SCHED_IDLE versus !SCHED_IDLE dependency (like I suggested at 
> > http://lkml.kernel.org/r/201603031941.CBC81272.OtLMSFVOFJHOFQ@I-love.SAKURA.ne.jp ).
> 
> You are still missing the point here. You do not really have to sleep to
> get preempted by high priority task here. Moreover sleep is done after
> we have killed the victim and the reaper can already start tearing down
> the memory. If you oversubscribe your system by high priority tasks you
> are screwed no matter what.
>  

Not possible yet. The OOM reaper is waiting for a thread sleeping at
schedule_timeout_killable(1) to release the oom_lock. If this patch
(alloc_pages_before_oomkill() etc.) is applied, the OOM reaper no longer
needs to wait for the oom_lock.

> > Despite you have said
> > 
> >   So let's agree to disagree about importance of the reliability
> >   warn_alloc. I see it as an improvement which doesn't really have to be
> >   perfect.
> 
> And I stand by this statement.
> 
> > at https://patchwork.kernel.org/patch/9381891/ , can we agree with killing
> > the synchronous allocation stall warning messages and start seeking for
> > asynchronous approach?
> 
> I've already said that I will not oppose removing it if regular
> workloads are tripping over it. Johannes had some real world examples
> AFAIR but didn't provide any details which we could use for the
> changelog. I wouldn't be entirely happy about that but the reality says
> that the printk infrastructure is not really prepared for extreme loads.

Yes, Johannes's case is a real world example. OK. Let's remove warn_alloc().
Removing warn_alloc() will also solve Mikulas Patocka's "mm: respect the
__GFP_NOWARN flag when warning about stalls" suggestion.

Johannes, can you provide details? (I wonder whether he can pick up meaningful
lines from "Tons and tons of allocation stall warnings followed by the soft
lock-ups." though...)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] mm,oom: Try last second allocation after selecting an OOM victim.
  2017-10-25 12:15                           ` Tetsuo Handa
@ 2017-10-25 12:41                             ` Michal Hocko
  2017-10-25 14:58                               ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-10-25 12:41 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, aarcange, akpm, linux-mm, rientjes, mjaggi, mgorman,
	oleg, vdavydov.dev, vbabka

On Wed 25-10-17 21:15:24, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 25-10-17 19:48:09, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > [...]
> > > > The OOM killer is the last hand break. At the time you hit the OOM
> > > > condition your system is usually hard to use anyway. And that is why I
> > > > do care to make this path deadlock free. I have mentioned multiple times
> > > > that I find real life triggers much more important than artificial DoS
> > > > like workloads which make your system unsuable long before you hit OOM
> > > > killer.
> > > 
> > > Unable to invoke the OOM killer (i.e. OOM lockup) is worse than hand break injury.
> > > 
> > > If you do care to make this path deadlock free, you had better stop depending on
> > > mutex_trylock(&oom_lock). Not only printk() from oom_kill_process() can trigger
> > > deadlock due to console_sem versus oom_lock dependency but also
> > 
> > And this means that we have to fix printk. Completely silent oom path is
> > out of question IMHO
> 
> We cannot fix printk() without giving enough CPU resource to printk().

This is a separate discussion but having a basically unbound time spent
in printk is simply a no-go.
 
> I don't think "Completely silent oom path" can happen, for warn_alloc() is called
> again when it is retried. But anyway, let's remove warn_alloc().

I mean something else. We simply cannot do the oom killing without
telling userspace about that. And printk is the only API we can use for
that.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] mm,oom: Try last second allocation after selecting an OOM victim.
  2017-10-25 12:41                             ` Michal Hocko
@ 2017-10-25 14:58                               ` Tetsuo Handa
  2017-10-25 15:05                                 ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Tetsuo Handa @ 2017-10-25 14:58 UTC (permalink / raw)
  To: mhocko
  Cc: hannes, aarcange, akpm, linux-mm, rientjes, mjaggi, mgorman,
	oleg, vdavydov.dev, vbabka

Michal Hocko wrote:
> On Wed 25-10-17 21:15:24, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Wed 25-10-17 19:48:09, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > [...]
> > > > > The OOM killer is the last hand break. At the time you hit the OOM
> > > > > condition your system is usually hard to use anyway. And that is why I
> > > > > do care to make this path deadlock free. I have mentioned multiple times
> > > > > that I find real life triggers much more important than artificial DoS
> > > > > like workloads which make your system unsuable long before you hit OOM
> > > > > killer.
> > > > 
> > > > Unable to invoke the OOM killer (i.e. OOM lockup) is worse than hand break injury.
> > > > 
> > > > If you do care to make this path deadlock free, you had better stop depending on
> > > > mutex_trylock(&oom_lock). Not only printk() from oom_kill_process() can trigger
> > > > deadlock due to console_sem versus oom_lock dependency but also
> > > 
> > > And this means that we have to fix printk. Completely silent oom path is
> > > out of question IMHO
> > 
> > We cannot fix printk() without giving enough CPU resource to printk().
> 
> This is a separate discussion but having a basically unbound time spent
> in printk is simply a no-go.
>  
> > I don't think "Completely silent oom path" can happen, for warn_alloc() is called
> > again when it is retried. But anyway, let's remove warn_alloc().
> 
> I mean something else. We simply cannot do the oom killing without
> telling userspace about that. And printk is the only API we can use for
> that.

I thought something like

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3872,6 +3872,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
        unsigned int stall_timeout = 10 * HZ;
        unsigned int cpuset_mems_cookie;
        int reserve_flags;
+       static DEFINE_MUTEX(warn_lock);

        /*
         * In the slowpath, we sanity check order to avoid ever trying to
@@ -4002,11 +4003,15 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
                goto nopage;

        /* Make sure we know about allocations which stall for too long */
-       if (time_after(jiffies, alloc_start + stall_timeout)) {
-               warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
-                       "page allocation stalls for %ums, order:%u",
-                       jiffies_to_msecs(jiffies-alloc_start), order);
-               stall_timeout += 10 * HZ;
+       if (time_after(jiffies, alloc_start + stall_timeout) &&
+           mutex_trylock(&warn_lock)) {
+               if (!mutex_is_locked(&oom_lock)) {
+                       warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
+                                  "page allocation stalls for %ums, order:%u",
+                                  jiffies_to_msecs(jiffies-alloc_start), order);
+                       stall_timeout += 10 * HZ;
+               }
+               mutex_unlock(&warn_lock);
        }

        /* Avoid recursion of direct reclaim */

for isolating the OOM killer messages and the stall warning messages (in order to
break continuation condition in console_unlock()), and

@@ -3294,7 +3294,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
         * Acquire the oom lock.  If that fails, somebody else is
         * making progress for us.
         */
-       if (!mutex_trylock(&oom_lock)) {
+       if (mutex_lock_killable(&oom_lock)) {
                *did_some_progress = 1;
                schedule_timeout_uninterruptible(1);
                return NULL;

for giving printk() enough CPU resource.

What you thought is avoid using printk() from out_of_memory() in case enough
CPU resource is not given, isn't it? Then, that is out of question.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] mm,oom: Try last second allocation after selecting an OOM victim.
  2017-10-25 14:58                               ` Tetsuo Handa
@ 2017-10-25 15:05                                 ` Michal Hocko
  2017-10-25 15:34                                   ` Tetsuo Handa
  0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2017-10-25 15:05 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hannes, aarcange, akpm, linux-mm, rientjes, mjaggi, mgorman,
	oleg, vdavydov.dev, vbabka

On Wed 25-10-17 23:58:33, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 25-10-17 21:15:24, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Wed 25-10-17 19:48:09, Tetsuo Handa wrote:
> > > > > Michal Hocko wrote:
> > > > [...]
> > > > > > The OOM killer is the last hand break. At the time you hit the OOM
> > > > > > condition your system is usually hard to use anyway. And that is why I
> > > > > > do care to make this path deadlock free. I have mentioned multiple times
> > > > > > that I find real life triggers much more important than artificial DoS
> > > > > > like workloads which make your system unsuable long before you hit OOM
> > > > > > killer.
> > > > > 
> > > > > Unable to invoke the OOM killer (i.e. OOM lockup) is worse than hand break injury.
> > > > > 
> > > > > If you do care to make this path deadlock free, you had better stop depending on
> > > > > mutex_trylock(&oom_lock). Not only printk() from oom_kill_process() can trigger
> > > > > deadlock due to console_sem versus oom_lock dependency but also
> > > > 
> > > > And this means that we have to fix printk. Completely silent oom path is
> > > > out of question IMHO
> > > 
> > > We cannot fix printk() without giving enough CPU resource to printk().
> > 
> > This is a separate discussion but having a basically unbound time spent
> > in printk is simply a no-go.
> >  
> > > I don't think "Completely silent oom path" can happen, for warn_alloc() is called
> > > again when it is retried. But anyway, let's remove warn_alloc().
> > 
> > I mean something else. We simply cannot do the oom killing without
> > telling userspace about that. And printk is the only API we can use for
> > that.
> 
> I thought something like
> 
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3872,6 +3872,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>         unsigned int stall_timeout = 10 * HZ;
>         unsigned int cpuset_mems_cookie;
>         int reserve_flags;
> +       static DEFINE_MUTEX(warn_lock);
> 
>         /*
>          * In the slowpath, we sanity check order to avoid ever trying to
> @@ -4002,11 +4003,15 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>                 goto nopage;
> 
>         /* Make sure we know about allocations which stall for too long */
> -       if (time_after(jiffies, alloc_start + stall_timeout)) {
> -               warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
> -                       "page allocation stalls for %ums, order:%u",
> -                       jiffies_to_msecs(jiffies-alloc_start), order);
> -               stall_timeout += 10 * HZ;
> +       if (time_after(jiffies, alloc_start + stall_timeout) &&
> +           mutex_trylock(&warn_lock)) {
> +               if (!mutex_is_locked(&oom_lock)) {

The check for oom_lock just doesn't make any sense. The lock can be take
at any time after the check.

> +                       warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
> +                                  "page allocation stalls for %ums, order:%u",
> +                                  jiffies_to_msecs(jiffies-alloc_start), order);
> +                       stall_timeout += 10 * HZ;
> +               }
> +               mutex_unlock(&warn_lock);
>         }
> 
>         /* Avoid recursion of direct reclaim */
> 
> for isolating the OOM killer messages and the stall warning messages (in order to
> break continuation condition in console_unlock()), and
> 
> @@ -3294,7 +3294,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
>          * Acquire the oom lock.  If that fails, somebody else is
>          * making progress for us.
>          */
> -       if (!mutex_trylock(&oom_lock)) {
> +       if (mutex_lock_killable(&oom_lock)) {
>                 *did_some_progress = 1;
>                 schedule_timeout_uninterruptible(1);
>                 return NULL;
> 
> for giving printk() enough CPU resource.
> 
> What you thought is avoid using printk() from out_of_memory() in case enough
> CPU resource is not given, isn't it? Then, that is out of question.

No I meant that we simply _have to_ use printk from the OOM killer.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] mm,oom: Try last second allocation after selecting an OOM victim.
  2017-10-25 15:05                                 ` Michal Hocko
@ 2017-10-25 15:34                                   ` Tetsuo Handa
  0 siblings, 0 replies; 21+ messages in thread
From: Tetsuo Handa @ 2017-10-25 15:34 UTC (permalink / raw)
  To: mhocko
  Cc: hannes, aarcange, akpm, linux-mm, rientjes, mjaggi, mgorman,
	oleg, vdavydov.dev, vbabka

Michal Hocko wrote:
> On Wed 25-10-17 23:58:33, Tetsuo Handa wrote:
> > I thought something like
> > 
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3872,6 +3872,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> >         unsigned int stall_timeout = 10 * HZ;
> >         unsigned int cpuset_mems_cookie;
> >         int reserve_flags;
> > +       static DEFINE_MUTEX(warn_lock);
> > 
> >         /*
> >          * In the slowpath, we sanity check order to avoid ever trying to
> > @@ -4002,11 +4003,15 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> >                 goto nopage;
> > 
> >         /* Make sure we know about allocations which stall for too long */
> > -       if (time_after(jiffies, alloc_start + stall_timeout)) {
> > -               warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
> > -                       "page allocation stalls for %ums, order:%u",
> > -                       jiffies_to_msecs(jiffies-alloc_start), order);
> > -               stall_timeout += 10 * HZ;
> > +       if (time_after(jiffies, alloc_start + stall_timeout) &&
> > +           mutex_trylock(&warn_lock)) {
> > +               if (!mutex_is_locked(&oom_lock)) {
> 
> The check for oom_lock just doesn't make any sense. The lock can be take
> at any time after the check.

The check for oom_lock is optimistic. If we go pessimistic, we will
need to use oom_printk_lock, but you don't like oom_printk_lock, do you?
Anyway, let's remove warn_alloc().

> 
> > +                       warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
> > +                                  "page allocation stalls for %ums, order:%u",
> > +                                  jiffies_to_msecs(jiffies-alloc_start), order);
> > +                       stall_timeout += 10 * HZ;
> > +               }
> > +               mutex_unlock(&warn_lock);
> >         }
> > 
> >         /* Avoid recursion of direct reclaim */
> > 
> > for isolating the OOM killer messages and the stall warning messages (in order to
> > break continuation condition in console_unlock()), and

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2017-10-25 16:34 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1503577106-9196-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
2017-08-24 12:18 ` [RFC PATCH 2/2] mm,oom: Try last second allocation after selecting an OOM victim Tetsuo Handa
2017-08-24 13:18   ` Michal Hocko
2017-08-24 14:40     ` Tetsuo Handa
2017-08-25  8:00       ` Michal Hocko
2017-09-09  0:55         ` Tetsuo Handa
     [not found]           ` <201710172204.AGG30740.tVHJFFOQLMSFOO@I-love.SAKURA.ne.jp>
2017-10-20 12:40             ` Michal Hocko
2017-10-20 14:18               ` Tetsuo Handa
2017-10-23 11:30                 ` Michal Hocko
2017-10-24 11:24                   ` Tetsuo Handa
2017-10-24 11:41                     ` Michal Hocko
2017-10-25 10:48                       ` Tetsuo Handa
2017-10-25 11:09                         ` Michal Hocko
2017-10-25 12:15                           ` Tetsuo Handa
2017-10-25 12:41                             ` Michal Hocko
2017-10-25 14:58                               ` Tetsuo Handa
2017-10-25 15:05                                 ` Michal Hocko
2017-10-25 15:34                                   ` Tetsuo Handa
2017-08-24 13:03 ` [PATCH 1/2] mm,page_alloc: Don't call __node_reclaim() with oom_lock held Michal Hocko
2017-08-25 20:47 ` Andrew Morton
2017-08-26  1:28   ` Tetsuo Handa
2017-08-27  4:17     ` Tetsuo Handa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).