LKML Archive on lore.kernel.org
* [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM
@ 2019-01-07 14:38 Michal Hocko
  2019-01-07 14:38 ` [PATCH 1/2] mm, oom: marks all killed tasks as oom victims Michal Hocko
                   ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Michal Hocko @ 2019-01-07 14:38 UTC (permalink / raw)
  To: linux-mm; +Cc: Tetsuo Handa, Johannes Weiner, Andrew Morton, LKML

Hi,
I have posted this as an RFC previously [1]. Tetsuo pointed out some
issues with patch 1 which I have hopefully fixed. Other than that,
this is just a rebase on top of the Linus tree.

The original cover:
this is a follow-up to [2], which was nacked mostly because Tetsuo
was able to find a simple workload that can trigger a race where a
no-eligible task is reported without a good reason. I believe patch 2
addresses that issue and we do not have to play dirty games with
throttling just because of the race. I still believe that the patch
proposed in [2] is useful, but that can be addressed later.

This series comprises 2 patches. The first one is something I meant to do
a loooong time ago but never had the time. We need it here to
handle CLONE_VM without CLONE_SIGHAND cases. The second patch closes the
race.

Feedback is appreciated of course.

[1] http://lkml.kernel.org/r/20181022071323.9550-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20181010151135.25766-1-mhocko@kernel.org



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/2] mm, oom: marks all killed tasks as oom victims
  2019-01-07 14:38 [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM Michal Hocko
@ 2019-01-07 14:38 ` Michal Hocko
  2019-01-07 20:58   ` Tetsuo Handa
  2019-01-07 14:38 ` [PATCH 2/2] memcg: do not report racy no-eligible OOM tasks Michal Hocko
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2019-01-07 14:38 UTC (permalink / raw)
  To: linux-mm; +Cc: Tetsuo Handa, Johannes Weiner, Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Historically we have called mark_oom_victim only on the main task
selected as the oom victim, because oom victims have access to memory
reserves and granting that access to all killed tasks could deplete
memory reserves very quickly and cause even larger problems.

Since only partial access to memory reserves is allowed, there is no
longer such a risk and so all tasks killed along with the oom victim
can be considered as well.

The primary motivation for that is that process groups which do not
share signals would behave more like standard thread groups wrt oom
handling (aka tsk_is_oom_victim will work the same way for them).

- Use find_lock_task_mm to stabilize mm as suggested by Tetsuo

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f0e8cd9edb1a..0246c7a4e44e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -892,6 +892,7 @@ static void __oom_kill_process(struct task_struct *victim)
 	 */
 	rcu_read_lock();
 	for_each_process(p) {
+		struct task_struct *t;
 		if (!process_shares_mm(p, mm))
 			continue;
 		if (same_thread_group(p, victim))
@@ -911,6 +912,11 @@ static void __oom_kill_process(struct task_struct *victim)
 		if (unlikely(p->flags & PF_KTHREAD))
 			continue;
 		do_send_sig_info(SIGKILL, SEND_SIG_PRIV, p, PIDTYPE_TGID);
+		t = find_lock_task_mm(p);
+		if (!t)
+			continue;
+		mark_oom_victim(t);
+		task_unlock(t);
 	}
 	rcu_read_unlock();
 
-- 
2.20.1



* [PATCH 2/2] memcg: do not report racy no-eligible OOM tasks
  2019-01-07 14:38 [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM Michal Hocko
  2019-01-07 14:38 ` [PATCH 1/2] mm, oom: marks all killed tasks as oom victims Michal Hocko
@ 2019-01-07 14:38 ` Michal Hocko
  2019-01-07 20:59   ` Tetsuo Handa
  2019-01-08  8:35   ` kbuild test robot
  2019-01-08 14:21 ` [PATCH 3/2] memcg: Facilitate termination of memcg OOM victims Tetsuo Handa
  2019-01-09 11:03 ` [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM Michal Hocko
  3 siblings, 2 replies; 28+ messages in thread
From: Michal Hocko @ 2019-01-07 14:38 UTC (permalink / raw)
  To: linux-mm; +Cc: Tetsuo Handa, Johannes Weiner, Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Tetsuo has reported [1] that a single process group memcg might easily
swamp the log with no-eligible oom victim reports due to a race between
the memcg charge and the oom_reaper.

Thread 1		Thread2				oom_reaper
try_charge		try_charge
			  mem_cgroup_out_of_memory
			    mutex_lock(oom_lock)
  mem_cgroup_out_of_memory
    mutex_lock(oom_lock)
			      out_of_memory
			        select_bad_process
				oom_kill_process(current)
				  wake_oom_reaper
							  oom_reap_task
							  MMF_OOM_SKIP->victim
			    mutex_unlock(oom_lock)
    out_of_memory
      select_bad_process # no task

If Thread1 didn't race, it would bail out of try_charge and force the
charge. We can achieve the same by checking tsk_is_oom_victim inside
the oom_lock and thereby close the race.

[1] http://lkml.kernel.org/r/bb2074c0-34fe-8c2c-1c7d-db71338f1e7f@i-love.sakura.ne.jp
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memcontrol.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index af7f18b32389..90eb2e2093e7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1387,10 +1387,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		.gfp_mask = gfp_mask,
 		.order = order,
 	};
-	bool ret;
+	bool ret = true;
 
 	mutex_lock(&oom_lock);
+
+	/*
+	 * multi-threaded tasks might race with oom_reaper and gain
+	 * MMF_OOM_SKIP before reaching out_of_memory which can lead
+	 * to out_of_memory failure if the task is the last one in
+	 * memcg which would be a false positive failure reported
+	 */
+	if (tsk_is_oom_victim(current))
+		goto unlock;
+
 	ret = out_of_memory(&oc);
+
+unlock:
 	mutex_unlock(&oom_lock);
 	return ret;
 }
-- 
2.20.1



* Re: [PATCH 1/2] mm, oom: marks all killed tasks as oom victims
  2019-01-07 14:38 ` [PATCH 1/2] mm, oom: marks all killed tasks as oom victims Michal Hocko
@ 2019-01-07 20:58   ` Tetsuo Handa
  2019-01-08  8:11     ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Tetsuo Handa @ 2019-01-07 20:58 UTC (permalink / raw)
  To: Michal Hocko, linux-mm; +Cc: Johannes Weiner, Andrew Morton, LKML, Michal Hocko

On 2019/01/07 23:38, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Historically we have called mark_oom_victim only on the main task
> selected as the oom victim, because oom victims have access to memory
> reserves and granting that access to all killed tasks could deplete
> memory reserves very quickly and cause even larger problems.
> 
> Since only partial access to memory reserves is allowed, there is no
> longer such a risk and so all tasks killed along with the oom victim
> can be considered as well.
> 
> The primary motivation for that is that process groups which do not
> share signals would behave more like standard thread groups wrt oom
> handling (aka tsk_is_oom_victim will work the same way for them).
> 
> - Use find_lock_task_mm to stabilize mm as suggested by Tetsuo
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/oom_kill.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index f0e8cd9edb1a..0246c7a4e44e 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -892,6 +892,7 @@ static void __oom_kill_process(struct task_struct *victim)
>  	 */
>  	rcu_read_lock();
>  	for_each_process(p) {
> +		struct task_struct *t;
>  		if (!process_shares_mm(p, mm))
>  			continue;
>  		if (same_thread_group(p, victim))
> @@ -911,6 +912,11 @@ static void __oom_kill_process(struct task_struct *victim)
>  		if (unlikely(p->flags & PF_KTHREAD))
>  			continue;
>  		do_send_sig_info(SIGKILL, SEND_SIG_PRIV, p, PIDTYPE_TGID);
> +		t = find_lock_task_mm(p);
> +		if (!t)
> +			continue;
> +		mark_oom_victim(t);
> +		task_unlock(t);

Thank you for updating this patch. This patch is correct from the point of
view of avoiding the TIF_MEMDIE race. But if I recall correctly, the reason we
did not do this was to avoid depleting memory reserves. And we still grant
full access to memory reserves in the CONFIG_MMU=n case. Shouldn't the changelog
mention the CONFIG_MMU=n case?

>  	}
>  	rcu_read_unlock();
>  
> 



* Re: [PATCH 2/2] memcg: do not report racy no-eligible OOM tasks
  2019-01-07 14:38 ` [PATCH 2/2] memcg: do not report racy no-eligible OOM tasks Michal Hocko
@ 2019-01-07 20:59   ` Tetsuo Handa
  2019-01-08  8:14     ` Michal Hocko
  2019-01-08  8:35   ` kbuild test robot
  1 sibling, 1 reply; 28+ messages in thread
From: Tetsuo Handa @ 2019-01-07 20:59 UTC (permalink / raw)
  To: Michal Hocko, linux-mm; +Cc: Johannes Weiner, Andrew Morton, LKML, Michal Hocko

On 2019/01/07 23:38, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Tetsuo has reported [1] that a single process group memcg might easily
> swamp the log with no-eligible oom victim reports due to a race between
> the memcg charge and the oom_reaper.

This explanation is outdated. I reported that one memcg OOM killer can
kill all processes in that memcg. I expect the changelog to be updated.

> 
> Thread 1		Thread2				oom_reaper
> try_charge		try_charge
> 			  mem_cgroup_out_of_memory
> 			    mutex_lock(oom_lock)
>   mem_cgroup_out_of_memory
>     mutex_lock(oom_lock)
> 			      out_of_memory
> 			        select_bad_process
> 				oom_kill_process(current)
> 				  wake_oom_reaper
> 							  oom_reap_task
> 							  MMF_OOM_SKIP->victim
> 			    mutex_unlock(oom_lock)
>     out_of_memory
>       select_bad_process # no task
> 
> If Thread1 didn't race it would bail out from try_charge and force the
> charge. We can achieve the same by checking tsk_is_oom_victim inside
> the oom_lock and therefore close the race.
> 
> [1] http://lkml.kernel.org/r/bb2074c0-34fe-8c2c-1c7d-db71338f1e7f@i-love.sakura.ne.jp
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/memcontrol.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index af7f18b32389..90eb2e2093e7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1387,10 +1387,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		.gfp_mask = gfp_mask,
>  		.order = order,
>  	};
> -	bool ret;
> +	bool ret = true;
>  
>  	mutex_lock(&oom_lock);

And because of "[PATCH 1/2] mm, oom: marks all killed tasks as oom
victims", mark_oom_victim() will be called on the current thread even if
we used mutex_lock_killable(&oom_lock) here, like you said:

  mutex_lock_killable would take care of exiting task already. I would
  then still prefer to check for mark_oom_victim because that is not racy
  with the exit path clearing signals. I can update my patch to use
  _killable lock variant if we are really going with the memcg specific
  fix.

If the current thread is not yet killed by the OOM killer but can terminate
without invoking the OOM killer, using mutex_lock_killable(&oom_lock) here
saves some processes. What is the race you are referring to by "racy with the
exit path clearing signals"?

> +
> +	/*
> +	 * multi-threaded tasks might race with oom_reaper and gain
> +	 * MMF_OOM_SKIP before reaching out_of_memory which can lead
> +	 * to out_of_memory failure if the task is the last one in
> +	 * memcg which would be a false positive failure reported
> +	 */

Not only the out_of_memory() failure. The current thread also needlessly tries
to select the next OOM victim; the out_of_memory() failure is nothing but the
result of the no-eligible-candidate case.

> +	if (tsk_is_oom_victim(current))
> +		goto unlock;
> +
>  	ret = out_of_memory(&oc);
> +
> +unlock:
>  	mutex_unlock(&oom_lock);
>  	return ret;
>  }
> 



* Re: [PATCH 1/2] mm, oom: marks all killed tasks as oom victims
  2019-01-07 20:58   ` Tetsuo Handa
@ 2019-01-08  8:11     ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2019-01-08  8:11 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, Johannes Weiner, Andrew Morton, LKML

On Tue 08-01-19 05:58:41, Tetsuo Handa wrote:
> On 2019/01/07 23:38, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Historically we have called mark_oom_victim only on the main task
> > selected as the oom victim, because oom victims have access to memory
> > reserves and granting that access to all killed tasks could deplete
> > memory reserves very quickly and cause even larger problems.
> > 
> > Since only partial access to memory reserves is allowed, there is no
> > longer such a risk and so all tasks killed along with the oom victim
> > can be considered as well.
> > 
> > The primary motivation for that is that process groups which do not
> > share signals would behave more like standard thread groups wrt oom
> > handling (aka tsk_is_oom_victim will work the same way for them).
> > 
> > - Use find_lock_task_mm to stabilize mm as suggested by Tetsuo
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/oom_kill.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index f0e8cd9edb1a..0246c7a4e44e 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -892,6 +892,7 @@ static void __oom_kill_process(struct task_struct *victim)
> >  	 */
> >  	rcu_read_lock();
> >  	for_each_process(p) {
> > +		struct task_struct *t;
> >  		if (!process_shares_mm(p, mm))
> >  			continue;
> >  		if (same_thread_group(p, victim))
> > @@ -911,6 +912,11 @@ static void __oom_kill_process(struct task_struct *victim)
> >  		if (unlikely(p->flags & PF_KTHREAD))
> >  			continue;
> >  		do_send_sig_info(SIGKILL, SEND_SIG_PRIV, p, PIDTYPE_TGID);
> > +		t = find_lock_task_mm(p);
> > +		if (!t)
> > +			continue;
> > +		mark_oom_victim(t);
> > +		task_unlock(t);
> 
> Thank you for updating this patch. This patch is correct from the point of
> view of avoiding the TIF_MEMDIE race. But if I recall correctly, the reason we
> did not do this was to avoid depleting memory reserves. And we still grant
> full access to memory reserves in the CONFIG_MMU=n case. Shouldn't the changelog
> mention the CONFIG_MMU=n case?

Like so many times before. Does nommu matter in this context at all? You
keep bringing it up without actually trying to understand that nommu is
so special that reserves for those architectures are of very limited
use. I do not really see much point mentioning nommu in every oom patch.

Or do you know of a nommu oom killer bug out there? I would be more than
curious. Seriously.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] memcg: do not report racy no-eligible OOM tasks
  2019-01-07 20:59   ` Tetsuo Handa
@ 2019-01-08  8:14     ` Michal Hocko
  2019-01-08 10:39       ` Tetsuo Handa
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2019-01-08  8:14 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, Johannes Weiner, Andrew Morton, LKML

On Tue 08-01-19 05:59:49, Tetsuo Handa wrote:
> On 2019/01/07 23:38, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Tetsuo has reported [1] that a single process group memcg might easily
> > swamp the log with no-eligible oom victim reports due to a race between
> > the memcg charge and the oom_reaper.
> 
> This explanation is outdated. I reported that one memcg OOM killer can
> kill all processes in that memcg. I expect the changelog to be updated.

I am open to refinements. Any specific wording you have in mind?

> > 
> > Thread 1		Thread2				oom_reaper
> > try_charge		try_charge
> > 			  mem_cgroup_out_of_memory
> > 			    mutex_lock(oom_lock)
> >   mem_cgroup_out_of_memory
> >     mutex_lock(oom_lock)
> > 			      out_of_memory
> > 			        select_bad_process
> > 				oom_kill_process(current)
> > 				  wake_oom_reaper
> > 							  oom_reap_task
> > 							  MMF_OOM_SKIP->victim
> > 			    mutex_unlock(oom_lock)
> >     out_of_memory
> >       select_bad_process # no task
> > 
> > If Thread1 didn't race it would bail out from try_charge and force the
> > charge. We can achieve the same by checking tsk_is_oom_victim inside
> > the oom_lock and therefore close the race.
> > 
> > [1] http://lkml.kernel.org/r/bb2074c0-34fe-8c2c-1c7d-db71338f1e7f@i-love.sakura.ne.jp
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/memcontrol.c | 14 +++++++++++++-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index af7f18b32389..90eb2e2093e7 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1387,10 +1387,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  		.gfp_mask = gfp_mask,
> >  		.order = order,
> >  	};
> > -	bool ret;
> > +	bool ret = true;
> >  
> >  	mutex_lock(&oom_lock);
> 
> And because of "[PATCH 1/2] mm, oom: marks all killed tasks as oom
> victims", mark_oom_victim() will be called on current thread even if
> we used mutex_lock_killable(&oom_lock) here, like you said
> 
>   mutex_lock_killable would take care of exiting task already. I would
>   then still prefer to check for mark_oom_victim because that is not racy
>   with the exit path clearing signals. I can update my patch to use
>   _killable lock variant if we are really going with the memcg specific
>   fix.
> 
> . If current thread is not yet killed by the OOM killer but can terminate
> without invoking the OOM killer, using mutex_lock_killable(&oom_lock) here
> saves some processes. What is the race you are referring to by "racy with the
> exit path clearing signals" ?

This is unrelated to the patch.
 
> > +
> > +	/*
> > +	 * multi-threaded tasks might race with oom_reaper and gain
> > +	 * MMF_OOM_SKIP before reaching out_of_memory which can lead
> > +	 * to out_of_memory failure if the task is the last one in
> > +	 * memcg which would be a false possitive failure reported
> > +	 */
> 
> Not only the out_of_memory() failure. The current thread also needlessly tries
> to select the next OOM victim; the out_of_memory() failure is nothing but the
> result of the no-eligible-candidate case.

So?

Let me ask again. Does this solve the issue you are seeing? I really do
not want to end in nit picking endless thread again and would like to
move on.

> > +	if (tsk_is_oom_victim(current))
> > +		goto unlock;
> > +
> >  	ret = out_of_memory(&oc);
> > +
> > +unlock:
> >  	mutex_unlock(&oom_lock);
> >  	return ret;
> >  }
> > 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] memcg: do not report racy no-eligible OOM tasks
  2019-01-07 14:38 ` [PATCH 2/2] memcg: do not report racy no-eligible OOM tasks Michal Hocko
  2019-01-07 20:59   ` Tetsuo Handa
@ 2019-01-08  8:35   ` kbuild test robot
  2019-01-08  9:39     ` Michal Hocko
  1 sibling, 1 reply; 28+ messages in thread
From: kbuild test robot @ 2019-01-08  8:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kbuild-all, linux-mm, Tetsuo Handa, Johannes Weiner,
	Andrew Morton, LKML, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 8106 bytes --]

Hi Michal,

I love your patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v5.0-rc1 next-20190108]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Michal-Hocko/oom-memcg-do-not-report-racy-no-eligible-OOM/20190108-092805
config: x86_64-allmodconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   include/linux/rcupdate.h:659:9: warning: context imbalance in 'find_lock_task_mm' - wrong count at exit
   include/linux/sched/mm.h:141:37: warning: dereference of noderef expression
   mm/oom_kill.c:225:28: warning: context imbalance in 'oom_badness' - unexpected unlock
   mm/oom_kill.c:406:9: warning: context imbalance in 'dump_tasks' - different lock contexts for basic block
>> mm/oom_kill.c:918:17: warning: context imbalance in '__oom_kill_process' - unexpected unlock

vim +/__oom_kill_process +918 mm/oom_kill.c

1af8bb43 Michal Hocko          2016-07-28  845  
5989ad7b Roman Gushchin        2018-08-21  846  static void __oom_kill_process(struct task_struct *victim)
^1da177e Linus Torvalds        2005-04-16  847  {
5989ad7b Roman Gushchin        2018-08-21  848  	struct task_struct *p;
647f2bdf David Rientjes        2012-03-21  849  	struct mm_struct *mm;
bb29902a Tetsuo Handa          2016-03-25  850  	bool can_oom_reap = true;
^1da177e Linus Torvalds        2005-04-16  851  
6b0c81b3 David Rientjes        2012-07-31  852  	p = find_lock_task_mm(victim);
6b0c81b3 David Rientjes        2012-07-31  853  	if (!p) {
6b0c81b3 David Rientjes        2012-07-31  854  		put_task_struct(victim);
647f2bdf David Rientjes        2012-03-21  855  		return;
6b0c81b3 David Rientjes        2012-07-31  856  	} else if (victim != p) {
6b0c81b3 David Rientjes        2012-07-31  857  		get_task_struct(p);
6b0c81b3 David Rientjes        2012-07-31  858  		put_task_struct(victim);
6b0c81b3 David Rientjes        2012-07-31  859  		victim = p;
6b0c81b3 David Rientjes        2012-07-31  860  	}
647f2bdf David Rientjes        2012-03-21  861  
880b7689 Tetsuo Handa          2015-11-05  862  	/* Get a reference to safely compare mm after task_unlock(victim) */
647f2bdf David Rientjes        2012-03-21  863  	mm = victim->mm;
f1f10076 Vegard Nossum         2017-02-27  864  	mmgrab(mm);
8e675f7a Konstantin Khlebnikov 2017-07-06  865  
8e675f7a Konstantin Khlebnikov 2017-07-06  866  	/* Raise event before sending signal: task reaper must see this */
8e675f7a Konstantin Khlebnikov 2017-07-06  867  	count_vm_event(OOM_KILL);
fe6bdfc8 Roman Gushchin        2018-06-14  868  	memcg_memory_event_mm(mm, MEMCG_OOM_KILL);
8e675f7a Konstantin Khlebnikov 2017-07-06  869  
426fb5e7 Tetsuo Handa          2015-11-05  870  	/*
cd04ae1e Michal Hocko          2017-09-06  871  	 * We should send SIGKILL before granting access to memory reserves
cd04ae1e Michal Hocko          2017-09-06  872  	 * in order to prevent the OOM victim from depleting the memory
cd04ae1e Michal Hocko          2017-09-06  873  	 * reserves from the user space under its control.
426fb5e7 Tetsuo Handa          2015-11-05  874  	 */
079b22dc Eric W. Biederman     2018-09-03  875  	do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
16e95196 Johannes Weiner       2015-06-24  876  	mark_oom_victim(victim);
eca56ff9 Jerome Marchand       2016-01-14  877  	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
647f2bdf David Rientjes        2012-03-21  878  		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
647f2bdf David Rientjes        2012-03-21  879  		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
eca56ff9 Jerome Marchand       2016-01-14  880  		K(get_mm_counter(victim->mm, MM_FILEPAGES)),
eca56ff9 Jerome Marchand       2016-01-14  881  		K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
647f2bdf David Rientjes        2012-03-21  882  	task_unlock(victim);
647f2bdf David Rientjes        2012-03-21  883  
647f2bdf David Rientjes        2012-03-21  884  	/*
647f2bdf David Rientjes        2012-03-21  885  	 * Kill all user processes sharing victim->mm in other thread groups, if
647f2bdf David Rientjes        2012-03-21  886  	 * any.  They don't get access to memory reserves, though, to avoid
647f2bdf David Rientjes        2012-03-21  887  	 * depletion of all memory.  This prevents mm->mmap_sem livelock when an
647f2bdf David Rientjes        2012-03-21  888  	 * oom killed thread cannot exit because it requires the semaphore and
647f2bdf David Rientjes        2012-03-21  889  	 * its contended by another thread trying to allocate memory itself.
647f2bdf David Rientjes        2012-03-21  890  	 * That thread will now get access to memory reserves since it has a
647f2bdf David Rientjes        2012-03-21  891  	 * pending fatal signal.
647f2bdf David Rientjes        2012-03-21  892  	 */
4d4048be Oleg Nesterov         2014-01-21  893  	rcu_read_lock();
c319025a Oleg Nesterov         2015-11-05  894  	for_each_process(p) {
00508538 Michal Hocko          2019-01-07  895  		struct task_struct *t;
4d7b3394 Oleg Nesterov         2015-11-05  896  		if (!process_shares_mm(p, mm))
c319025a Oleg Nesterov         2015-11-05  897  			continue;
c319025a Oleg Nesterov         2015-11-05  898  		if (same_thread_group(p, victim))
c319025a Oleg Nesterov         2015-11-05  899  			continue;
1b51e65e Michal Hocko          2016-10-07  900  		if (is_global_init(p)) {
aac45363 Michal Hocko          2016-03-25  901  			can_oom_reap = false;
862e3073 Michal Hocko          2016-10-07  902  			set_bit(MMF_OOM_SKIP, &mm->flags);
a373966d Michal Hocko          2016-07-28  903  			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
a373966d Michal Hocko          2016-07-28  904  					task_pid_nr(victim), victim->comm,
a373966d Michal Hocko          2016-07-28  905  					task_pid_nr(p), p->comm);
647f2bdf David Rientjes        2012-03-21  906  			continue;
aac45363 Michal Hocko          2016-03-25  907  		}
1b51e65e Michal Hocko          2016-10-07  908  		/*
1b51e65e Michal Hocko          2016-10-07  909  		 * No use_mm() user needs to read from the userspace so we are
1b51e65e Michal Hocko          2016-10-07  910  		 * ok to reap it.
1b51e65e Michal Hocko          2016-10-07  911  		 */
1b51e65e Michal Hocko          2016-10-07  912  		if (unlikely(p->flags & PF_KTHREAD))
1b51e65e Michal Hocko          2016-10-07  913  			continue;
079b22dc Eric W. Biederman     2018-09-03  914  		do_send_sig_info(SIGKILL, SEND_SIG_PRIV, p, PIDTYPE_TGID);
00508538 Michal Hocko          2019-01-07  915  		t = find_lock_task_mm(p);
00508538 Michal Hocko          2019-01-07  916  		if (!t)
00508538 Michal Hocko          2019-01-07  917  			continue;
00508538 Michal Hocko          2019-01-07 @918  		mark_oom_victim(t);
00508538 Michal Hocko          2019-01-07  919  		task_unlock(t);
647f2bdf David Rientjes        2012-03-21  920  	}
6b0c81b3 David Rientjes        2012-07-31  921  	rcu_read_unlock();
647f2bdf David Rientjes        2012-03-21  922  
aac45363 Michal Hocko          2016-03-25  923  	if (can_oom_reap)
36324a99 Michal Hocko          2016-03-25  924  		wake_oom_reaper(victim);
aac45363 Michal Hocko          2016-03-25  925  
880b7689 Tetsuo Handa          2015-11-05  926  	mmdrop(mm);
6b0c81b3 David Rientjes        2012-07-31  927  	put_task_struct(victim);
^1da177e Linus Torvalds        2005-04-16  928  }
647f2bdf David Rientjes        2012-03-21  929  #undef K
^1da177e Linus Torvalds        2005-04-16  930  

:::::: The code at line 918 was first introduced by commit
:::::: 00508538cb045f28a2f60e5d2dff98b77b0da725 mm, oom: marks all killed tasks as oom victims

:::::: TO: Michal Hocko <mhocko@suse.com>
:::::: CC: 0day robot <lkp@intel.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 67238 bytes --]


* Re: [PATCH 2/2] memcg: do not report racy no-eligible OOM tasks
  2019-01-08  8:35   ` kbuild test robot
@ 2019-01-08  9:39     ` Michal Hocko
  2019-01-11  0:23       ` [kbuild-all] " Rong Chen
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2019-01-08  9:39 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, linux-mm, Tetsuo Handa, Johannes Weiner, Andrew Morton, LKML

On Tue 08-01-19 16:35:42, kbuild test robot wrote:
[...]
> All warnings (new ones prefixed by >>):
> 
>    include/linux/rcupdate.h:659:9: warning: context imbalance in 'find_lock_task_mm' - wrong count at exit
>    include/linux/sched/mm.h:141:37: warning: dereference of noderef expression
>    mm/oom_kill.c:225:28: warning: context imbalance in 'oom_badness' - unexpected unlock
>    mm/oom_kill.c:406:9: warning: context imbalance in 'dump_tasks' - different lock contexts for basic block
> >> mm/oom_kill.c:918:17: warning: context imbalance in '__oom_kill_process' - unexpected unlock

What exactly does this warning say? I do not see anything wrong with
the code. find_lock_task_mm returns a locked task when t != NULL and
mark_oom_victim doesn't do anything about the locking. Am I missing
something, or is the warning just confused?

[...]
> 00508538 Michal Hocko          2019-01-07  915  		t = find_lock_task_mm(p);
> 00508538 Michal Hocko          2019-01-07  916  		if (!t)
> 00508538 Michal Hocko          2019-01-07  917  			continue;
> 00508538 Michal Hocko          2019-01-07 @918  		mark_oom_victim(t);
> 00508538 Michal Hocko          2019-01-07  919  		task_unlock(t);
> 647f2bdf David Rientjes        2012-03-21  920  	}

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] memcg: do not report racy no-eligible OOM tasks
  2019-01-08  8:14     ` Michal Hocko
@ 2019-01-08 10:39       ` Tetsuo Handa
  2019-01-08 11:46         ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Tetsuo Handa @ 2019-01-08 10:39 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Johannes Weiner, Andrew Morton, LKML

On 2019/01/08 17:14, Michal Hocko wrote:
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index af7f18b32389..90eb2e2093e7 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -1387,10 +1387,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>>  		.gfp_mask = gfp_mask,
>>>  		.order = order,
>>>  	};
>>> -	bool ret;
>>> +	bool ret = true;
>>>  
>>>  	mutex_lock(&oom_lock);
>>
>> And because of "[PATCH 1/2] mm, oom: marks all killed tasks as oom
>> victims", mark_oom_victim() will be called on current thread even if
>> we used mutex_lock_killable(&oom_lock) here, like you said
>>
>>   mutex_lock_killable would take care of exiting task already. I would
>>   then still prefer to check for mark_oom_victim because that is not racy
>>   with the exit path clearing signals. I can update my patch to use
>>   _killable lock variant if we are really going with the memcg specific
>>   fix.
>>
>> . If current thread is not yet killed by the OOM killer but can terminate
>> without invoking the OOM killer, using mutex_lock_killable(&oom_lock) here
>> saves some processes. What is the race you are referring by "racy with the
>> exit path clearing signals" ?
> 
> This is unrelated to the patch.

Ultimately related! This is the reason why your patch should be preferred
over my patch.

For example, if memcg OOM events in different domains are pending, already
OOM-killed threads needlessly wait for pending memcg OOM events in those other
domains. An out_of_memory() call is slow because it involves printk(); with
slow serial consoles, out_of_memory() might take more than a second. I consider
that allowing killed processes to call mmput() from exit_mm() from do_exit()
quickly (instead of waiting for pending memcg OOM events in different domains
at mem_cgroup_out_of_memory()) helps call __mmput() (which can reclaim more
memory than the OOM reaper can) quickly. Unless what you call "racy" is
problematic, I don't see a reason not to apply my patch. So, please please
answer what you are referring to with "racy".



* Re: [PATCH 2/2] memcg: do not report racy no-eligible OOM tasks
  2019-01-08 10:39       ` Tetsuo Handa
@ 2019-01-08 11:46         ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2019-01-08 11:46 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, Johannes Weiner, Andrew Morton, LKML

On Tue 08-01-19 19:39:58, Tetsuo Handa wrote:
> On 2019/01/08 17:14, Michal Hocko wrote:
> >>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>> index af7f18b32389..90eb2e2093e7 100644
> >>> --- a/mm/memcontrol.c
> >>> +++ b/mm/memcontrol.c
> >>> @@ -1387,10 +1387,22 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >>>  		.gfp_mask = gfp_mask,
> >>>  		.order = order,
> >>>  	};
> >>> -	bool ret;
> >>> +	bool ret = true;
> >>>  
> >>>  	mutex_lock(&oom_lock);
> >>
> >> And because of "[PATCH 1/2] mm, oom: marks all killed tasks as oom
> >> victims", mark_oom_victim() will be called on current thread even if
> >> we used mutex_lock_killable(&oom_lock) here, like you said
> >>
> >>   mutex_lock_killable would take care of exiting task already. I would
> >>   then still prefer to check for mark_oom_victim because that is not racy
> >>   with the exit path clearing signals. I can update my patch to use
> >>   _killable lock variant if we are really going with the memcg specific
> >>   fix.
> >>
> >> . If the current thread is not yet killed by the OOM killer but can
> >> terminate without invoking the OOM killer, using mutex_lock_killable(&oom_lock)
> >> here saves some processes. What is the race you are referring to with "racy
> >> with the exit path clearing signals"?
> > 
> > This is unrelated to the patch.
> 
> Ultimately related! This is the reasoning why your patch should be preferred
> over my patch.

No! I've said I do not mind using mutex_lock_killable on top of this
patch. I just want to keep this fix minimal.

-- 
Michal Hocko
SUSE Labs


* [PATCH 3/2] memcg: Facilitate termination of memcg OOM victims.
  2019-01-07 14:38 [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM Michal Hocko
  2019-01-07 14:38 ` [PATCH 1/2] mm, oom: marks all killed tasks as oom victims Michal Hocko
  2019-01-07 14:38 ` [PATCH 2/2] memcg: do not report racy no-eligible OOM tasks Michal Hocko
@ 2019-01-08 14:21 ` Tetsuo Handa
  2019-01-08 14:38   ` Michal Hocko
  2019-01-09 11:03 ` [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM Michal Hocko
  3 siblings, 1 reply; 28+ messages in thread
From: Tetsuo Handa @ 2019-01-08 14:21 UTC (permalink / raw)
  To: Michal Hocko, linux-mm; +Cc: Johannes Weiner, Andrew Morton, LKML

From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

If memcg OOM events in different domains are pending, already OOM-killed
threads needlessly wait for pending memcg OOM events in other domains.
An out_of_memory() call is slow because it involves printk(). With slow
serial consoles, out_of_memory() might take more than a second. Therefore,
allowing killed processes to quickly call mmput() from exit_mm() in
do_exit() helps __mmput() (which can reclaim more memory than the OOM
reaper can) run quickly.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/memcontrol.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 90eb2e2..a7d3ba9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1389,14 +1389,19 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	};
 	bool ret = true;
 
-	mutex_lock(&oom_lock);
-
 	/*
-	 * multi-threaded tasks might race with oom_reaper and gain
-	 * MMF_OOM_SKIP before reaching out_of_memory which can lead
-	 * to out_of_memory failure if the task is the last one in
-	 * memcg which would be a false possitive failure reported
+	 * Multi-threaded tasks might race with oom_reaper() and gain
+	 * MMF_OOM_SKIP before reaching out_of_memory(). But if current
+	 * thread was already killed or is ready to terminate, there is
+	 * no need to call out_of_memory() nor wait for oom_reaper() to
+	 * set MMF_OOM_SKIP. These three checks minimize the possibility
+	 * of needlessly calling out_of_memory() and allow exit_mm() to
+	 * be called as soon as possible.
 	 */
+	if (mutex_lock_killable(&oom_lock))
+		return true;
+	if (fatal_signal_pending(current))
+		goto unlock;
 	if (tsk_is_oom_victim(current))
 		goto unlock;
 
-- 
1.8.3.1




* Re: [PATCH 3/2] memcg: Facilitate termination of memcg OOM victims.
  2019-01-08 14:21 ` [PATCH 3/2] memcg: Facilitate termination of memcg OOM victims Tetsuo Handa
@ 2019-01-08 14:38   ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2019-01-08 14:38 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, Johannes Weiner, Andrew Morton, LKML

On Tue 08-01-19 23:21:23, Tetsuo Handa wrote:
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> 
> If memcg OOM events in different domains are pending, already OOM-killed
> threads needlessly wait for pending memcg OOM events in other domains.
> An out_of_memory() call is slow because it involves printk(). With slow
> serial consoles, out_of_memory() might take more than a second. Therefore,
> allowing killed processes to quickly call mmput() from exit_mm() in
> do_exit() helps __mmput() (which can reclaim more memory than the OOM
> reaper can) run quickly.

Can you post it separately out of this thread please? It is really a
separate topic and I do not want to end up going back and forth without
making further progress.
 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  mm/memcontrol.c | 17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 90eb2e2..a7d3ba9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1389,14 +1389,19 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	};
>  	bool ret = true;
>  
> -	mutex_lock(&oom_lock);
> -
>  	/*
> -	 * multi-threaded tasks might race with oom_reaper and gain
> -	 * MMF_OOM_SKIP before reaching out_of_memory which can lead
> -	 * to out_of_memory failure if the task is the last one in
> -	 * memcg which would be a false possitive failure reported
> +	 * Multi-threaded tasks might race with oom_reaper() and gain
> +	 * MMF_OOM_SKIP before reaching out_of_memory(). But if current
> +	 * thread was already killed or is ready to terminate, there is
> +	 * no need to call out_of_memory() nor wait for oom_reaper() to
> +	 * set MMF_OOM_SKIP. These three checks minimize the possibility
> +	 * of needlessly calling out_of_memory() and allow exit_mm() to
> +	 * be called as soon as possible.
>  	 */
> +	if (mutex_lock_killable(&oom_lock))
> +		return true;
> +	if (fatal_signal_pending(current))
> +		goto unlock;
>  	if (tsk_is_oom_victim(current))
>  		goto unlock;
>  
> -- 
> 1.8.3.1
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM
  2019-01-07 14:38 [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM Michal Hocko
                   ` (2 preceding siblings ...)
  2019-01-08 14:21 ` [PATCH 3/2] memcg: Facilitate termination of memcg OOM victims Tetsuo Handa
@ 2019-01-09 11:03 ` Michal Hocko
  2019-01-09 11:34   ` Tetsuo Handa
  3 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2019-01-09 11:03 UTC (permalink / raw)
  To: linux-mm; +Cc: Tetsuo Handa, Johannes Weiner, Andrew Morton, LKML

Tetsuo,
can you confirm that these two patches are fixing the issue you have
reported please?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM
  2019-01-09 11:03 ` [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM Michal Hocko
@ 2019-01-09 11:34   ` Tetsuo Handa
  2019-01-09 12:02     ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Tetsuo Handa @ 2019-01-09 11:34 UTC (permalink / raw)
  To: Michal Hocko, linux-mm; +Cc: Johannes Weiner, Andrew Morton, LKML

On 2019/01/09 20:03, Michal Hocko wrote:
> Tetsuo,
> can you confirm that these two patches are fixing the issue you have
> reported please?
> 

My patch fixes the issue better than your "[PATCH 2/2] memcg: do not report racy no-eligible OOM tasks" does.

You can post "[PATCH 1/2] mm, oom: marks all killed tasks as oom victims"
based on a report that we needlessly select more OOM victims because
MMF_OOM_SKIP is quickly set by the OOM reaper. In fact, updating
oom_reap_task() / exit_mmap() to use

  mutex_lock(&oom_lock);
  set_bit(MMF_OOM_SKIP, &mm->flags);
  mutex_unlock(&oom_lock);

will mostly close the race as well.


* Re: [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM
  2019-01-09 11:34   ` Tetsuo Handa
@ 2019-01-09 12:02     ` Michal Hocko
  2019-01-10 23:59       ` Tetsuo Handa
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2019-01-09 12:02 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, Johannes Weiner, Andrew Morton, LKML

On Wed 09-01-19 20:34:46, Tetsuo Handa wrote:
> On 2019/01/09 20:03, Michal Hocko wrote:
> > Tetsuo,
> > can you confirm that these two patches are fixing the issue you have
> > reported please?
> > 
> 
> My patch fixes the issue better than your "[PATCH 2/2] memcg: do not
> report racy no-eligible OOM tasks" does.

OK, so we are stuck again. Hooray!
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM
  2019-01-09 12:02     ` Michal Hocko
@ 2019-01-10 23:59       ` Tetsuo Handa
  2019-01-11 10:25         ` Tetsuo Handa
  0 siblings, 1 reply; 28+ messages in thread
From: Tetsuo Handa @ 2019-01-10 23:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michal Hocko, linux-mm, Johannes Weiner, LKML

Michal Hocko wrote:
> On Wed 09-01-19 20:34:46, Tetsuo Handa wrote:
> > On 2019/01/09 20:03, Michal Hocko wrote:
> > > Tetsuo,
> > > can you confirm that these two patches are fixing the issue you have
> > > reported please?
> > > 
> > 
> > My patch fixes the issue better than your "[PATCH 2/2] memcg: do not
> > report racy no-eligible OOM tasks" does.
> 
> OK, so we are stuck again. Hooray!

Andrew, will you pick up "[PATCH 3/2] memcg: Facilitate termination of memcg OOM victims." ?
Since mm-oom-marks-all-killed-tasks-as-oom-victims.patch does not call mark_oom_victim()
when task_will_free_mem() == true, memcg-do-not-report-racy-no-eligible-oom-tasks.patch
does not close the race whereas my patch closes the race better.


* Re: [kbuild-all] [PATCH 2/2] memcg: do not report racy no-eligible OOM tasks
  2019-01-08  9:39     ` Michal Hocko
@ 2019-01-11  0:23       ` " Rong Chen
  0 siblings, 0 replies; 28+ messages in thread
From: Rong Chen @ 2019-01-11  0:23 UTC (permalink / raw)
  To: Michal Hocko, kbuild test robot
  Cc: Tetsuo Handa, LKML, linux-mm, kbuild-all, Johannes Weiner, Andrew Morton



On 01/08/2019 05:39 PM, Michal Hocko wrote:
> On Tue 08-01-19 16:35:42, kbuild test robot wrote:
> [...]
>> All warnings (new ones prefixed by >>):
>>
>>     include/linux/rcupdate.h:659:9: warning: context imbalance in 'find_lock_task_mm' - wrong count at exit
>>     include/linux/sched/mm.h:141:37: warning: dereference of noderef expression
>>     mm/oom_kill.c:225:28: warning: context imbalance in 'oom_badness' - unexpected unlock
>>     mm/oom_kill.c:406:9: warning: context imbalance in 'dump_tasks' - different lock contexts for basic block
>>>> mm/oom_kill.c:918:17: warning: context imbalance in '__oom_kill_process' - unexpected unlock
> What exactly does this warning say? I do not see anything wrong about
> the code. find_lock_task_mm returns a locked task when t != NULL and
> mark_oom_victim doesn't do anything about the locking. Am I missing
> something or the warning is just confused?

Thanks for your reply. It looks like a false positive. We'll look into it.

Best Regards,
Rong Chen

>
> [...]
>> 00508538 Michal Hocko          2019-01-07  915  		t = find_lock_task_mm(p);
>> 00508538 Michal Hocko          2019-01-07  916  		if (!t)
>> 00508538 Michal Hocko          2019-01-07  917  			continue;
>> 00508538 Michal Hocko          2019-01-07 @918  		mark_oom_victim(t);
>> 00508538 Michal Hocko          2019-01-07  919  		task_unlock(t);
>> 647f2bdf David Rientjes        2012-03-21  920  	}



* Re: [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM
  2019-01-10 23:59       ` Tetsuo Handa
@ 2019-01-11 10:25         ` Tetsuo Handa
  2019-01-11 11:33           ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Tetsuo Handa @ 2019-01-11 10:25 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko; +Cc: linux-mm, Johannes Weiner, LKML

On 2019/01/11 8:59, Tetsuo Handa wrote:
> Michal Hocko wrote:
>> On Wed 09-01-19 20:34:46, Tetsuo Handa wrote:
>>> On 2019/01/09 20:03, Michal Hocko wrote:
>>>> Tetsuo,
>>>> can you confirm that these two patches are fixing the issue you have
>>>> reported please?
>>>>
>>>
>>> My patch fixes the issue better than your "[PATCH 2/2] memcg: do not
>>> report racy no-eligible OOM tasks" does.
>>
>> OK, so we are stuck again. Hooray!
> 
> Andrew, will you pick up "[PATCH 3/2] memcg: Facilitate termination of memcg OOM victims." ?
> Since mm-oom-marks-all-killed-tasks-as-oom-victims.patch does not call mark_oom_victim()
> when task_will_free_mem() == true, memcg-do-not-report-racy-no-eligible-oom-tasks.patch
> does not close the race whereas my patch closes the race better.
> 

I confirmed that mm-oom-marks-all-killed-tasks-as-oom-victims.patch and
memcg-do-not-report-racy-no-eligible-oom-tasks.patch are completely failing
to fix the issue I am reporting. :-(

Reproducer:
----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <sys/mman.h>

#define NUMTHREADS 256
#define MMAPSIZE 4 * 10485760
#define STACKSIZE 4096
static int pipe_fd[2] = { EOF, EOF };
static int memory_eater(void *unused)
{
	int fd = open("/dev/zero", O_RDONLY);
	char *buf = mmap(NULL, MMAPSIZE, PROT_WRITE | PROT_READ,
			 MAP_ANONYMOUS | MAP_SHARED, EOF, 0);
	read(pipe_fd[0], buf, 1);
	read(fd, buf, MMAPSIZE);
	pause();
	return 0;
}
int main(int argc, char *argv[])
{
	int i;
	char *stack;
	FILE *fp;
	const unsigned long size = 1048576UL * 200;
	mkdir("/sys/fs/cgroup/memory/test1", 0755);
	fp = fopen("/sys/fs/cgroup/memory/test1/memory.limit_in_bytes", "w");
	fprintf(fp, "%lu\n", size);
	fclose(fp);
	fp = fopen("/sys/fs/cgroup/memory/test1/tasks", "w");
	fprintf(fp, "%u\n", getpid());
	fclose(fp);
	if (setgid(-2) || setuid(-2) || pipe(pipe_fd))
		return 1;
	stack = mmap(NULL, STACKSIZE * NUMTHREADS, PROT_WRITE | PROT_READ,
		     MAP_ANONYMOUS | MAP_SHARED, EOF, 0);
	for (i = 0; i < NUMTHREADS; i++)
		if (clone(memory_eater, stack + (i + 1) * STACKSIZE,
			  CLONE_VM | CLONE_FS | CLONE_FILES, NULL) == -1)
			break;
	close(pipe_fd[1]);
	pause(); // Manually enter Ctrl-C immediately after dump_header() started.
	return 0;
}
----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20190111.txt.xz :
----------
[   71.146532][ T9694] a.out invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[   71.151647][ T9694] CPU: 1 PID: 9694 Comm: a.out Kdump: loaded Not tainted 5.0.0-rc1-next-20190111 #272
(...snipped...)
[   71.304689][ T9694] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/test1,task_memcg=/test1,task=a.out,pid=9692,uid=-2
[   71.304703][ T9694] Memory cgroup out of memory: Kill process 9692 (a.out) score 904 or sacrifice child
[   71.309149][   T54] oom_reaper: reaped process 9750 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:185532kB
[   71.328523][ T9748] a.out invoked oom-killer: gfp_mask=0x7080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), order=0, oom_score_adj=0
[   71.328552][ T9748] CPU: 4 PID: 9748 Comm: a.out Kdump: loaded Not tainted 5.0.0-rc1-next-20190111 #272
(...snipped...)
[   71.328785][ T9748] Out of memory and no killable processes...
[   71.329194][ T9771] a.out invoked oom-killer: gfp_mask=0x7080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), order=0, oom_score_adj=0
(...snipped...)
[   99.696592][ T9924] Out of memory and no killable processes...
[   99.699001][ T9838] a.out invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
(...snipped...)
[   99.833413][ T9838] Out of memory and no killable processes...
----------

$ grep -F 'Out of memory and no killable processes...' serial-20190111.txt | wc -l
213


* Re: [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM
  2019-01-11 10:25         ` Tetsuo Handa
@ 2019-01-11 11:33           ` Michal Hocko
  2019-01-11 12:40             ` Tetsuo Handa
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2019-01-11 11:33 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Andrew Morton, linux-mm, Johannes Weiner, LKML

On Fri 11-01-19 19:25:22, Tetsuo Handa wrote:
> On 2019/01/11 8:59, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> >> On Wed 09-01-19 20:34:46, Tetsuo Handa wrote:
> >>> On 2019/01/09 20:03, Michal Hocko wrote:
> >>>> Tetsuo,
> >>>> can you confirm that these two patches are fixing the issue you have
> >>>> reported please?
> >>>>
> >>>
> >>> My patch fixes the issue better than your "[PATCH 2/2] memcg: do not
> >>> report racy no-eligible OOM tasks" does.
> >>
> >> OK, so we are stuck again. Hooray!
> > 
> > Andrew, will you pick up "[PATCH 3/2] memcg: Facilitate termination of memcg OOM victims." ?
> > Since mm-oom-marks-all-killed-tasks-as-oom-victims.patch does not call mark_oom_victim()
> > when task_will_free_mem() == true, memcg-do-not-report-racy-no-eligible-oom-tasks.patch
> > does not close the race whereas my patch closes the race better.
> > 
> 
> I confirmed that mm-oom-marks-all-killed-tasks-as-oom-victims.patch and
> memcg-do-not-report-racy-no-eligible-oom-tasks.patch are completely failing
> to fix the issue I am reporting. :-(

OK, this is really interesting. This means that we are racing
when marking all the tasks sharing the mm with the clone syscall.
Does fatal_signal_pending handle this better?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM
  2019-01-11 11:33           ` Michal Hocko
@ 2019-01-11 12:40             ` Tetsuo Handa
  2019-01-11 13:34               ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Tetsuo Handa @ 2019-01-11 12:40 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, linux-mm, Johannes Weiner, LKML

On 2019/01/11 20:33, Michal Hocko wrote:
> On Fri 11-01-19 19:25:22, Tetsuo Handa wrote:
>> On 2019/01/11 8:59, Tetsuo Handa wrote:
>>> Michal Hocko wrote:
>>>> On Wed 09-01-19 20:34:46, Tetsuo Handa wrote:
>>>>> On 2019/01/09 20:03, Michal Hocko wrote:
>>>>>> Tetsuo,
>>>>>> can you confirm that these two patches are fixing the issue you have
>>>>>> reported please?
>>>>>>
>>>>>
>>>>> My patch fixes the issue better than your "[PATCH 2/2] memcg: do not
>>>>> report racy no-eligible OOM tasks" does.
>>>>
>>>> OK, so we are stuck again. Hooray!
>>>
>>> Andrew, will you pick up "[PATCH 3/2] memcg: Facilitate termination of memcg OOM victims." ?
>>> Since mm-oom-marks-all-killed-tasks-as-oom-victims.patch does not call mark_oom_victim()
>>> when task_will_free_mem() == true, memcg-do-not-report-racy-no-eligible-oom-tasks.patch
>>> does not close the race whereas my patch closes the race better.
>>>
>>
>> I confirmed that mm-oom-marks-all-killed-tasks-as-oom-victims.patch and
>> memcg-do-not-report-racy-no-eligible-oom-tasks.patch are completely failing
>> to fix the issue I am reporting. :-(
> 
> OK, this is really interesting. This means that we are racing
> when marking all the tasks sharing the mm with the clone syscall.

Nothing interesting. This is NOT a race between clone() and the OOM killer. :-(
By the time the OOM killer is invoked, all clone() requests have already completed.

Did you notice that there is no

  "Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n"

line between

  [   71.304703][ T9694] Memory cgroup out of memory: Kill process 9692 (a.out) score 904 or sacrifice child

and

  [   71.309149][   T54] oom_reaper: reaped process 9750 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:185532kB

? Then, you will find that [ T9694] failed to reach for_each_process(p) loop inside
__oom_kill_process() in the first round of out_of_memory() call because
find_lock_task_mm() == NULL at __oom_kill_process() because Ctrl-C made that victim
complete exit_mm() before find_lock_task_mm() is called. Then, in the second round
of out_of_memory() call, [ T9750] (which is fatal_signal_pending() == T &&
tsk_is_oom_victim() == F) hit task_will_free_mem(current) path and called
mark_oom_victim() and woke up the OOM reaper. Then, before the third round of
out_of_memory() call starts, the OOM reaper set MMF_OOM_SKIP. When the third round
of out_of_memory() call started, [ T9748] could not hit task_will_free_mem(current)
path because MMF_OOM_SKIP was already set, and oom_badness() ignored any mm which
already has MMF_OOM_SKIP. As a result, [ T9748] failed to find a candidate. And this
step repeats for up to the number of threads (213 times for this run).

> Does fatal_signal_pending handle this better?
> 

Of course. My patch handles it perfectly. Even if we raced with clone() requests,
why do we need to care about threads doing clone() requests? Such threads are not
inside try_charge(), and therefore such threads can't contribute to this issue
by calling out_of_memory() from try_charge().


* Re: [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM
  2019-01-11 12:40             ` Tetsuo Handa
@ 2019-01-11 13:34               ` Michal Hocko
  2019-01-11 14:31                 ` Tetsuo Handa
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2019-01-11 13:34 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Andrew Morton, linux-mm, Johannes Weiner, LKML

On Fri 11-01-19 21:40:52, Tetsuo Handa wrote:
[...]
> Did you notice that there is no
> 
>   "Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n"
> 
> line between
> 
>   [   71.304703][ T9694] Memory cgroup out of memory: Kill process 9692 (a.out) score 904 or sacrifice child
> 
> and
> 
>   [   71.309149][   T54] oom_reaper: reaped process 9750 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:185532kB
> 
> ? Then, you will find that [ T9694] failed to reach for_each_process(p) loop inside
> __oom_kill_process() in the first round of out_of_memory() call because
> find_lock_task_mm() == NULL at __oom_kill_process() because Ctrl-C made that victim
> complete exit_mm() before find_lock_task_mm() is called.

OK, so we haven't killed anything because the victim has exited by the
time we wanted to do so. We still have other tasks sharing that mm
pending and not killed because nothing has killed them yet, right?

How come the oom reaper could act on this oom event at all then?

What am I missing?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM
  2019-01-11 13:34               ` Michal Hocko
@ 2019-01-11 14:31                 ` Tetsuo Handa
  2019-01-11 15:07                   ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Tetsuo Handa @ 2019-01-11 14:31 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, linux-mm, Johannes Weiner, LKML

On 2019/01/11 22:34, Michal Hocko wrote:
> On Fri 11-01-19 21:40:52, Tetsuo Handa wrote:
> [...]
>> Did you notice that there is no
>>
>>   "Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n"
>>
>> line between
>>
>>   [   71.304703][ T9694] Memory cgroup out of memory: Kill process 9692 (a.out) score 904 or sacrifice child
>>
>> and
>>
>>   [   71.309149][   T54] oom_reaper: reaped process 9750 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:185532kB
>>
>> ? Then, you will find that [ T9694] failed to reach for_each_process(p) loop inside
>> __oom_kill_process() in the first round of out_of_memory() call because
>> find_lock_task_mm() == NULL at __oom_kill_process() because Ctrl-C made that victim
>> complete exit_mm() before find_lock_task_mm() is called.
> 
> OK, so we haven't killed anything because the victim has exited by the
> time we wanted to do so. We still have other tasks sharing that mm
> pending and not killed because nothing has killed them yet, right?

The OOM killer invoked by [ T9694] called printk() but didn't kill anything.
Instead, SIGINT from Ctrl-C killed all thread groups sharing current->mm.

> 
> How come the oom reaper could act on this oom event at all then?
> 
> What am I missing?
> 

The OOM killer invoked by [ T9750] did not call printk() but hit
task_will_free_mem(current) in out_of_memory() and invoked the OOM reaper,
without calling mark_oom_victim() on all thread groups sharing current->mm.
Did you notice that I wrote that

  Since mm-oom-marks-all-killed-tasks-as-oom-victims.patch does not call mark_oom_victim()
  when task_will_free_mem() == true,

? :-(



* Re: [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM
  2019-01-11 14:31                 ` Tetsuo Handa
@ 2019-01-11 15:07                   ` Michal Hocko
  2019-01-11 15:37                     ` Tetsuo Handa
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2019-01-11 15:07 UTC (permalink / raw)
  To: Andrew Morton, Tetsuo Handa; +Cc: linux-mm, Johannes Weiner, LKML

On Fri 11-01-19 23:31:18, Tetsuo Handa wrote:
> On 2019/01/11 22:34, Michal Hocko wrote:
> > On Fri 11-01-19 21:40:52, Tetsuo Handa wrote:
> > [...]
> >> Did you notice that there is no
> >>
> >>   "Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n"
> >>
> >> line between
> >>
> >>   [   71.304703][ T9694] Memory cgroup out of memory: Kill process 9692 (a.out) score 904 or sacrifice child
> >>
> >> and
> >>
> >>   [   71.309149][   T54] oom_reaper: reaped process 9750 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:185532kB
> >>
> >> ? Then, you will find that [ T9694] failed to reach for_each_process(p) loop inside
> >> __oom_kill_process() in the first round of out_of_memory() call because
> >> find_lock_task_mm() == NULL at __oom_kill_process() because Ctrl-C made that victim
> >> complete exit_mm() before find_lock_task_mm() is called.
> > 
> > OK, so we haven't killed anything because the victim has exited by the
> > time we wanted to do so. We still have other tasks sharing that mm
> > pending and not killed because nothing has killed them yet, right?
> 
> The OOM killer invoked by [ T9694] called printk() but didn't kill anything.
> Instead, SIGINT from Ctrl-C killed all thread groups sharing current->mm.

I still do not get it. Those other processes are not sharing signals.
Or is it due to injecting the signal into all of them with the proper
timing?
 
> > How come the oom reaper could act on this oom event at all then?
> > 
> > What am I missing?
> > 
> 
> The OOM killer invoked by [ T9750] did not call printk() but hit
> task_will_free_mem(current) in out_of_memory() and invoked the OOM reaper,
> without calling mark_oom_victim() on all thread groups sharing current->mm.
> Did you notice that I wrote that

OK, now it finally starts making sense to me. I got hung up on
find_lock_task_mm failing in __oom_kill_process because we do see 
"Memory cgroup out of memory" and that happens _after_
task_will_free_mem. So the whole oom_reaper scenario didn't make much
sense to me.

>   Since mm-oom-marks-all-killed-tasks-as-oom-victims.patch does not call mark_oom_victim()
>   when task_will_free_mem() == true,
> 
> ? :-(

No, I got lost in your writeup.

While the task_will_free_mem case is fixable, that would get us to even
uglier code, so I agree that the approach taken by my two patches is not
feasible.

I really wanted to have this heuristic based on the oom victim rather
than signal pending because one lesson I've learned over time was that
checks for fatal signals can lead to odd corner cases. Memcg is less
prone to those issues because we can bypass the charge but still.

Anyway, could you update your patch and abstract 
	if (unlikely(tsk_is_oom_victim(current) ||
		     fatal_signal_pending(current) ||
		     current->flags & PF_EXITING))

in try_charge and reuse it in mem_cgroup_out_of_memory under the
oom_lock with an explanation please?

Andrew, please drop my 2 patches.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM
  2019-01-11 15:07                   ` Michal Hocko
@ 2019-01-11 15:37                     ` Tetsuo Handa
  2019-01-11 16:45                       ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Tetsuo Handa @ 2019-01-11 15:37 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton; +Cc: linux-mm, Johannes Weiner, LKML

On 2019/01/12 0:07, Michal Hocko wrote:
> On Fri 11-01-19 23:31:18, Tetsuo Handa wrote:
>> The OOM killer invoked by [ T9694] called printk() but didn't kill anything.
>> Instead, SIGINT from Ctrl-C killed all thread groups sharing current->mm.
> 
> I still do not get it. Those other processes are not sharing signals.
> Or is it due to injecting the signal too all of them with the proper
> timing?

Pressing Ctrl-C between after task_will_free_mem(p) in oom_kill_process() and
before __oom_kill_process() (e.g. dump_header()) made fatal_signal_pending() = T
for all of them.

> Anyway, could you update your patch and abstract 
> 	if (unlikely(tsk_is_oom_victim(current) ||
> 		     fatal_signal_pending(current) ||
> 		     current->flags & PF_EXITING))
> 
> in try_charge and reuse it in mem_cgroup_out_of_memory under the
> oom_lock with an explanation please?

I don't think doing so makes sense, because

  tsk_is_oom_victim(current) == T && fatal_signal_pending(current) == F

can't happen for mem_cgroup_out_of_memory() under the oom_lock, and
current->flags cannot get PF_EXITING while current is inside
mem_cgroup_out_of_memory(). fatal_signal_pending(current) alone is
appropriate for mem_cgroup_out_of_memory() under the oom_lock because

  tsk_is_oom_victim(current) == F && fatal_signal_pending(current) == T

can happen there.

Also, doing so might become wrong in the future, because
mem_cgroup_out_of_memory() is also called from memory_max_write(), which
does not bail out upon PF_EXITING. I don't think we can call
memory_max_write() after the current thread gets PF_EXITING, but nobody
knows what changes will happen in the future.



* Re: [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM
  2019-01-11 15:37                     ` Tetsuo Handa
@ 2019-01-11 16:45                       ` Michal Hocko
  2019-01-12 10:52                         ` Tetsuo Handa
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2019-01-11 16:45 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Andrew Morton, linux-mm, Johannes Weiner, LKML

On Sat 12-01-19 00:37:05, Tetsuo Handa wrote:
> On 2019/01/12 0:07, Michal Hocko wrote:
> > On Fri 11-01-19 23:31:18, Tetsuo Handa wrote:
> >> The OOM killer invoked by [ T9694] called printk() but didn't kill anything.
> >> Instead, SIGINT from Ctrl-C killed all thread groups sharing current->mm.
> > 
> > I still do not get it. Those other processes are not sharing signals.
> > Or is it due to injecting the signal to all of them with the proper
> > timing?
> 
> Pressing Ctrl-C in the window after the task_will_free_mem(p) check in
> oom_kill_process() and before __oom_kill_process() (e.g. during
> dump_header()) made fatal_signal_pending() = T for all of them.
> 
> > Anyway, could you update your patch and abstract 
> > 	if (unlikely(tsk_is_oom_victim(current) ||
> > 		     fatal_signal_pending(current) ||
> > 		     current->flags & PF_EXITING))
> > 
> > in try_charge and reuse it in mem_cgroup_out_of_memory under the
> > oom_lock with an explanation please?
> 
> I don't think doing so makes sense, for
> 
>   tsk_is_oom_victim(current) = T && fatal_signal_pending(current) == F
> 
> can't happen for mem_cgroup_out_of_memory() under the oom_lock, and
> current->flags cannot get PF_EXITING when current is inside
> mem_cgroup_out_of_memory(). fatal_signal_pending(current) alone is
> appropriate for mem_cgroup_out_of_memory() under the oom_lock because
> 
>   tsk_is_oom_victim(current) = F && fatal_signal_pending(current) == T
> 
> can happen there.

I meant to use the same check consistently. If we can bypass the charge
under a list of conditions in the charge path, we should surely be able
to do the same for the oom path. I will not insist, but unless there is
a strong reason I would prefer that.

> Also, doing so might become wrong in the future, for mem_cgroup_out_of_memory()
> is also called from memory_max_write(), which does not bail out upon
> PF_EXITING. I don't think memory_max_write() can be called after the current
> thread got PF_EXITING, but nobody knows what changes will happen in the future.

No, this makes no sense whatsoever.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM
  2019-01-11 16:45                       ` Michal Hocko
@ 2019-01-12 10:52                         ` Tetsuo Handa
  2019-01-13 17:36                           ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Tetsuo Handa @ 2019-01-12 10:52 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Andrew Morton, linux-mm, Johannes Weiner, LKML

On 2019/01/12 1:45, Michal Hocko wrote:
>>> Anyway, could you update your patch and abstract 
>>> 	if (unlikely(tsk_is_oom_victim(current) ||
>>> 		     fatal_signal_pending(current) ||
>>> 		     current->flags & PF_EXITING))
>>>
>>> in try_charge and reuse it in mem_cgroup_out_of_memory under the
>>> oom_lock with an explanation please?
>>
>> I don't think doing so makes sense, for
>>
>>   tsk_is_oom_victim(current) = T && fatal_signal_pending(current) == F
>>
>> can't happen for mem_cgroup_out_of_memory() under the oom_lock, and
>> current->flags cannot get PF_EXITING when current is inside
>> mem_cgroup_out_of_memory(). fatal_signal_pending(current) alone is
>> appropriate for mem_cgroup_out_of_memory() under the oom_lock because
>>
>>   tsk_is_oom_victim(current) = F && fatal_signal_pending(current) == T
>>
>> can happen there.
> 
> I meant to use the same check consistently. If we can bypass the charge
> under a list of conditions in the charge path, we should surely be able
> to do the same for the oom path. I will not insist, but unless there is
> a strong reason I would prefer that.
> 

You mean something like this? I'm not sure this change is safe.

 mm/memcontrol.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 17189da..1733d019 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -248,6 +248,12 @@ enum res_type {
 	     iter != NULL;				\
 	     iter = mem_cgroup_iter(NULL, iter, NULL))
 
+static inline bool can_ignore_limit(void)
+{
+	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
+		(current->flags & PF_EXITING);
+}
+
 /* Some nice accessors for the vmpressure. */
 struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
 {
@@ -1395,7 +1401,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.
 	 */
-	ret = fatal_signal_pending(current) || out_of_memory(&oc);
+	ret = can_ignore_limit() || out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
 	return ret;
 }
@@ -1724,6 +1730,10 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
 
 	mem_cgroup_unmark_under_oom(memcg);
 	if (mem_cgroup_out_of_memory(memcg, mask, order))
+		/*
+		 * Returning OOM_SUCCESS upon can_ignore_limit() is OK, for
+		 * the caller will check can_ignore_limit() again.
+		 */
 		ret = OOM_SUCCESS;
 	else
 		ret = OOM_FAILED;
@@ -1783,6 +1793,11 @@ bool mem_cgroup_oom_synchronize(bool handle)
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 		mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
 					 current->memcg_oom_order);
+		/*
+		 * Returning upon can_ignore_limit() is OK, for the caller is
+		 * already killed... CheckMe: Is this assumption correct?
+		 * Page fault can't happen after getting PF_EXITING?
+		 */
 	} else {
 		schedule();
 		mem_cgroup_unmark_under_oom(memcg);
@@ -2215,9 +2230,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * bypass the last charges so that they can exit quickly and
 	 * free their memory.
 	 */
-	if (unlikely(tsk_is_oom_victim(current) ||
-		     fatal_signal_pending(current) ||
-		     current->flags & PF_EXITING))
+	if (unlikely(can_ignore_limit()))
 		goto force;
 
 	/*
@@ -5527,6 +5540,12 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 		memcg_memory_event(memcg, MEMCG_OOM);
 		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
 			break;
+		/*
+		 * Strictly speaking this check is redundant, for the
+		 * signal_pending(current) check above will break anyway.
+		 */
+		if (unlikely(can_ignore_limit()))
+			break;
 	}
 
 	memcg_wb_domain_size_changed(memcg);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM
  2019-01-12 10:52                         ` Tetsuo Handa
@ 2019-01-13 17:36                           ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2019-01-13 17:36 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Andrew Morton, linux-mm, Johannes Weiner, LKML

On Sat 12-01-19 19:52:50, Tetsuo Handa wrote:
> On 2019/01/12 1:45, Michal Hocko wrote:
> >>> Anyway, could you update your patch and abstract 
> >>> 	if (unlikely(tsk_is_oom_victim(current) ||
> >>> 		     fatal_signal_pending(current) ||
> >>> 		     current->flags & PF_EXITING))
> >>>
> >>> in try_charge and reuse it in mem_cgroup_out_of_memory under the
> >>> oom_lock with an explanation please?
> >>
> >> I don't think doing so makes sense, for
> >>
> >>   tsk_is_oom_victim(current) = T && fatal_signal_pending(current) == F
> >>
> >> can't happen for mem_cgroup_out_of_memory() under the oom_lock, and
> >> current->flags cannot get PF_EXITING when current is inside
> >> mem_cgroup_out_of_memory(). fatal_signal_pending(current) alone is
> >> appropriate for mem_cgroup_out_of_memory() under the oom_lock because
> >>
> >>   tsk_is_oom_victim(current) = F && fatal_signal_pending(current) == T
> >>
> >> can happen there.
> > 
> > I meant to use the same check consistently. If we can bypass the charge
> > under a list of conditions in the charge path, we should surely be able
> > to do the same for the oom path. I will not insist, but unless there is
> > a strong reason I would prefer that.
> > 
> 
> You mean something like this? I'm not sure this change is safe.
> 
>  mm/memcontrol.c | 27 +++++++++++++++++++++++----
>  1 file changed, 23 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 17189da..1733d019 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -248,6 +248,12 @@ enum res_type {
>  	     iter != NULL;				\
>  	     iter = mem_cgroup_iter(NULL, iter, NULL))
>  
> +static inline bool can_ignore_limit(void)
> +{
> +	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
> +		(current->flags & PF_EXITING);
> +}
> +
>  /* Some nice accessors for the vmpressure. */
>  struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
>  {
> @@ -1395,7 +1401,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * A few threads which were not waiting at mutex_lock_killable() can
>  	 * fail to bail out. Therefore, check again after holding oom_lock.
>  	 */
> -	ret = fatal_signal_pending(current) || out_of_memory(&oc);
> +	ret = can_ignore_limit() || out_of_memory(&oc);
>  	mutex_unlock(&oom_lock);
>  	return ret;
>  }
> @@ -2215,9 +2230,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * bypass the last charges so that they can exit quickly and
>  	 * free their memory.
>  	 */
> -	if (unlikely(tsk_is_oom_victim(current) ||
> -		     fatal_signal_pending(current) ||
> -		     current->flags & PF_EXITING))
> +	if (unlikely(can_ignore_limit()))
>  		goto force;
>  
>  	/*

I meant something as simple as this, indeed. I would just
s@can_ignore_limit@should_force_charge@ but this is a minor thing.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread

Thread overview: 28+ messages
2019-01-07 14:38 [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM Michal Hocko
2019-01-07 14:38 ` [PATCH 1/2] mm, oom: marks all killed tasks as oom victims Michal Hocko
2019-01-07 20:58   ` Tetsuo Handa
2019-01-08  8:11     ` Michal Hocko
2019-01-07 14:38 ` [PATCH 2/2] memcg: do not report racy no-eligible OOM tasks Michal Hocko
2019-01-07 20:59   ` Tetsuo Handa
2019-01-08  8:14     ` Michal Hocko
2019-01-08 10:39       ` Tetsuo Handa
2019-01-08 11:46         ` Michal Hocko
2019-01-08  8:35   ` kbuild test robot
2019-01-08  9:39     ` Michal Hocko
2019-01-11  0:23       ` [kbuild-all] " Rong Chen
2019-01-08 14:21 ` [PATCH 3/2] memcg: Facilitate termination of memcg OOM victims Tetsuo Handa
2019-01-08 14:38   ` Michal Hocko
2019-01-09 11:03 ` [PATCH 0/2] oom, memcg: do not report racy no-eligible OOM Michal Hocko
2019-01-09 11:34   ` Tetsuo Handa
2019-01-09 12:02     ` Michal Hocko
2019-01-10 23:59       ` Tetsuo Handa
2019-01-11 10:25         ` Tetsuo Handa
2019-01-11 11:33           ` Michal Hocko
2019-01-11 12:40             ` Tetsuo Handa
2019-01-11 13:34               ` Michal Hocko
2019-01-11 14:31                 ` Tetsuo Handa
2019-01-11 15:07                   ` Michal Hocko
2019-01-11 15:37                     ` Tetsuo Handa
2019-01-11 16:45                       ` Michal Hocko
2019-01-12 10:52                         ` Tetsuo Handa
2019-01-13 17:36                           ` Michal Hocko
