* Re: pids.current with invalid value for hours [5.0.0 rc3 git]
From: Arkadiusz Miśkiewicz @ 2019-01-25  7:52 UTC (permalink / raw)
  To: cgroups
  Cc: Aleksa Sarai, Jay Kamat, Roman Gushchin, Michal Hocko,
	Johannes Weiner, linux-kernel

On 24/01/2019 12:21, Arkadiusz Miśkiewicz wrote:
> On 17/01/2019 14:17, Arkadiusz Miśkiewicz wrote:
>> On 17/01/2019 13:25, Aleksa Sarai wrote:
>>> On 2019-01-17, Arkadiusz Miśkiewicz <a.miskiewicz@gmail.com> wrote:
>>>> Using kernel 4.19.13.
>>>>
>>>> For one cgroup I noticed weird behaviour:
>>>>
>>>> # cat pids.current
>>>> 60
>>>> # cat cgroup.procs
>>>> #
>>>
>>> Are there any zombies in the cgroup? pids.current is linked up directly
>>> to __put_task_struct (so exit(2) won't decrease it, only the task_struct
>>> actually being freed will decrease it).
>>>
>>
>> There are no zombie processes.
>>
>> In the meantime the problem has shown up on multiple servers, and so
>> far I have seen it only in cgroups that were OOMed.
>>
>> What changed on these servers (yesterday) is turning on
>> memory.oom.group=1 for all cgroups and changing memory.high from 1G to
>> "max" (leaving only the memory.max=2G limit).
>>
>> Previously there was no such problem.
>>
> 
> I'm attaching a reproducer. This time I tried on a different
> distribution kernel (Arch Linux).
> 
> After 60s pids.current still shows 37 processes even though there are
> no processes running (according to ps aux).
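Aleksa's point is easy to see in isolation. The following is a minimal
sketch (not part of this thread's cg.py; it assumes root, cgroup2
mounted at /sys/fs/cgroup, and a made-up cgroup name): a child that has
exited but not been reaped disappears from cgroup.procs at exit(2), yet
keeps its pids charge until its task_struct is actually freed.

#!/usr/bin/env python3
# Sketch only: an unreaped zombie no longer shows in cgroup.procs but
# still holds a pids charge. Assumes root and cgroup2 at /sys/fs/cgroup.
import os, time

cg = "/sys/fs/cgroup/zombie_demo"   # hypothetical name
os.mkdir(cg)

pid = os.fork()
if pid == 0:
    # Child: move itself into the test cgroup, then exit. The parent
    # deliberately does not wait() yet, so the child becomes a zombie.
    with open(cg + "/cgroup.procs", "w") as f:
        f.write(str(os.getpid()))
    os._exit(0)

time.sleep(1)
print(open(cg + "/cgroup.procs").read())   # expected: empty
print(open(cg + "/pids.current").read())   # expected: 1 (zombie's charge)

os.waitpid(pid, 0)                         # reap the zombie
time.sleep(1)                              # task_struct free is RCU-delayed
print(open(cg + "/pids.current").read())   # expected: 0
os.rmdir(cg)

In the reports here, though, there were no zombies at all, which is what
makes a pids.current that stays nonzero for hours look like a genuine
task_struct refcount problem rather than ordinary zombie accounting.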


The same test on 5.0.0-rc3-00104-gc04e2a780caf makes it easy to
reproduce the bug: there are no processes in the cgroup, but
pids.current reports 91.

memory.oom.group=0 - everything works fine, pids are counted properly
memory.oom.group=1 - bug becomes visible

[root@xps test]# python3 cg.py
Created cgroup: /sys/fs/cgroup/test_5277
Start: pids.current: 0
Start: cgroup.procs:
0: pids.current: 103
0: cgroup.procs:
1: pids.current: 91
1: cgroup.procs:
2: pids.current: 91
2: cgroup.procs:
3: pids.current: 91
3: cgroup.procs:
4: pids.current: 91
4: cgroup.procs:
5: pids.current: 91
5: cgroup.procs:
6: pids.current: 91
6: cgroup.procs:
7: pids.current: 91
7: cgroup.procs:
8: pids.current: 91
8: cgroup.procs:
9: pids.current: 91
9: cgroup.procs:
10: pids.current: 91
10: cgroup.procs:
11: pids.current: 91
11: cgroup.procs:
[root@xps test]# uname -a
Linux xps 5.0.0-rc3-00104-gc04e2a780caf #288 SMP PREEMPT Thu Jan 24
19:00:32 CET 2019 x86_64 Intel(R)_Core(TM)_i9-8950HK_CPU_@_2.90GHz PLD Linux


Cc'ing relevant people.

The script is here: https://www.spinics.net/lists/cgroups/msg21330.html
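For readers without access to that link, here is a minimal sketch of
what such a reproducer can look like. This is not the actual cg.py, just
an illustration of the same idea (a memory-limited cgroup2 group with
memory.oom.group=1 plus a fork bomb, then polling pids.current); all
names and limit values are made up.

#!/usr/bin/env python3
# Sketch of a cg.py-style reproducer. Assumes root and cgroup2 mounted
# at /sys/fs/cgroup; the limit values are arbitrary.
import os, time

cg = "/sys/fs/cgroup/test_%d" % os.getpid()

def cgwrite(name, value):
    with open(os.path.join(cg, name), "w") as f:
        f.write(value)

def cgread(name):
    with open(os.path.join(cg, name)) as f:
        return f.read().strip()

os.mkdir(cg)
cgwrite("memory.max", "128M")      # small limit so the OOM killer fires fast
cgwrite("memory.oom.group", "1")   # the setting that makes the bug visible

pid = os.fork()
if pid == 0:
    # Child: enter the cgroup, then fork-bomb until the group is OOM-killed.
    cgwrite("cgroup.procs", str(os.getpid()))
    while True:
        try:
            if os.fork() == 0:
                kept = bytearray(8 * 1024 * 1024)  # each child pins memory
        except OSError:
            pass
        time.sleep(0.01)

print("Created cgroup:", cg)
for i in range(12):
    time.sleep(5)
    try:
        os.waitpid(-1, os.WNOHANG)  # reap the killed child; avoid zombies
    except ChildProcessError:
        pass
    print("%d: pids.current: %s" % (i, cgread("pids.current")))
    print("%d: cgroup.procs: %s" % (i, cgread("cgroup.procs")))

On an affected kernel the loop keeps printing a nonzero pids.current
with an empty cgroup.procs, matching the output above; on a fixed kernel
pids.current drops to 0 shortly after the group kill.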

> 
> [root@warm ~]# uname -a
> Linux warm 4.20.3-arch1-1-ARCH #1 SMP PREEMPT Wed Jan 16 22:38:58 UTC
> 2019 x86_64 GNU/Linux
> [root@warm ~]# python3 cg.py
> Created cgroup: /sys/fs/cgroup/test_26207
> Start: pids.current: 0
> Start: cgroup.procs:
> 0: pids.current: 62
> 0: cgroup.procs:
> 1: pids.current: 37
> 1: cgroup.procs:
> 2: pids.current: 37
> 2: cgroup.procs:
> 3: pids.current: 37
> 3: cgroup.procs:
> 4: pids.current: 37
> 4: cgroup.procs:
> 5: pids.current: 37
> 5: cgroup.procs:
> 6: pids.current: 37
> 6: cgroup.procs:
> 7: pids.current: 37
> 7: cgroup.procs:
> 8: pids.current: 37
> 8: cgroup.procs:
> 9: pids.current: 37
> 9: cgroup.procs:
> 10: pids.current: 37
> 10: cgroup.procs:
> 11: pids.current: 37
> 11: cgroup.procs:
> 


-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )


* Re: pids.current with invalid value for hours [5.0.0 rc3 git]
From: Tejun Heo @ 2019-01-25 16:37 UTC (permalink / raw)
  To: Arkadiusz Miśkiewicz
  Cc: cgroups, Aleksa Sarai, Jay Kamat, Roman Gushchin, Michal Hocko,
	Johannes Weiner, linux-kernel

On Fri, Jan 25, 2019 at 08:52:11AM +0100, Arkadiusz Miśkiewicz wrote:
> On 24/01/2019 12:21, Arkadiusz Miśkiewicz wrote:
> > On 17/01/2019 14:17, Arkadiusz Miśkiewicz wrote:
> >> On 17/01/2019 13:25, Aleksa Sarai wrote:
> >>> On 2019-01-17, Arkadiusz Miśkiewicz <a.miskiewicz@gmail.com> wrote:
> >>>> Using kernel 4.19.13.
> >>>>
> >>>> For one cgroup I noticed weird behaviour:
> >>>>
> >>>> # cat pids.current
> >>>> 60
> >>>> # cat cgroup.procs
> >>>> #
> >>>
> >>> Are there any zombies in the cgroup? pids.current is linked up directly
> >>> to __put_task_struct (so exit(2) won't decrease it, only the task_struct
> >>> actually being freed will decrease it).
> >>>
> >>
> >> There are no zombie processes.
> >>
> >> In the meantime the problem has shown up on multiple servers, and so
> >> far I have seen it only in cgroups that were OOMed.
> >>
> >> What changed on these servers (yesterday) is turning on
> >> memory.oom.group=1 for all cgroups and changing memory.high from 1G to
> >> "max" (leaving only the memory.max=2G limit).
> >>
> >> Previously there was no such problem.
> >>
> > 
> > I'm attaching a reproducer. This time I tried on a different
> > distribution kernel (Arch Linux).
> > 
> > After 60s pids.current still shows 37 processes even though there are
> > no processes running (according to ps aux).
> 
> 
> The same test on 5.0.0-rc3-00104-gc04e2a780caf makes it easy to
> reproduce the bug: there are no processes in the cgroup, but
> pids.current reports 91.

Can you please see whether the problem can be reproduced on the
current linux-next?

 git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git

Thanks.

-- 
tejun


* Re: pids.current with invalid value for hours [5.0.0 rc3 git]
From: Arkadiusz Miśkiewicz @ 2019-01-25 19:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, Aleksa Sarai, Jay Kamat, Roman Gushchin, Michal Hocko,
	Johannes Weiner, linux-kernel

On 25/01/2019 17:37, Tejun Heo wrote:
> On Fri, Jan 25, 2019 at 08:52:11AM +0100, Arkadiusz Miśkiewicz wrote:
>> On 24/01/2019 12:21, Arkadiusz Miśkiewicz wrote:
>>> On 17/01/2019 14:17, Arkadiusz Miśkiewicz wrote:
>>>> On 17/01/2019 13:25, Aleksa Sarai wrote:
>>>>> On 2019-01-17, Arkadiusz Miśkiewicz <a.miskiewicz@gmail.com> wrote:
>>>>>> Using kernel 4.19.13.
>>>>>>
>>>>>> For one cgroup I noticed weird behaviour:
>>>>>>
>>>>>> # cat pids.current
>>>>>> 60
>>>>>> # cat cgroup.procs
>>>>>> #
>>>>>
>>>>> Are there any zombies in the cgroup? pids.current is linked up directly
>>>>> to __put_task_struct (so exit(2) won't decrease it, only the task_struct
>>>>> actually being freed will decrease it).
>>>>>
>>>>
>>>> There are no zombie processes.
>>>>
>>>> In the meantime the problem has shown up on multiple servers, and so
>>>> far I have seen it only in cgroups that were OOMed.
>>>>
>>>> What changed on these servers (yesterday) is turning on
>>>> memory.oom.group=1 for all cgroups and changing memory.high from 1G to
>>>> "max" (leaving only the memory.max=2G limit).
>>>>
>>>> Previously there was no such problem.
>>>>
>>>
>>> I'm attaching a reproducer. This time I tried on a different
>>> distribution kernel (Arch Linux).
>>>
>>> After 60s pids.current still shows 37 processes even though there are
>>> no processes running (according to ps aux).
>>
>>
>> The same test on 5.0.0-rc3-00104-gc04e2a780caf makes it easy to
>> reproduce the bug: there are no processes in the cgroup, but
>> pids.current reports 91.
> 
> Can you please see whether the problem can be reproduced on the
> current linux-next?
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git

I can reproduce on next (5.0.0-rc3-next-20190125), too:

[root@xps test]# python3 cg.py
Created cgroup: /sys/fs/cgroup/test_2501
Start: pids.current: 0
Start: cgroup.procs:
0: pids.current: 65
0: cgroup.procs:
1: pids.current: 44
1: cgroup.procs:
2: pids.current: 44
2: cgroup.procs:
3: pids.current: 44
3: cgroup.procs:
4: pids.current: 44
4: cgroup.procs:
5: pids.current: 44
5: cgroup.procs:
6: pids.current: 44
6: cgroup.procs:
7: pids.current: 44
7: cgroup.procs:
8: pids.current: 44
8: cgroup.procs:
9: pids.current: 44
9: cgroup.procs:
10: pids.current: 44
10: cgroup.procs:
11: pids.current: 44
11: cgroup.procs:
[root@xps test]# uname -a
Linux xps 5.0.0-rc3-next-20190125 #2 SMP PREEMPT Fri Jan 25 19:11:40 CET
2019 x86_64 Intel(R)_Core(TM)_i9-8950HK_CPU_@_2.90GHz PLD Linux
[root@xps test]# mount |grep cgroup2
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)


I'm booting the kernel with cgroup_no_v1=all.

-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )


* Re: pids.current with invalid value for hours [5.0.0 rc3 git]
From: Tetsuo Handa @ 2019-01-26  1:27 UTC (permalink / raw)
  To: Arkadiusz Miśkiewicz
  Cc: Tejun Heo, cgroups, Aleksa Sarai, Jay Kamat, Roman Gushchin,
	Michal Hocko, Johannes Weiner, linux-kernel, Linus Torvalds

On 2019/01/26 4:47, Arkadiusz Miśkiewicz wrote:
>> Can you please see whether the problem can be reproduced on the
>> current linux-next?
>>
>>  git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
> 
> I can reproduce on next (5.0.0-rc3-next-20190125), too:
> 

Please try this patch.

Subject: [PATCH v2] memcg: killed threads should not invoke memcg OOM killer
From: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
To: Andrew Morton <akpm@linux-foundation.org>,
 Johannes Weiner <hannes@cmpxchg.org>, David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>, linux-mm@kvack.org,
 Kirill Tkhai <ktkhai@virtuozzo.com>,
 Linus Torvalds <torvalds@linux-foundation.org>
Message-ID: <01370f70-e1f6-ebe4-b95e-0df21a0bc15e@i-love.sakura.ne.jp>
Date: Tue, 15 Jan 2019 19:17:27 +0900

If $N > $M, a single process with $N threads in a memcg group can easily
kill all $M processes in that memcg group, for mem_cgroup_out_of_memory()
does not check if current thread needs to invoke the memcg OOM killer.

  T1@P1     |T2...$N@P1|P2...$M   |OOM reaper
  ----------+----------+----------+----------
                        # all sleeping
  try_charge()
    mem_cgroup_out_of_memory()
      mutex_lock(oom_lock)
             try_charge()
               mem_cgroup_out_of_memory()
                 mutex_lock(oom_lock)
      out_of_memory()
        select_bad_process()
        oom_kill_process(P1)
        wake_oom_reaper()
                                   oom_reap_task() # ignores P1
      mutex_unlock(oom_lock)
                 out_of_memory()
                   select_bad_process(P2...$M)
                        # all killed by T2...$N@P1
                   wake_oom_reaper()
                                   oom_reap_task() # ignores P2...$M
                 mutex_unlock(oom_lock)

We don't need to invoke the memcg OOM killer if current thread was killed
when waiting for oom_lock, for mem_cgroup_oom_synchronize(true) can count
on try_charge() when mem_cgroup_oom_synchronize(true) can not make forward
progress because try_charge() allows already killed/exiting threads to
make forward progress, and memory_max_write() can bail out upon signals.

At first Michal thought that fatal signal check is racy compared to
tsk_is_oom_victim() check. But an experiment showed that trying to call
mark_oom_victim() on all killed thread groups is more racy than fatal
signal check due to task_will_free_mem(current) path in out_of_memory().

Therefore, this patch changes mem_cgroup_out_of_memory() to bail out upon
should_force_charge() == T rather than upon fatal_signal_pending() == T,
for should_force_charge() == T && signal_pending(current) == F at
memory_max_write() can't happen because current thread won't call
memory_max_write() after getting PF_EXITING.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Fixes: 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
Fixes: 3100dab2aa09 ("mm: memcontrol: print proper OOM header when no eligible victim left")
Cc: stable@vger.kernel.org # 4.19+
---
 mm/memcontrol.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index af7f18b..79a7d2a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -248,6 +248,12 @@ enum res_type {
 	     iter != NULL;				\
 	     iter = mem_cgroup_iter(NULL, iter, NULL))
 
+static inline bool should_force_charge(void)
+{
+	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
+		(current->flags & PF_EXITING);
+}
+
 /* Some nice accessors for the vmpressure. */
 struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
 {
@@ -1389,8 +1395,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	};
 	bool ret;
 
-	mutex_lock(&oom_lock);
-	ret = out_of_memory(&oc);
+	if (mutex_lock_killable(&oom_lock))
+		return true;
+	/*
+	 * A few threads which were not waiting at mutex_lock_killable() can
+	 * fail to bail out. Therefore, check again after holding oom_lock.
+	 */
+	ret = should_force_charge() || out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
 	return ret;
 }
@@ -2209,9 +2220,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * bypass the last charges so that they can exit quickly and
 	 * free their memory.
 	 */
-	if (unlikely(tsk_is_oom_victim(current) ||
-		     fatal_signal_pending(current) ||
-		     current->flags & PF_EXITING))
+	if (unlikely(should_force_charge()))
 		goto force;
 
 	/*
-- 
1.8.3.1



* Re: pids.current with invalid value for hours [5.0.0 rc3 git]
From: Roman Gushchin @ 2019-01-26  1:41 UTC (permalink / raw)
  To: Arkadiusz Miśkiewicz
  Cc: Tejun Heo, cgroups, Aleksa Sarai, Jay Kamat, Michal Hocko,
	Johannes Weiner, linux-kernel

On Fri, Jan 25, 2019 at 08:47:57PM +0100, Arkadiusz Miśkiewicz wrote:
> On 25/01/2019 17:37, Tejun Heo wrote:
> > On Fri, Jan 25, 2019 at 08:52:11AM +0100, Arkadiusz Miśkiewicz wrote:
> >> On 24/01/2019 12:21, Arkadiusz Miśkiewicz wrote:
> >>> On 17/01/2019 14:17, Arkadiusz Miśkiewicz wrote:
> >>>> On 17/01/2019 13:25, Aleksa Sarai wrote:
> >>>>> On 2019-01-17, Arkadiusz Miśkiewicz <a.miskiewicz@gmail.com> wrote:
> >>>>>> Using kernel 4.19.13.
> >>>>>>
> >>>>>> For one cgroup I noticed weird behaviour:
> >>>>>>
> >>>>>> # cat pids.current
> >>>>>> 60
> >>>>>> # cat cgroup.procs
> >>>>>> #
> >>>>>
> >>>>> Are there any zombies in the cgroup? pids.current is linked up directly
> >>>>> to __put_task_struct (so exit(2) won't decrease it, only the task_struct
> >>>>> actually being freed will decrease it).
> >>>>>
> >>>>
> >>>> There are no zombie processes.
> >>>>
> >>>> In the meantime the problem has shown up on multiple servers, and so
> >>>> far I have seen it only in cgroups that were OOMed.
> >>>>
> >>>> What changed on these servers (yesterday) is turning on
> >>>> memory.oom.group=1 for all cgroups and changing memory.high from 1G to
> >>>> "max" (leaving only the memory.max=2G limit).
> >>>>
> >>>> Previously there was no such problem.
> >>>>
> >>>
> >>> I'm attaching a reproducer. This time I tried on a different
> >>> distribution kernel (Arch Linux).
> >>>
> >>> After 60s pids.current still shows 37 processes even though there are
> >>> no processes running (according to ps aux).
> >>
> >>
> >> The same test on 5.0.0-rc3-00104-gc04e2a780caf makes it easy to
> >> reproduce the bug: there are no processes in the cgroup, but
> >> pids.current reports 91.
> > 
> > Can you please see whether the problem can be reproduced on the
> > current linux-next?
> > 
> >  git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
> 
> I can reproduce on next (5.0.0-rc3-next-20190125), too:

How reliably can you reproduce it? I've tried running your reproducer
several times with different parameters, but haven't been lucky so far.
What are your CPU count and total RAM size?

Can you please provide the corresponding dmesg output?

I've checked the code again, and my wild guess is that these missing
tasks are waiting (maybe hopelessly) for the OOM reaper. The dmesg
output might be very useful here.

Thanks!


* Re: pids.current with invalid value for hours [5.0.0 rc3 git]
From: Arkadiusz Miśkiewicz @ 2019-01-26  2:28 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Tejun Heo, cgroups, Aleksa Sarai, Jay Kamat, Michal Hocko,
	Johannes Weiner, linux-kernel

On 26/01/2019 02:41, Roman Gushchin wrote:
> On Fri, Jan 25, 2019 at 08:47:57PM +0100, Arkadiusz Miśkiewicz wrote:
>> On 25/01/2019 17:37, Tejun Heo wrote:
>>> On Fri, Jan 25, 2019 at 08:52:11AM +0100, Arkadiusz Miśkiewicz wrote:
>>>> On 24/01/2019 12:21, Arkadiusz Miśkiewicz wrote:
>>>>> On 17/01/2019 14:17, Arkadiusz Miśkiewicz wrote:
>>>>>> On 17/01/2019 13:25, Aleksa Sarai wrote:
>>>>>>> On 2019-01-17, Arkadiusz Miśkiewicz <a.miskiewicz@gmail.com> wrote:
>>>>>>>> Using kernel 4.19.13.
>>>>>>>>
>>>>>>>> For one cgroup I noticed weird behaviour:
>>>>>>>>
>>>>>>>> # cat pids.current
>>>>>>>> 60
>>>>>>>> # cat cgroup.procs
>>>>>>>> #
>>>>>>>
>>>>>>> Are there any zombies in the cgroup? pids.current is linked up directly
>>>>>>> to __put_task_struct (so exit(2) won't decrease it, only the task_struct
>>>>>>> actually being freed will decrease it).
>>>>>>>
>>>>>>
>>>>>> There are no zombie processes.
>>>>>>
>>>>>> In the meantime the problem has shown up on multiple servers, and so
>>>>>> far I have seen it only in cgroups that were OOMed.
>>>>>>
>>>>>> What changed on these servers (yesterday) is turning on
>>>>>> memory.oom.group=1 for all cgroups and changing memory.high from 1G to
>>>>>> "max" (leaving only the memory.max=2G limit).
>>>>>>
>>>>>> Previously there was no such problem.
>>>>>>
>>>>>
>>>>> I'm attaching a reproducer. This time I tried on a different
>>>>> distribution kernel (Arch Linux).
>>>>>
>>>>> After 60s pids.current still shows 37 processes even though there are
>>>>> no processes running (according to ps aux).
>>>>
>>>>
>>>> The same test on 5.0.0-rc3-00104-gc04e2a780caf makes it easy to
>>>> reproduce the bug: there are no processes in the cgroup, but
>>>> pids.current reports 91.
>>>
>>> Can you please see whether the problem can be reproduced on the
>>> current linux-next?
>>>
>>>  git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>>
>> I can reproduce on next (5.0.0-rc3-next-20190125), too:
> 
> How reliably can you reproduce it? I've tried running your reproducer
> several times with different parameters, but haven't been lucky so far.

Hmm, I'm able to reproduce each time using that python3 script.


> What are your CPU count and total RAM size?

On different machines:

1) old PC, 1 x Intel E6600, 4GB of RAM (Arch Linux 4.20.3 distro
kernel), using the Python script

2) VirtualBox VM, 1 CPU with 2 cores, 8GB of RAM (PLD kernel and
custom-built kernels like 5.0-rc3 and 5.0-rc3-next), using the Python
script

3) server with 2 x Intel E5405, 32GB of RAM (4.19.13), with a real-life
scenario

4) server with 1 x Intel E5-2630 v2, 64GB of RAM (4.19.15), with a
real-life scenario
> 
> Can you please provide the corresponding dmesg output?

From my VirtualBox VM, after 60s pids.current reports 7 despite there
being no processes:

http://ixion.pld-linux.org/~arekm/cgroup-oom-1.txt

kernel config:
http://ixion.pld-linux.org/~arekm/cgroup-oom-kernelconf-1.txt

Using 5.0.0-rc3-next-20190125 on that vm.

> I've checked the code again, and my wild guess is that these missing
> tasks are waiting (maybe hopelessly) for the OOM reaper. The dmesg
> output might be very useful here.



> 
> Thanks!
> 


-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )


* Re: pids.current with invalid value for hours [5.0.0 rc3 git]
From: Arkadiusz Miśkiewicz @ 2019-01-26  2:41 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Tejun Heo, cgroups, Aleksa Sarai, Jay Kamat, Roman Gushchin,
	Michal Hocko, Johannes Weiner, linux-kernel, Linus Torvalds

On 26/01/2019 02:27, Tetsuo Handa wrote:
> On 2019/01/26 4:47, Arkadiusz Miśkiewicz wrote:
>>> Can you please see whether the problem can be reproduced on the
>>> current linux-next?
>>>
>>>  git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>>
>> I can reproduce on next (5.0.0-rc3-next-20190125), too:
>>
> 
> Please try this patch.

Doesn't help:

[root@xps test]# python3 cg.py
Created cgroup: /sys/fs/cgroup/test_2149
Start: pids.current: 0
Start: cgroup.procs:
0: pids.current: 97
0: cgroup.procs:
1: pids.current: 14
1: cgroup.procs:
2: pids.current: 14
2: cgroup.procs:
3: pids.current: 14
3: cgroup.procs:
4: pids.current: 14
4: cgroup.procs:
5: pids.current: 14
5: cgroup.procs:
6: pids.current: 14
6: cgroup.procs:
7: pids.current: 14
7: cgroup.procs:
8: pids.current: 14
8: cgroup.procs:
9: pids.current: 14
9: cgroup.procs:
10: pids.current: 14
10: cgroup.procs:
11: pids.current: 14
11: cgroup.procs:
[root@xps test]# ps aux|grep python
root      3160  0.0  0.0 234048  2160 pts/2    S+   03:34   0:00 grep python
[root@xps test]# uname -a
Linux xps 5.0.0-rc3-00104-gc04e2a780caf-dirty #289 SMP PREEMPT Sat Jan
26 03:29:45 CET 2019 x86_64 Intel(R)_Core(TM)_i9-8950HK_CPU_@_2.90GHz
PLD Linux


kernel config:
http://ixion.pld-linux.org/~arekm/cgroup-oom-kernelconf-2.txt

dmesg:
http://ixion.pld-linux.org/~arekm/cgroup-oom-2.txt


> [... patch quoted in full, snipped ...]


-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )


* Re: pids.current with invalid value for hours [5.0.0 rc3 git]
From: Tetsuo Handa @ 2019-01-26  6:10 UTC (permalink / raw)
  To: Arkadiusz Miśkiewicz
  Cc: Tejun Heo, cgroups, Aleksa Sarai, Jay Kamat, Roman Gushchin,
	Michal Hocko, Johannes Weiner, linux-kernel, Linus Torvalds

On 2019/01/26 11:41, Arkadiusz Miśkiewicz wrote:
> dmesg:
> http://ixion.pld-linux.org/~arekm/cgroup-oom-2.txt

What is wrong with this output? It seems to me that the OOM killer is
killing processes as soon as python3 fork()s one. Although the
"potentially unexpected fatal signal 7." messages (due to handling
SIGBUS at get_signal() after the OOM reaper reclaimed the OOM victim's
memory) are a bit noisy, I don't think anything is wrong.

  Jan 26 03:33:49 xps kernel: python3 invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=0
  Jan 26 03:33:52 xps kernel: potentially unexpected fatal signal 7.

You seem to be monitoring the cgroup statistics for only 1 minute, but
since the fork bombs might take longer than that, I'm not sure whether
the statistics still show unexpected values after cg.py has terminated
and you have waited for a while (e.g. 1 minute).
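Something along these lines could confirm that (a sketch; pass whatever
test cgroup path cg.py created):

#!/usr/bin/env python3
# Sketch: keep sampling the counter well after cg.py has exited.
import sys, time

cg = sys.argv[1]   # e.g. /sys/fs/cgroup/test_2149 (whatever cg.py created)
for minute in range(10):
    with open(cg + "/pids.current") as f:
        print("minute %d: pids.current = %s" % (minute, f.read().strip()))
    time.sleep(60)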



* Re: pids.current with invalid value for hours [5.0.0 rc3 git]
From: Tetsuo Handa @ 2019-01-26  7:55 UTC (permalink / raw)
  To: Arkadiusz Miśkiewicz
  Cc: Tejun Heo, cgroups, Aleksa Sarai, Jay Kamat, Roman Gushchin,
	Michal Hocko, Johannes Weiner, linux-kernel, Linus Torvalds

On 2019/01/26 15:10, Tetsuo Handa wrote:
> On 2019/01/26 11:41, Arkadiusz Miśkiewicz wrote:
>> dmesg:
>> http://ixion.pld-linux.org/~arekm/cgroup-oom-2.txt

OK. There is a refcount leak bug in wake_oom_reaper() which became
visible once the memory.oom.group setting was enabled.

static void wake_oom_reaper(struct task_struct *tsk)
{
	/* tsk is already queued? */
	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
		return;

	get_task_struct(tsk);

	spin_lock(&oom_reaper_lock);
	tsk->oom_reaper_list = oom_reaper_list;
	oom_reaper_list = tsk;
	spin_unlock(&oom_reaper_lock);
	trace_wake_reaper(tsk->pid);
	wake_up(&oom_reaper_wait);
}

(1) oom_reaper_list and tsk(*)->oom_reaper_list are initially NULL.
(2) Since tsk(2160) != oom_reaper_list && tsk(2160)->oom_reaper_list == NULL,
    get_task_struct(tsk(2160)) is called.
(3) tsk(2160)->oom_reaper_list = oom_reaper_list (which is NULL).
(4) oom_reaper_list = tsk(2160) (which is not NULL).
(5) Since tsk(2150) != oom_reaper_list (which is tsk(2160)) && tsk(2150)->oom_reaper_list == NULL,
    get_task_struct(tsk(2150)) is called.
(6) Step (5) repeats on tsk(2153, 2164, 2166, 2163, 2159, 2161, 2154).
(7) Since tsk(2160) != oom_reaper_list (which is tsk(2154)) && tsk(2160)->oom_reaper_list == NULL
    (because it was NULL as of (2)), get_task_struct(tsk(2160)) is called again.
(8) oom_reap_task() calls put_task_struct(tsk(2160)) only once, because
    tsk(2160) appears on the oom_reaper_list only once; a refcount is leaked.

Jan 26 03:33:49 xps kernel: Memory cgroup out of memory: Kill process 2160 (python3) score 66 or sacrifice child
Jan 26 03:33:49 xps kernel: Killed process 2160 (python3) total-vm:272176kB, anon-rss:31668kB, file-rss:3248kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Tasks in /test_2149 are going to be killed due to memory.oom.group set
Jan 26 03:33:49 xps kernel: Killed process 2150 (python3) total-vm:272176kB, anon-rss:27572kB, file-rss:3248kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2153 (python3) total-vm:261936kB, anon-rss:17396kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2164 (python3) total-vm:272176kB, anon-rss:27572kB, file-rss:3248kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2166 (python3) total-vm:261936kB, anon-rss:15088kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2163 (python3) total-vm:261936kB, anon-rss:21496kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2159 (python3) total-vm:282420kB, anon-rss:23944kB, file-rss:3128kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2161 (python3) total-vm:251692kB, anon-rss:11172kB, file-rss:3060kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2154 (python3) total-vm:261936kB, anon-rss:15084kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2160 (python3) total-vm:272176kB, anon-rss:31668kB, file-rss:3248kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2155 (python3) total-vm:261936kB, anon-rss:17364kB, file-rss:3056kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2168 (python3) total-vm:272176kB, anon-rss:29620kB, file-rss:3108kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2156 (python3) total-vm:251692kB, anon-rss:11204kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2157 (python3) total-vm:251692kB, anon-rss:11172kB, file-rss:3088kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2169 (python3) total-vm:261936kB, anon-rss:19448kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2158 (python3) total-vm:272176kB, anon-rss:23944kB, file-rss:3192kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2170 (python3) total-vm:251692kB, anon-rss:13252kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2171 (python3) total-vm:261936kB, anon-rss:17396kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2173 (python3) total-vm:251692kB, anon-rss:11204kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2152 (python3) total-vm:261936kB, anon-rss:13724kB, file-rss:3100kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2174 (python3) total-vm:251692kB, anon-rss:11160kB, file-rss:3120kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2172 (python3) total-vm:251692kB, anon-rss:9128kB, file-rss:3116kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2187 (python3) total-vm:261936kB, anon-rss:23516kB, file-rss:3116kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2184 (python3) total-vm:251692kB, anon-rss:13500kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2151 (python3) total-vm:251692kB, anon-rss:9108kB, file-rss:3096kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2182 (python3) total-vm:251692kB, anon-rss:12196kB, file-rss:3140kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2175 (python3) total-vm:261936kB, anon-rss:13996kB, file-rss:3124kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2183 (python3) total-vm:251692kB, anon-rss:9124kB, file-rss:3112kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2177 (python3) total-vm:251692kB, anon-rss:9148kB, file-rss:3140kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2186 (python3) total-vm:261936kB, anon-rss:15336kB, file-rss:3152kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2189 (python3) total-vm:251692kB, anon-rss:9156kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2178 (python3) total-vm:251692kB, anon-rss:11204kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2185 (python3) total-vm:261936kB, anon-rss:15084kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2190 (python3) total-vm:251692kB, anon-rss:11204kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2179 (python3) total-vm:251692kB, anon-rss:9156kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2180 (python3) total-vm:251692kB, anon-rss:5060kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2181 (python3) total-vm:261936kB, anon-rss:21492kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2176 (python3) total-vm:261936kB, anon-rss:15348kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2188 (python3) total-vm:251692kB, anon-rss:7108kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2191 (python3) total-vm:261936kB, anon-rss:14556kB, file-rss:3188kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: Killed process 2192 (python3) total-vm:241448kB, anon-rss:3456kB, file-rss:2712kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2163 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2192 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2191 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2188 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2176 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2181 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2180 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2179 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2190 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2185 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2178 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2189 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2186 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2177 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2175 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2182 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2151 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2184 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2187 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2172 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2174 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2173 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2171 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2158 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2169 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2157 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2156 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2168 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2155 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2160 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2154 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:49 xps kernel: oom_reaper: reaped process 2161 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Jan 26 03:33:50 xps kernel: python3 invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=0
Jan 26 03:33:50 xps kernel: CPU: 1 PID: 2217 Comm: python3 Tainted: G            E   T 5.0.0-rc3-00104-gc04e2a780caf-dirty #289

Since we assumed that wake_oom_reaper() would be called only once per
out_of_memory() request, the check

	/* tsk is already queued? */
	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
		return;

no longer works.
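
A toy model of that window, just to make steps (1)-(7) concrete (Task
stands in for task_struct and refs for the get/put pairing; this is an
illustration, not kernel code):

# Toy model of the broken "already queued?" test in wake_oom_reaper().
# A task enqueued while the list was empty keeps its next pointer NULL,
# so once other tasks are pushed in front of it the guard passes again
# and the reference (modeled as a counter) is taken a second time.
class Task:
    def __init__(self, pid):
        self.pid = pid
        self.next = None   # models tsk->oom_reaper_list
        self.refs = 0      # models get_task_struct()/put_task_struct()

head = None                # models oom_reaper_list

def wake_oom_reaper(tsk):
    global head
    if tsk is head or tsk.next is not None:   # the guard quoted above
        return
    tsk.refs += 1          # get_task_struct(tsk)
    tsk.next = head
    head = tsk

t2160, t2150 = Task(2160), Task(2150)
wake_oom_reaper(t2160)   # step (2): list empty, so t2160.next stays None
wake_oom_reaper(t2150)   # step (5): t2150 becomes the new head
wake_oom_reaper(t2160)   # step (7): guard passes again -> queued twice
print(t2160.refs)        # 2: one more get than the single put in step (8)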


* Re: pids.current with invalid value for hours [5.0.0 rc3 git]
From: Tetsuo Handa @ 2019-01-26 11:09 UTC (permalink / raw)
  To: Arkadiusz Miśkiewicz
  Cc: Tejun Heo, cgroups, Aleksa Sarai, Jay Kamat, Roman Gushchin,
	Michal Hocko, Johannes Weiner, linux-kernel, Linus Torvalds

Arkadiusz, will you try this patch?

From 48744b6339cf649a69b55997e138c17df1ecc897 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 26 Jan 2019 20:00:51 +0900
Subject: [PATCH] oom, oom_reaper: do not enqueue same task twice

Arkadiusz reported that enabling memcg's group oom killing causes
strange memcg statistics where there is no task in a memcg even though
the number of tasks in that memcg is not 0. It turned out that there
is a bug in wake_oom_reaper() which allows enqueuing the same task
twice, which makes it impossible to decrease the number of tasks in
that memcg due to a refcount leak.

This bug has existed since the OOM reaper became invokable from the
task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
but memcg's group oom killing made it easier to trigger by calling
wake_oom_reaper() on the same task from one out_of_memory() request.

Fix this bug using the approach of commit 855b018325737f76
("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task").
Since task_will_free_mem(p) == false if p->mm == NULL, we can assume that
p->mm != NULL when wake_oom_reaper() is called from task_will_free_mem()
paths. As a side effect, this patch also avoids enqueuing multiple
threads sharing memory via the task_will_free_mem(current) path.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Arkadiusz Miśkiewicz <a.miskiewicz@gmail.com>
Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
---
 mm/oom_kill.c | 28 +++++++++++++---------------
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f0e8cd9..457f240 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -505,14 +505,6 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 	struct vm_area_struct *vma;
 	bool ret = true;
 
-	/*
-	 * Tell all users of get_user/copy_from_user etc... that the content
-	 * is no longer stable. No barriers really needed because unmapping
-	 * should imply barriers already and the reader would hit a page fault
-	 * if it stumbled over a reaped memory.
-	 */
-	set_bit(MMF_UNSTABLE, &mm->flags);
-
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
 		if (!can_madv_dontneed_vma(vma))
 			continue;
@@ -645,10 +637,15 @@ static int oom_reaper(void *unused)
 	return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+static void wake_oom_reaper(struct task_struct *tsk, struct mm_struct *mm)
 {
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+	/*
+	 * Tell all users of get_user/copy_from_user etc... that the content
+	 * is no longer stable. No barriers really needed because unmapping
+	 * should imply barriers already and the reader would hit a page fault
+	 * if it stumbled over a reaped memory.
+	 */
+	if (test_and_set_bit(MMF_UNSTABLE, &mm->flags))
 		return;
 
 	get_task_struct(tsk);
@@ -668,7 +665,8 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static inline void wake_oom_reaper(struct task_struct *tsk)
+static inline void wake_oom_reaper(struct task_struct *tsk,
+				   struct mm_struct *mm)
 {
 }
 #endif /* CONFIG_MMU */
@@ -915,7 +913,7 @@ static void __oom_kill_process(struct task_struct *victim)
 	rcu_read_unlock();
 
 	if (can_oom_reap)
-		wake_oom_reaper(victim);
+		wake_oom_reaper(victim, mm);
 
 	mmdrop(mm);
 	put_task_struct(victim);
@@ -955,7 +953,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	task_lock(p);
 	if (task_will_free_mem(p)) {
 		mark_oom_victim(p);
-		wake_oom_reaper(p);
+		wake_oom_reaper(p, p->mm);
 		task_unlock(p);
 		put_task_struct(p);
 		return;
@@ -1085,7 +1083,7 @@ bool out_of_memory(struct oom_control *oc)
 	 */
 	if (task_will_free_mem(current)) {
 		mark_oom_victim(current);
-		wake_oom_reaper(current);
+		wake_oom_reaper(current, current->mm);
 		return true;
 	}
 
-- 
1.8.3.1



* Re: pids.current with invalid value for hours [5.0.0 rc3 git]
From: Arkadiusz Miśkiewicz @ 2019-01-26 11:29 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Tejun Heo, cgroups, Aleksa Sarai, Jay Kamat, Roman Gushchin,
	Michal Hocko, Johannes Weiner, linux-kernel, Linus Torvalds

On 26/01/2019 12:09, Tetsuo Handa wrote:
> Arkadiusz, will you try this patch?


Works. Several tries, and pids.current always drops to 0 after ~1s.

Please use

Reported-by: Arkadiusz Miśkiewicz <arekm@maven.pl>

(using gmail for transport only since vger postmasters aren't reasonable)

And also
Tested-by: Arkadiusz Miśkiewicz <arekm@maven.pl>

Testing was done on:
5.0.0-rc3-00104-gc04e2a780caf-dirty

Thanks!

> [... v1 patch quoted in full, snipped ...]


-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )


* [PATCH v2] oom, oom_reaper: do not enqueue same task twice
From: Tetsuo Handa @ 2019-01-26 13:10 UTC (permalink / raw)
  To: Arkadiusz Miśkiewicz, Andrew Morton
  Cc: Tejun Heo, cgroups, Aleksa Sarai, Jay Kamat, Roman Gushchin,
	Michal Hocko, Johannes Weiner, linux-kernel, Linus Torvalds,
	linux-mm

On 2019/01/26 20:29, Arkadiusz Miśkiewicz wrote:
> On 26/01/2019 12:09, Tetsuo Handa wrote:
>> Arkadiusz, will you try this patch?
> 
> 
> Works. Several tries and always getting 0 pids.current after ~1s.
> 

Thank you for testing.

I updated the patch to use tsk->signal->oom_mm (a snapshot of tsk->mm
saved by mark_oom_victim(tsk)) rather than raw tsk->mm, so that we
don't need to worry about the possibility of tsk->mm changing across
multiple wake_oom_reaper(tsk) calls.



From 9c9e935fc038342c48461aabca666f1b544e32b1 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 26 Jan 2019 21:57:25 +0900
Subject: [PATCH v2] oom, oom_reaper: do not enqueue same task twice

Arkadiusz reported that enabling memcg's group oom killing causes
strange memcg statistics where there is no task in a memcg even though
the number of tasks in that memcg is not 0. It turned out that there
is a bug in wake_oom_reaper() which allows enqueuing the same task
twice, which makes it impossible to decrease the number of tasks in
that memcg due to a refcount leak.

This bug has existed since the OOM reaper became invokable from the
task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
but memcg's group oom killing made it easier to trigger by calling
wake_oom_reaper() on the same task from one out_of_memory() request.

Fix this bug using the approach of commit 855b018325737f76
("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task").
As a side effect, this patch also avoids enqueuing multiple threads
sharing memory via the task_will_free_mem(current) path.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Arkadiusz Miśkiewicz <arekm@maven.pl>
Tested-by: Arkadiusz Miśkiewicz <arekm@maven.pl>
Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
---
 mm/oom_kill.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f0e8cd9..057bfee 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -505,14 +505,6 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 	struct vm_area_struct *vma;
 	bool ret = true;
 
-	/*
-	 * Tell all users of get_user/copy_from_user etc... that the content
-	 * is no longer stable. No barriers really needed because unmapping
-	 * should imply barriers already and the reader would hit a page fault
-	 * if it stumbled over a reaped memory.
-	 */
-	set_bit(MMF_UNSTABLE, &mm->flags);
-
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
 		if (!can_madv_dontneed_vma(vma))
 			continue;
@@ -647,8 +639,13 @@ static int oom_reaper(void *unused)
 
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+	/*
+	 * Tell all users of get_user/copy_from_user etc... that the content
+	 * is no longer stable. No barriers really needed because unmapping
+	 * should imply barriers already and the reader would hit a page fault
+	 * if it stumbled over a reaped memory.
+	 */
+	if (test_and_set_bit(MMF_UNSTABLE, &tsk->signal->oom_mm->flags))
 		return;
 
 	get_task_struct(tsk);
-- 
1.8.3.1



* Re: [PATCH v2] oom, oom_reaper: do not enqueue same task twice
From: Michal Hocko @ 2019-01-27  8:37 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Arkadiusz Miśkiewicz, Andrew Morton, Tejun Heo, cgroups,
	Aleksa Sarai, Jay Kamat, Roman Gushchin, Johannes Weiner,
	linux-kernel, Linus Torvalds, linux-mm

On Sat 26-01-19 22:10:52, Tetsuo Handa wrote:
[...]
> From 9c9e935fc038342c48461aabca666f1b544e32b1 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sat, 26 Jan 2019 21:57:25 +0900
> Subject: [PATCH v2] oom, oom_reaper: do not enqueue same task twice
> 
> Arkadiusz reported that enabling memcg's group oom killing causes
> strange memcg statistics where there is no task in a memcg even though
> the number of tasks in that memcg is not 0. It turned out that there
> is a bug in wake_oom_reaper() which allows enqueuing the same task
> twice, which makes it impossible to decrease the number of tasks in
> that memcg due to a refcount leak.
> 
> This bug has existed since the OOM reaper became invokable from the
> task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
> but memcg's group oom killing made it easier to trigger by calling
> wake_oom_reaper() on the same task from one out_of_memory() request.
> 
> Fix this bug using the approach of commit 855b018325737f76
> ("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task").
> As a side effect, this patch also avoids enqueuing multiple threads
> sharing memory via the task_will_free_mem(current) path.

Thanks for the analysis and the patch. This should work, I believe, but
I am not really thrilled to overload the meaning of MMF_UNSTABLE.
The flag is meant to signal that accessing the address space is not
stable; it is not aimed at synchronizing the oom reaper with the oom
path.

Can we use mark_oom_victim directly? I haven't thought that through
fully, so I might be missing something, but this should prevent
repeated queueing as well.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f0e8cd9edb1a..dac4f2197e53 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -690,7 +690,7 @@ static void mark_oom_victim(struct task_struct *tsk)
 	WARN_ON(oom_killer_disabled);
 	/* OOM killer might race with memcg OOM */
 	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
-		return;
+		return false;
 
 	/* oom_mm is bound to the signal struct life time. */
 	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) {
@@ -707,6 +707,8 @@ static void mark_oom_victim(struct task_struct *tsk)
 	__thaw_task(tsk);
 	atomic_inc(&oom_victims);
 	trace_mark_victim(tsk->pid);
+
+	return true;
 }
 
 /**
@@ -873,7 +875,7 @@ static void __oom_kill_process(struct task_struct *victim)
 	 * reserves from the user space under its control.
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
-	mark_oom_victim(victim);
+	can_oom_reap = mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -954,8 +956,8 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	 */
 	task_lock(p);
 	if (task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		wake_oom_reaper(p);
+		if (mark_oom_victim(p))
+			wake_oom_reaper(p);
 		task_unlock(p);
 		put_task_struct(p);
 		return;
@@ -1084,8 +1086,8 @@ bool out_of_memory(struct oom_control *oc)
 	 * quickly exit and free its memory.
 	 */
 	if (task_will_free_mem(current)) {
-		mark_oom_victim(current);
-		wake_oom_reaper(current);
+		if (mark_oom_victim(current))
+			wake_oom_reaper(current);
 		return true;
 	}
 
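
For the sketch above to compile, mark_oom_victim()'s return type would
also need to change from void to bool; roughly the following one-liner,
omitted from the hunks above for brevity:

-static void mark_oom_victim(struct task_struct *tsk)
+static bool mark_oom_victim(struct task_struct *tsk)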
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] oom, oom_reaper: do not enqueue same task twice
  2019-01-27  8:37                           ` Michal Hocko
@ 2019-01-27 10:56                             ` Tetsuo Handa
  2019-01-27 11:40                               ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Tetsuo Handa @ 2019-01-27 10:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Arkadiusz Miśkiewicz, Andrew Morton, Tejun Heo, cgroups,
	Aleksa Sarai, Jay Kamat, Roman Gushchin, Johannes Weiner,
	linux-kernel, Linus Torvalds, linux-mm

On 2019/01/27 17:37, Michal Hocko wrote:
> Thanks for the analysis and the patch. This should work, I believe, but
> I am not really thrilled to overload the meaning of MMF_UNSTABLE.
> The flag is meant to signal that accessing the address space is not
> stable; it is not aimed at synchronizing the oom reaper with the oom path.
> 
> Can we make use of mark_oom_victim directly? I didn't get to think that
> through right now, so I might be missing something, but this should
> prevent repeated queueing as well.

Yes, TIF_MEMDIE would work. But you are planning to remove TIF_MEMDIE. Also,
TIF_MEMDIE can't avoid enqueuing many threads sharing an mm_struct to the OOM
reaper. There is no need to enqueue many threads sharing an mm_struct because
the OOM reaper acts on the mm_struct rather than the task_struct. Thus,
enqueuing based on a per-mm_struct flag sounds better, but MMF_OOM_VICTIM
cannot be set from wake_oom_reaper(victim) because the victim's mm might
already be inside exit_mmap() by the time wake_oom_reaper(victim) is called
after task_unlock(victim).

We could reintroduce MMF_OOM_KILLED from commit 855b018325737f76
("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task")
if you don't like overloading the meaning of MMF_UNSTABLE. But since
MMF_UNSTABLE is available in Linux 4.9+ kernels (which covers all LTS stable
versions with OOM reaper support), we can temporarily use MMF_UNSTABLE
for ease of backporting.
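
For illustration only, a dedicated per-mm flag (the name
MMF_OOM_REAP_QUEUED below is a placeholder) would collapse the duplicate
check in wake_oom_reaper() to a single atomic test-and-set on the
victim's mm, roughly:

static void wake_oom_reaper(struct task_struct *tsk)
{
	/* mm is already queued? */
	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
		return;

	get_task_struct(tsk);

	/* The tail below is the existing wake_oom_reaper() body, unchanged. */
	spin_lock(&oom_reaper_lock);
	tsk->oom_reaper_list = oom_reaper_list;
	oom_reaper_list = tsk;
	spin_unlock(&oom_reaper_lock);
	trace_wake_reaper(tsk->pid);
	wake_up(&oom_reaper_wait);
}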


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2] oom, oom_reaper: do not enqueue same task twice
  2019-01-27 10:56                             ` Tetsuo Handa
@ 2019-01-27 11:40                               ` Michal Hocko
  2019-01-27 14:57                                 ` [PATCH v3] " Tetsuo Handa
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2019-01-27 11:40 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Arkadiusz Miśkiewicz, Andrew Morton, Tejun Heo, cgroups,
	Aleksa Sarai, Jay Kamat, Roman Gushchin, Johannes Weiner,
	linux-kernel, Linus Torvalds, linux-mm

On Sun 27-01-19 19:56:06, Tetsuo Handa wrote:
> On 2019/01/27 17:37, Michal Hocko wrote:
> > Thanks for the analysis and the patch. This should work, I believe, but
> > I am not really thrilled to overload the meaning of MMF_UNSTABLE.
> > The flag is meant to signal that accessing the address space is not
> > stable; it is not aimed at synchronizing the oom reaper with the oom path.
> > 
> > Can we make use of mark_oom_victim directly? I didn't get to think that
> > through right now, so I might be missing something, but this should
> > prevent repeated queueing as well.
> 
> Yes, TIF_MEMDIE would work. But you are planning to remove TIF_MEMDIE. Also,
> TIF_MEMDIE can't avoid enqueuing many threads sharing an mm_struct to the OOM
> reaper. There is no need to enqueue many threads sharing an mm_struct because
> the OOM reaper acts on the mm_struct rather than the task_struct. Thus,
> enqueuing based on a per-mm_struct flag sounds better, but MMF_OOM_VICTIM
> cannot be set from wake_oom_reaper(victim) because the victim's mm might
> already be inside exit_mmap() by the time wake_oom_reaper(victim) is called
> after task_unlock(victim).
>
> We could reintroduce MMF_OOM_KILLED from commit 855b018325737f76
> ("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task")
> if you don't like overloading the meaning of MMF_UNSTABLE. But since
> MMF_UNSTABLE is available in Linux 4.9+ kernels (which covers all LTS stable
> versions with OOM reaper support), we can temporarily use MMF_UNSTABLE
> for ease of backporting.

I agree that a per-mm state is more optimal, but I would rather fix the
issue in a clear way first and only then think about an optimization on
top. Queueing based on mark_oom_victim (whatever that uses to guarantee
the victim is marked atomically and only once) makes sense from the
conceptual point of view, and it makes a lot of sense to start from
there. MMF_UNSTABLE has a completely different purpose. So unless you
see a correctness issue with that, I would rather go that way.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v3] oom, oom_reaper: do not enqueue same task twice
  2019-01-27 11:40                               ` Michal Hocko
@ 2019-01-27 14:57                                 ` Tetsuo Handa
  2019-01-27 16:58                                   ` Michal Hocko
                                                     ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Tetsuo Handa @ 2019-01-27 14:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Arkadiusz Miśkiewicz, Andrew Morton, Tejun Heo, cgroups,
	Aleksa Sarai, Jay Kamat, Roman Gushchin, Johannes Weiner,
	linux-kernel, Linus Torvalds, linux-mm

On 2019/01/27 20:40, Michal Hocko wrote:
> On Sun 27-01-19 19:56:06, Tetsuo Handa wrote:
>> On 2019/01/27 17:37, Michal Hocko wrote:
>>> Thanks for the analysis and the patch. This should work, I believe, but
>>> I am not really thrilled to overload the meaning of MMF_UNSTABLE.
>>> The flag is meant to signal that accessing the address space is not
>>> stable; it is not aimed at synchronizing the oom reaper with the oom path.
>>>
>>> Can we make use of mark_oom_victim directly? I didn't get to think that
>>> through right now, so I might be missing something, but this should
>>> prevent repeated queueing as well.
>>
>> Yes, TIF_MEMDIE would work. But you are planning to remove TIF_MEMDIE. Also,
>> TIF_MEMDIE can't avoid enqueuing many threads sharing an mm_struct to the OOM
>> reaper. There is no need to enqueue many threads sharing an mm_struct because
>> the OOM reaper acts on the mm_struct rather than the task_struct. Thus,
>> enqueuing based on a per-mm_struct flag sounds better, but MMF_OOM_VICTIM
>> cannot be set from wake_oom_reaper(victim) because the victim's mm might
>> already be inside exit_mmap() by the time wake_oom_reaper(victim) is called
>> after task_unlock(victim).
>>
>> We could reintroduce MMF_OOM_KILLED from commit 855b018325737f76
>> ("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task")
>> if you don't like overloading the meaning of MMF_UNSTABLE. But since
>> MMF_UNSTABLE is available in Linux 4.9+ kernels (which covers all LTS stable
>> versions with OOM reaper support), we can temporarily use MMF_UNSTABLE
>> for ease of backporting.
> 
> I agree that a per-mm state is more optimal, but I would rather fix the
> issue in a clear way first and only then think about an optimization on
> top. Queueing based on mark_oom_victim (whatever that uses to guarantee
> the victim is marked atomically and only once) makes sense from the
> conceptual point of view, and it makes a lot of sense to start from
> there. MMF_UNSTABLE has a completely different purpose. So unless you
> see a correctness issue with that, I would rather go that way.
> 

Then, adding a per-mm_struct flag is better. I don't see the difference
between reusing MMF_UNSTABLE as a flag for whether wake_oom_reaper() for
that victim's memory was already called (what you see as an overload)
and reusing TIF_MEMDIE as a flag for whether wake_oom_reaper() for that
victim thread can be called (what I see as an overload). We want to
remove TIF_MEMDIE, and we can actually remove TIF_MEMDIE if you stop
the whack-a-mole "can you observe it in a real workload/program?" game.
I don't see a correctness issue with TIF_MEMDIE, but I don't want to go
the TIF_MEMDIE way.
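
One more note on the per-mm flag approach in the patch below:
dereferencing tsk->signal->oom_mm in the new check is safe because
mark_oom_victim() pins the mm to the life time of the signal struct
before wake_oom_reaper() is called, roughly:

	/* oom_mm is bound to the signal struct life time. */
	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) {
		mmgrab(tsk->signal->oom_mm);
		set_bit(MMF_OOM_VICTIM, &mm->flags);
	}

so the mm (and thus its flags word) cannot go away while the victim sits
on the reaper list.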



From 9c9e935fc038342c48461aabca666f1b544e32b1 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sun, 27 Jan 2019 23:51:37 +0900
Subject: [PATCH v3] oom, oom_reaper: do not enqueue same task twice

Arkadiusz reported that enabling memcg's group oom killing causes
strange memcg statistics where there is no task in a memcg even though
the number of tasks in that memcg is not 0. It turned out that there
is a bug in wake_oom_reaper() which allows enqueuing the same task
twice, which makes it impossible to decrease the number of tasks in
that memcg due to a refcount leak.

This bug has existed since the OOM reaper became invokable from the
task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
but memcg's group oom killing made it easier to trigger by calling
wake_oom_reaper() on the same task from one out_of_memory() request.

Fix this bug using the approach used by commit 855b018325737f76
("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task").
As a side effect, this patch also avoids enqueuing multiple threads
sharing memory via the task_will_free_mem(current) path.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Arkadiusz Miśkiewicz <arekm@maven.pl>
Tested-by: Arkadiusz Miśkiewicz <arekm@maven.pl>
Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
---
 include/linux/sched/coredump.h | 1 +
 mm/oom_kill.c                  | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index ec912d0..ecdc654 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -71,6 +71,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HUGE_ZERO_PAGE	23      /* mm has ever used the global huge zero page */
 #define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
 #define MMF_OOM_VICTIM		25	/* mm is the oom victim */
+#define MMF_OOM_REAP_QUEUED	26	/* mm was queued for oom_reaper */
 #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f0e8cd9..059e617 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -647,8 +647,8 @@ static int oom_reaper(void *unused)
 
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+	/* mm is already queued? */
+	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
 		return;
 
 	get_task_struct(tsk);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v3] oom, oom_reaper: do not enqueue same task twice
  2019-01-27 14:57                                 ` [PATCH v3] " Tetsuo Handa
@ 2019-01-27 16:58                                   ` Michal Hocko
  2019-01-27 23:00                                   ` Roman Gushchin
                                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2019-01-27 16:58 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Arkadiusz Miśkiewicz, Andrew Morton, Tejun Heo, cgroups,
	Aleksa Sarai, Jay Kamat, Roman Gushchin, Johannes Weiner,
	linux-kernel, Linus Torvalds, linux-mm

On Sun 27-01-19 23:57:38, Tetsuo Handa wrote:
[...]
> >From 9c9e935fc038342c48461aabca666f1b544e32b1 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sun, 27 Jan 2019 23:51:37 +0900
> Subject: [PATCH v3] oom, oom_reaper: do not enqueue same task twice
> 
> Arkadiusz reported that enabling memcg's group oom killing causes
> strange memcg statistics where there is no task in a memcg even though
> the number of tasks in that memcg is not 0. It turned out that there
> is a bug in wake_oom_reaper() which allows enqueuing the same task
> twice, which makes it impossible to decrease the number of tasks in
> that memcg due to a refcount leak.
> 
> This bug has existed since the OOM reaper became invokable from the
> task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
> but memcg's group oom killing made it easier to trigger by calling
> wake_oom_reaper() on the same task from one out_of_memory() request.
> 
> Fix this bug using the approach used by commit 855b018325737f76
> ("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task").
> As a side effect, this patch also avoids enqueuing multiple threads
> sharing memory via the task_will_free_mem(current) path.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Reported-by: Arkadiusz Miśkiewicz <arekm@maven.pl>
> Tested-by: Arkadiusz Miśkiewicz <arekm@maven.pl>
> Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/sched/coredump.h | 1 +
>  mm/oom_kill.c                  | 4 ++--
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
> index ec912d0..ecdc654 100644
> --- a/include/linux/sched/coredump.h
> +++ b/include/linux/sched/coredump.h
> @@ -71,6 +71,7 @@ static inline int get_dumpable(struct mm_struct *mm)
>  #define MMF_HUGE_ZERO_PAGE	23      /* mm has ever used the global huge zero page */
>  #define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
>  #define MMF_OOM_VICTIM		25	/* mm is the oom victim */
> +#define MMF_OOM_REAP_QUEUED	26	/* mm was queued for oom_reaper */
>  #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
>  
>  #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index f0e8cd9..059e617 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -647,8 +647,8 @@ static int oom_reaper(void *unused)
>  
>  static void wake_oom_reaper(struct task_struct *tsk)
>  {
> -	/* tsk is already queued? */
> -	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
> +	/* mm is already queued? */
> +	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
>  		return;
>  
>  	get_task_struct(tsk);
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3] oom, oom_reaper: do not enqueue same task twice
  2019-01-27 14:57                                 ` [PATCH v3] " Tetsuo Handa
  2019-01-27 16:58                                   ` Michal Hocko
@ 2019-01-27 23:00                                   ` Roman Gushchin
  2019-01-28 18:15                                   ` Andrew Morton
  2019-01-28 21:53                                   ` Johannes Weiner
  3 siblings, 0 replies; 22+ messages in thread
From: Roman Gushchin @ 2019-01-27 23:00 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Michal Hocko, Arkadiusz Miśkiewicz, Andrew Morton,
	Tejun Heo, cgroups, Aleksa Sarai, Jay Kamat, Johannes Weiner,
	linux-kernel, Linus Torvalds, linux-mm

On Sun, Jan 27, 2019 at 11:57:38PM +0900, Tetsuo Handa wrote:
> On 2019/01/27 20:40, Michal Hocko wrote:
> > On Sun 27-01-19 19:56:06, Tetsuo Handa wrote:
> >> On 2019/01/27 17:37, Michal Hocko wrote:
> >>> Thanks for the analysis and the patch. This should work, I believe, but
> >>> I am not really thrilled to overload the meaning of MMF_UNSTABLE.
> >>> The flag is meant to signal that accessing the address space is not
> >>> stable; it is not aimed at synchronizing the oom reaper with the oom path.
> >>>
> >>> Can we make use of mark_oom_victim directly? I didn't get to think that
> >>> through right now, so I might be missing something, but this should
> >>> prevent repeated queueing as well.
> >>
> >> Yes, TIF_MEMDIE would work. But you are planning to remove TIF_MEMDIE. Also,
> >> TIF_MEMDIE can't avoid enqueuing many threads sharing an mm_struct to the OOM
> >> reaper. There is no need to enqueue many threads sharing an mm_struct because
> >> the OOM reaper acts on the mm_struct rather than the task_struct. Thus,
> >> enqueuing based on a per-mm_struct flag sounds better, but MMF_OOM_VICTIM
> >> cannot be set from wake_oom_reaper(victim) because the victim's mm might
> >> already be inside exit_mmap() by the time wake_oom_reaper(victim) is called
> >> after task_unlock(victim).
> >>
> >> We could reintroduce MMF_OOM_KILLED from commit 855b018325737f76
> >> ("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task")
> >> if you don't like overloading the meaning of MMF_UNSTABLE. But since
> >> MMF_UNSTABLE is available in Linux 4.9+ kernels (which covers all LTS stable
> >> versions with OOM reaper support), we can temporarily use MMF_UNSTABLE
> >> for ease of backporting.
> > 
> > I agree that a per-mm state is more optimal, but I would rather fix the
> > issue in a clear way first and only then think about an optimization on
> > top. Queueing based on mark_oom_victim (whatever that uses to guarantee
> > the victim is marked atomically and only once) makes sense from the
> > conceptual point of view, and it makes a lot of sense to start from
> > there. MMF_UNSTABLE has a completely different purpose. So unless you
> > see a correctness issue with that, I would rather go that way.
> > 
> 
> Then, adding a per-mm_struct flag is better. I don't see the difference
> between reusing MMF_UNSTABLE as a flag for whether wake_oom_reaper() for
> that victim's memory was already called (what you see as an overload)
> and reusing TIF_MEMDIE as a flag for whether wake_oom_reaper() for that
> victim thread can be called (what I see as an overload). We want to
> remove TIF_MEMDIE, and we can actually remove TIF_MEMDIE if you stop
> the whack-a-mole "can you observe it in a real workload/program?" game.
> I don't see a correctness issue with TIF_MEMDIE, but I don't want to go
> the TIF_MEMDIE way.
> 
> 
> 
> From 9c9e935fc038342c48461aabca666f1b544e32b1 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sun, 27 Jan 2019 23:51:37 +0900
> Subject: [PATCH v3] oom, oom_reaper: do not enqueue same task twice
> 
> Arkadiusz reported that enabling memcg's group oom killing causes
> strange memcg statistics where there is no task in a memcg even though
> the number of tasks in that memcg is not 0. It turned out that there
> is a bug in wake_oom_reaper() which allows enqueuing the same task
> twice, which makes it impossible to decrease the number of tasks in
> that memcg due to a refcount leak.
> 
> This bug has existed since the OOM reaper became invokable from the
> task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
> but memcg's group oom killing made it easier to trigger by calling
> wake_oom_reaper() on the same task from one out_of_memory() request.
> 
> Fix this bug using the approach used by commit 855b018325737f76
> ("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task").
> As a side effect, this patch also avoids enqueuing multiple threads
> sharing memory via the task_will_free_mem(current) path.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Reported-by: Arkadiusz Miśkiewicz <arekm@maven.pl>
> Tested-by: Arkadiusz Miśkiewicz <arekm@maven.pl>
> Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")

Thank you, Tetsuo!

Acked-by: Roman Gushchin <guro@fb.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3] oom, oom_reaper: do not enqueue same task twice
  2019-01-27 14:57                                 ` [PATCH v3] " Tetsuo Handa
  2019-01-27 16:58                                   ` Michal Hocko
  2019-01-27 23:00                                   ` Roman Gushchin
@ 2019-01-28 18:15                                   ` Andrew Morton
  2019-01-28 18:42                                     ` Michal Hocko
  2019-01-28 21:53                                   ` Johannes Weiner
  3 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2019-01-28 18:15 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Michal Hocko, Arkadiusz Miśkiewicz, Tejun Heo, cgroups,
	Aleksa Sarai, Jay Kamat, Roman Gushchin, Johannes Weiner,
	linux-kernel, Linus Torvalds, linux-mm

On Sun, 27 Jan 2019 23:57:38 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:

> Arkadiusz reported that enabling memcg's group oom killing causes
> strange memcg statistics where there is no task in a memcg even though
> the number of tasks in that memcg is not 0. It turned out that there
> is a bug in wake_oom_reaper() which allows enqueuing the same task
> twice, which makes it impossible to decrease the number of tasks in
> that memcg due to a refcount leak.
> 
> This bug has existed since the OOM reaper became invokable from the
> task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
> but memcg's group oom killing made it easier to trigger by calling
> wake_oom_reaper() on the same task from one out_of_memory() request.
> 
> Fix this bug using the approach used by commit 855b018325737f76
> ("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task").
> As a side effect, this patch also avoids enqueuing multiple threads
> sharing memory via the task_will_free_mem(current) path.
> 

Do we think this is serious enough to warrant a -stable backport?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3] oom, oom_reaper: do not enqueue same task twice
  2019-01-28 18:15                                   ` Andrew Morton
@ 2019-01-28 18:42                                     ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2019-01-28 18:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tetsuo Handa, Arkadiusz Miśkiewicz, Tejun Heo, cgroups,
	Aleksa Sarai, Jay Kamat, Roman Gushchin, Johannes Weiner,
	linux-kernel, Linus Torvalds, linux-mm

On Mon 28-01-19 10:15:13, Andrew Morton wrote:
> On Sun, 27 Jan 2019 23:57:38 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
> 
> > Arkadiusz reported that enabling memcg's group oom killing causes
> > strange memcg statistics where there is no task in a memcg even though
> > the number of tasks in that memcg is not 0. It turned out that there
> > is a bug in wake_oom_reaper() which allows enqueuing the same task
> > twice, which makes it impossible to decrease the number of tasks in
> > that memcg due to a refcount leak.
> > 
> > This bug has existed since the OOM reaper became invokable from the
> > task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
> > but memcg's group oom killing made it easier to trigger by calling
> > wake_oom_reaper() on the same task from one out_of_memory() request.
> > 
> > Fix this bug using the approach used by commit 855b018325737f76
> > ("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task").
> > As a side effect, this patch also avoids enqueuing multiple threads
> > sharing memory via the task_will_free_mem(current) path.
> > 
> 
> Do we think this is serious enough to warrant a -stable backport?

Yes, I would go with a stable backport.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3] oom, oom_reaper: do not enqueue same task twice
  2019-01-27 14:57                                 ` [PATCH v3] " Tetsuo Handa
                                                     ` (2 preceding siblings ...)
  2019-01-28 18:15                                   ` Andrew Morton
@ 2019-01-28 21:53                                   ` Johannes Weiner
  2019-01-29 10:34                                     ` Tetsuo Handa
  3 siblings, 1 reply; 22+ messages in thread
From: Johannes Weiner @ 2019-01-28 21:53 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Michal Hocko, Arkadiusz Miśkiewicz, Andrew Morton,
	Tejun Heo, cgroups, Aleksa Sarai, Jay Kamat, Roman Gushchin,
	linux-kernel, Linus Torvalds, linux-mm

Hi Tetsuo,

On Sun, Jan 27, 2019 at 11:57:38PM +0900, Tetsuo Handa wrote:
> From 9c9e935fc038342c48461aabca666f1b544e32b1 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sun, 27 Jan 2019 23:51:37 +0900
> Subject: [PATCH v3] oom, oom_reaper: do not enqueue same task twice
> 
> Arkadiusz reported that enabling memcg's group oom killing causes
> strange memcg statistics where there is no task in a memcg even though
> the number of tasks in that memcg is not 0. It turned out that there
> is a bug in wake_oom_reaper() which allows enqueuing the same task
> twice, which makes it impossible to decrease the number of tasks in
> that memcg due to a refcount leak.
> 
> This bug has existed since the OOM reaper became invokable from the
> task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
> but memcg's group oom killing made it easier to trigger by calling
> wake_oom_reaper() on the same task from one out_of_memory() request.

This changelog seems a little terse compared to how tricky this is.

Can you please include an explanation here of *how* this bug is possible?
I.e. the race condition that causes the function to be entered twice
and the existing re-entrance check in there to fail.

> Fix this bug using an approach used by commit 855b018325737f76
> ("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task").
> As a side effect of this patch, this patch also avoids enqueuing
> multiple threads sharing memory via task_will_free_mem(current) path.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3] oom, oom_reaper: do not enqueue same task twice
  2019-01-28 21:53                                   ` Johannes Weiner
@ 2019-01-29 10:34                                     ` Tetsuo Handa
  0 siblings, 0 replies; 22+ messages in thread
From: Tetsuo Handa @ 2019-01-29 10:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Arkadiusz Miśkiewicz, Andrew Morton,
	Tejun Heo, cgroups, Aleksa Sarai, Jay Kamat, Roman Gushchin,
	linux-kernel, Linus Torvalds, linux-mm

Johannes Weiner wrote:
> On Sun, Jan 27, 2019 at 11:57:38PM +0900, Tetsuo Handa wrote:
> > This bug has existed since the OOM reaper became invokable from the
> > task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
> > but memcg's group oom killing made it easier to trigger by calling
> > wake_oom_reaper() on the same task from one out_of_memory() request.
> 
> This changelog seems a little terse compared to how tricky this is.
> 
> Can you please include an explanation here of *how* this bug is possible?
> I.e. the race condition that causes the function to be entered twice
> and the existing re-entrance check in there to fail.

OK. Here is an updated patch. Only the changelog part has changed.
I hope this will provide enough information to stable kernel maintainers.
----------
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Subject: oom, oom_reaper: do not enqueue same task twice

Arkadiusz reported that enabling memcg's group oom killing causes strange
memcg statistics where there is no task in a memcg even though the number
of tasks in that memcg is not 0.  It turned out that there is a bug in
wake_oom_reaper() which allows enqueuing the same task twice, which makes
it impossible to decrease the number of tasks in that memcg due to a
refcount leak.

This bug has existed since the OOM reaper became invokable from the
task_will_free_mem(current) path in out_of_memory() in Linux 4.7,

  T1@P1     |T2@P1     |T3@P1     |OOM reaper
  ----------+----------+----------+------------
                                   # Processing an OOM victim in a different memcg domain.
                        try_charge()
                          mem_cgroup_out_of_memory()
                            mutex_lock(&oom_lock)
             try_charge()
               mem_cgroup_out_of_memory()
                 mutex_lock(&oom_lock)
  try_charge()
    mem_cgroup_out_of_memory()
      mutex_lock(&oom_lock)
                            out_of_memory()
                              oom_kill_process(P1)
                                do_send_sig_info(SIGKILL, @P1)
                                mark_oom_victim(T1@P1)
                                wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
                            mutex_unlock(&oom_lock)
                 out_of_memory()
                   mark_oom_victim(T2@P1)
                   wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
                 mutex_unlock(&oom_lock)
      out_of_memory()
        mark_oom_victim(T1@P1)
        wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
      mutex_unlock(&oom_lock)
                                   # Completed processing an OOM victim in a different memcg domain.
                                   spin_lock(&oom_reaper_lock)
                                   # T1@P1 is dequeued.
                                   spin_unlock(&oom_reaper_lock)

but memcg's group oom killing made it easier to trigger by calling
wake_oom_reaper() on the same task from one out_of_memory() request.

Fix this bug using the approach used by commit 855b018325737f76 ("oom,
oom_reaper: disable oom_reaper for oom_kill_allocating_task").  As a side
effect, this patch also avoids enqueuing multiple threads sharing memory
via the task_will_free_mem(current) path.

Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Arkadiusz Miśkiewicz <arekm@maven.pl>
Tested-by: Arkadiusz Miśkiewicz <arekm@maven.pl>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Aleksa Sarai <asarai@suse.de>
Cc: Jay Kamat <jgkamat@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/sched/coredump.h | 1 +
 mm/oom_kill.c                  | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index ec912d0..ecdc654 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -71,6 +71,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HUGE_ZERO_PAGE	23      /* mm has ever used the global huge zero page */
 #define MMF_DISABLE_THP		24	/* disable THP for all VMAs */
 #define MMF_OOM_VICTIM		25	/* mm is the oom victim */
+#define MMF_OOM_REAP_QUEUED	26	/* mm was queued for oom_reaper */
 #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f0e8cd9..059e617 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -647,8 +647,8 @@ static int oom_reaper(void *unused)
 
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+	/* mm is already queued? */
+	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
 		return;
 
 	get_task_struct(tsk);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2019-01-29 10:34 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <df806a77-3327-9db5-8be2-976fde1c84e5@gmail.com>
     [not found] ` <20190117122535.njcbqhlmzozdkncw@mikami>
     [not found]   ` <1d36b181-cbaf-6694-1a31-2f7f55d15675@gmail.com>
     [not found]     ` <96ef6615-a5df-30af-b4dc-417a18ca63f1@gmail.com>
2019-01-25  7:52       ` pids.current with invalid value for hours [5.0.0 rc3 git] Arkadiusz Miśkiewicz
2019-01-25 16:37         ` Tejun Heo
2019-01-25 19:47           ` Arkadiusz Miśkiewicz
2019-01-26  1:27             ` Tetsuo Handa
2019-01-26  2:41               ` Arkadiusz Miśkiewicz
2019-01-26  6:10                 ` Tetsuo Handa
2019-01-26  7:55                   ` Tetsuo Handa
2019-01-26 11:09                     ` Tetsuo Handa
2019-01-26 11:29                       ` Arkadiusz Miśkiewicz
2019-01-26 13:10                         ` [PATCH v2] oom, oom_reaper: do not enqueue same task twice Tetsuo Handa
2019-01-27  8:37                           ` Michal Hocko
2019-01-27 10:56                             ` Tetsuo Handa
2019-01-27 11:40                               ` Michal Hocko
2019-01-27 14:57                                 ` [PATCH v3] " Tetsuo Handa
2019-01-27 16:58                                   ` Michal Hocko
2019-01-27 23:00                                   ` Roman Gushchin
2019-01-28 18:15                                   ` Andrew Morton
2019-01-28 18:42                                     ` Michal Hocko
2019-01-28 21:53                                   ` Johannes Weiner
2019-01-29 10:34                                     ` Tetsuo Handa
2019-01-26  1:41             ` pids.current with invalid value for hours [5.0.0 rc3 git] Roman Gushchin
2019-01-26  2:28               ` Arkadiusz Miśkiewicz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).