All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm
       [not found] <CGME20181005063208epcms1p22959cd2f771ad017996e2b18266791ea@epcms1p2>
@ 2018-10-05  6:32 ` Yong-Taek Lee
  2018-10-09  6:23   ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Yong-Taek Lee @ 2018-10-05  6:32 UTC (permalink / raw)
  To: mhocko, linux-mm, linux-kernel; +Cc: Yong-Taek Lee

[-- Attachment #1: Type: text/html, Size: 5193 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm
  2018-10-05  6:32 ` [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm Yong-Taek Lee
@ 2018-10-09  6:23   ` Michal Hocko
  2018-10-09  8:03     ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2018-10-09  6:23 UTC (permalink / raw)
  To: Yong-Taek Lee; +Cc: linux-mm, linux-kernel, Oleg Nesterov

[Cc Oleg]

On Fri 05-10-18 15:32:08, Yong-Taek Lee wrote:
> It is introduced by commit 44a70adec910 ("mm, oom_adj: make sure
> processes sharing mm have same view of oom_score_adj"). Most of
> user process's mm_users is bigger than 1 but only one thread group.
> In this case, for_each_process loop meaninglessly try to find processes
> which sharing same mm even though there is only one thread group.
> 
> My idea is that target task's nr thread is smaller than mm_users if there
> are more thread groups sharing the same mm. So we can skip loop

I remember trying to optimize this but ended up with nothing that would
work reliable. E.g. what prevents a thread terminating right after we
read mm reference count and result in early break and other process
not being updated properly?

> if mm_user and nr_thread are same.
> 
> test result
> while true; do count=0; time while [ $count -lt 10000 ]; do echo -1000 > /proc/
> 1457/oom_score_adj; count=$((count+1)); done; done;

Is this overhead noticeable in a real work usecases though? Or are you
updating oom_score_adj that often really?

> before patch
> 0m00.59s real     0m00.09s user     0m00.51s system
> 0m00.59s real     0m00.14s user     0m00.45s system
> 0m00.58s real     0m00.11s user     0m00.47s system
> 0m00.58s real     0m00.10s user     0m00.48s system
> 0m00.59s real     0m00.11s user     0m00.48s system
> 
> after patch
> 0m00.15s real     0m00.07s user     0m00.08s system
> 0m00.14s real     0m00.10s user     0m00.04s system
> 0m00.14s real     0m00.10s user     0m00.05s system
> 0m00.14s real     0m00.08s user     0m00.07s system
> 0m00.14s real     0m00.08s user     0m00.07s system
> 
> Signed-off-by: Lee YongTaek <ytk.lee@samsung.com>
> ---
>  fs/proc/base.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index f9f72aee6d45..54b2fb5e9c51 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj,
> bool legacy)
>         struct mm_struct *mm = NULL;
>         struct task_struct *task;
>         int err = 0;
> +       int mm_users = 0;
> 
>         task = get_proc_task(file_inode(file));
>         if (!task)
> @@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj,
> bool legacy)
>                 struct task_struct *p = find_lock_task_mm(task);
> 
>                 if (p) {
> -                       if (atomic_read(&p->mm->mm_users) > 1) {
> +                       mm_users = atomic_read(&p->mm->mm_users);
> +                       if ((mm_users > 1) && (mm_users != get_nr_threads(p)))
> {
>                                 mm = p->mm;
>                                 atomic_inc(&mm->mm_count);
>                         }
> --
> 
> *

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm
  2018-10-09  6:23   ` Michal Hocko
@ 2018-10-09  8:03     ` Michal Hocko
  0 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2018-10-09  8:03 UTC (permalink / raw)
  To: Yong-Taek Lee; +Cc: linux-mm, linux-kernel, Oleg Nesterov

On Tue 09-10-18 08:23:30, Michal Hocko wrote:
> [Cc Oleg]

JFYI there was new submission http://lkml.kernel.org/r/20181008011931epcms1p82dd01b7e5c067ea99946418bc97de46a@epcms1p8
> 
> On Fri 05-10-18 15:32:08, Yong-Taek Lee wrote:
> > It is introduced by commit 44a70adec910 ("mm, oom_adj: make sure
> > processes sharing mm have same view of oom_score_adj"). Most of
> > user process's mm_users is bigger than 1 but only one thread group.
> > In this case, for_each_process loop meaninglessly try to find processes
> > which sharing same mm even though there is only one thread group.
> > 
> > My idea is that target task's nr thread is smaller than mm_users if there
> > are more thread groups sharing the same mm. So we can skip loop
> 
> I remember trying to optimize this but ended up with nothing that would
> work reliable. E.g. what prevents a thread terminating right after we
> read mm reference count and result in early break and other process
> not being updated properly?
> 
> > if mm_user and nr_thread are same.
> > 
> > test result
> > while true; do count=0; time while [ $count -lt 10000 ]; do echo -1000 > /proc/
> > 1457/oom_score_adj; count=$((count+1)); done; done;
> 
> Is this overhead noticeable in a real work usecases though? Or are you
> updating oom_score_adj that often really?
> 
> > before patch
> > 0m00.59s real     0m00.09s user     0m00.51s system
> > 0m00.59s real     0m00.14s user     0m00.45s system
> > 0m00.58s real     0m00.11s user     0m00.47s system
> > 0m00.58s real     0m00.10s user     0m00.48s system
> > 0m00.59s real     0m00.11s user     0m00.48s system
> > 
> > after patch
> > 0m00.15s real     0m00.07s user     0m00.08s system
> > 0m00.14s real     0m00.10s user     0m00.04s system
> > 0m00.14s real     0m00.10s user     0m00.05s system
> > 0m00.14s real     0m00.08s user     0m00.07s system
> > 0m00.14s real     0m00.08s user     0m00.07s system
> > 
> > Signed-off-by: Lee YongTaek <ytk.lee@samsung.com>
> > ---
> >  fs/proc/base.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index f9f72aee6d45..54b2fb5e9c51 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj,
> > bool legacy)
> >         struct mm_struct *mm = NULL;
> >         struct task_struct *task;
> >         int err = 0;
> > +       int mm_users = 0;
> > 
> >         task = get_proc_task(file_inode(file));
> >         if (!task)
> > @@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj,
> > bool legacy)
> >                 struct task_struct *p = find_lock_task_mm(task);
> > 
> >                 if (p) {
> > -                       if (atomic_read(&p->mm->mm_users) > 1) {
> > +                       mm_users = atomic_read(&p->mm->mm_users);
> > +                       if ((mm_users > 1) && (mm_users != get_nr_threads(p)))
> > {
> >                                 mm = p->mm;
> >                                 atomic_inc(&mm->mm_count);
> >                         }
> > --
> > 
> > *
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm
  2018-10-09 13:51                             ` Tetsuo Handa
@ 2018-10-09 14:09                               ` Michal Hocko
  0 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2018-10-09 14:09 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: ytk.lee, linux-mm, linux-kernel, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Andrew Morton, Linus Torvalds

On Tue 09-10-18 22:51:00, Tetsuo Handa wrote:
> On 2018/10/09 22:26, Michal Hocko wrote:
> > On Tue 09-10-18 22:14:24, Tetsuo Handa wrote:
> >> On 2018/10/09 21:58, Michal Hocko wrote:
> >>> On Tue 09-10-18 21:52:12, Tetsuo Handa wrote:
> >>>> On 2018/10/09 20:10, Michal Hocko wrote:
> >>>>> On Tue 09-10-18 19:00:44, Tetsuo Handa wrote:
> >>>>>>> 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
> >>>>>>> reap the mm in the rare case of the race.
> >>>>>>
> >>>>>> That is no problem. The mistake we made in 4.6 was that we updated oom_score_adj
> >>>>>> to -1000 (and allowed unprivileged users to OOM-lockup the system).
> >>>>>
> >>>>> I do not follow.
> >>>>>
> >>>>
> >>>> http://tomoyo.osdn.jp/cgi-bin/lxr/source/mm/oom_kill.c?v=linux-4.6.7#L493
> >>>
> >>> Ahh, so you are not referring to the current upstream code. Do you see
> >>> any specific problem with the current one (well, except for the possible
> >>> race which I have tried to evaluate).
> >>>
> >>
> >> Yes. "task_will_free_mem(current) in out_of_memory() returns false due to MMF_OOM_SKIP
> >> being already set" is a problem for clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND)
> >> with the current code.
> > 
> > a) I fail to see how that is related to your previous post and b) could
> > you be more specific. Is there any other scenario from the two described
> > in my earlier email?
> > 
> 
> I do not follow. Just reverting commit 44a70adec910d692 and commit 97fd49c2355ffded
> is sufficient for closing the copy_process() versus __set_oom_adj() race.

Please go back and see why this has been done in the first place.

> We went too far towards complete "struct mm_struct" based OOM handling. But stepping
> back to "struct signal_struct" based OOM handling solves Yong-Taek's for_each_process()
> latency problem and your copy_process() versus __set_oom_adj() race problem and my
> task_will_free_mem(current) race problem.

And again, I have put an evaluation of the race and try to see what is
the effect. Then you have started to fire hard to follow notes and it is
not clear whether the analysis/conclusions is wrong/incomplete.

So an we get back to that analysis and stick to the topic please?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm
  2018-10-09 13:26                           ` Michal Hocko
@ 2018-10-09 13:51                             ` Tetsuo Handa
  2018-10-09 14:09                               ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Tetsuo Handa @ 2018-10-09 13:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: ytk.lee, linux-mm, linux-kernel, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Andrew Morton, Linus Torvalds

On 2018/10/09 22:26, Michal Hocko wrote:
> On Tue 09-10-18 22:14:24, Tetsuo Handa wrote:
>> On 2018/10/09 21:58, Michal Hocko wrote:
>>> On Tue 09-10-18 21:52:12, Tetsuo Handa wrote:
>>>> On 2018/10/09 20:10, Michal Hocko wrote:
>>>>> On Tue 09-10-18 19:00:44, Tetsuo Handa wrote:
>>>>>>> 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
>>>>>>> reap the mm in the rare case of the race.
>>>>>>
>>>>>> That is no problem. The mistake we made in 4.6 was that we updated oom_score_adj
>>>>>> to -1000 (and allowed unprivileged users to OOM-lockup the system).
>>>>>
>>>>> I do not follow.
>>>>>
>>>>
>>>> http://tomoyo.osdn.jp/cgi-bin/lxr/source/mm/oom_kill.c?v=linux-4.6.7#L493
>>>
>>> Ahh, so you are not referring to the current upstream code. Do you see
>>> any specific problem with the current one (well, except for the possible
>>> race which I have tried to evaluate).
>>>
>>
>> Yes. "task_will_free_mem(current) in out_of_memory() returns false due to MMF_OOM_SKIP
>> being already set" is a problem for clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND)
>> with the current code.
> 
> a) I fail to see how that is related to your previous post and b) could
> you be more specific. Is there any other scenario from the two described
> in my earlier email?
> 

I do not follow. Just reverting commit 44a70adec910d692 and commit 97fd49c2355ffded
is sufficient for closing the copy_process() versus __set_oom_adj() race.

We went too far towards complete "struct mm_struct" based OOM handling. But stepping
back to "struct signal_struct" based OOM handling solves Yong-Taek's for_each_process()
latency problem and your copy_process() versus __set_oom_adj() race problem and my
task_will_free_mem(current) race problem.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm
  2018-10-09 13:14                         ` Tetsuo Handa
@ 2018-10-09 13:26                           ` Michal Hocko
  2018-10-09 13:51                             ` Tetsuo Handa
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2018-10-09 13:26 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: ytk.lee, linux-mm, linux-kernel, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Andrew Morton, Linus Torvalds

On Tue 09-10-18 22:14:24, Tetsuo Handa wrote:
> On 2018/10/09 21:58, Michal Hocko wrote:
> > On Tue 09-10-18 21:52:12, Tetsuo Handa wrote:
> >> On 2018/10/09 20:10, Michal Hocko wrote:
> >>> On Tue 09-10-18 19:00:44, Tetsuo Handa wrote:
> >>>>> 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
> >>>>> reap the mm in the rare case of the race.
> >>>>
> >>>> That is no problem. The mistake we made in 4.6 was that we updated oom_score_adj
> >>>> to -1000 (and allowed unprivileged users to OOM-lockup the system).
> >>>
> >>> I do not follow.
> >>>
> >>
> >> http://tomoyo.osdn.jp/cgi-bin/lxr/source/mm/oom_kill.c?v=linux-4.6.7#L493
> > 
> > Ahh, so you are not referring to the current upstream code. Do you see
> > any specific problem with the current one (well, except for the possible
> > race which I have tried to evaluate).
> > 
> 
> Yes. "task_will_free_mem(current) in out_of_memory() returns false due to MMF_OOM_SKIP
> being already set" is a problem for clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND)
> with the current code.

a) I fail to see how that is related to your previous post and b) could
you be more specific. Is there any other scenario from the two described
in my earlier email?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm
  2018-10-09 12:58                       ` Michal Hocko
@ 2018-10-09 13:14                         ` Tetsuo Handa
  2018-10-09 13:26                           ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Tetsuo Handa @ 2018-10-09 13:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: ytk.lee, linux-mm, linux-kernel, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Andrew Morton, Linus Torvalds

On 2018/10/09 21:58, Michal Hocko wrote:
> On Tue 09-10-18 21:52:12, Tetsuo Handa wrote:
>> On 2018/10/09 20:10, Michal Hocko wrote:
>>> On Tue 09-10-18 19:00:44, Tetsuo Handa wrote:
>>>>> 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
>>>>> reap the mm in the rare case of the race.
>>>>
>>>> That is no problem. The mistake we made in 4.6 was that we updated oom_score_adj
>>>> to -1000 (and allowed unprivileged users to OOM-lockup the system).
>>>
>>> I do not follow.
>>>
>>
>> http://tomoyo.osdn.jp/cgi-bin/lxr/source/mm/oom_kill.c?v=linux-4.6.7#L493
> 
> Ahh, so you are not referring to the current upstream code. Do you see
> any specific problem with the current one (well, except for the possible
> race which I have tried to evaluate).
> 

Yes. "task_will_free_mem(current) in out_of_memory() returns false due to MMF_OOM_SKIP
being already set" is a problem for clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND)
with the current code.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm
  2018-10-09 12:52                     ` Tetsuo Handa
@ 2018-10-09 12:58                       ` Michal Hocko
  2018-10-09 13:14                         ` Tetsuo Handa
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2018-10-09 12:58 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: ytk.lee, linux-mm, linux-kernel, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Andrew Morton, Linus Torvalds

On Tue 09-10-18 21:52:12, Tetsuo Handa wrote:
> On 2018/10/09 20:10, Michal Hocko wrote:
> > On Tue 09-10-18 19:00:44, Tetsuo Handa wrote:
> >>> 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
> >>> reap the mm in the rare case of the race.
> >>
> >> That is no problem. The mistake we made in 4.6 was that we updated oom_score_adj
> >> to -1000 (and allowed unprivileged users to OOM-lockup the system).
> > 
> > I do not follow.
> > 
> 
> http://tomoyo.osdn.jp/cgi-bin/lxr/source/mm/oom_kill.c?v=linux-4.6.7#L493

Ahh, so you are not referring to the current upstream code. Do you see
any specific problem with the current one (well, except for the possible
race which I have tried to evaluate).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm
  2018-10-09 11:10                   ` Michal Hocko
@ 2018-10-09 12:52                     ` Tetsuo Handa
  2018-10-09 12:58                       ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Tetsuo Handa @ 2018-10-09 12:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: ytk.lee, linux-mm, linux-kernel, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Andrew Morton, Linus Torvalds

On 2018/10/09 20:10, Michal Hocko wrote:
> On Tue 09-10-18 19:00:44, Tetsuo Handa wrote:
>>> 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
>>> reap the mm in the rare case of the race.
>>
>> That is no problem. The mistake we made in 4.6 was that we updated oom_score_adj
>> to -1000 (and allowed unprivileged users to OOM-lockup the system).
> 
> I do not follow.
> 

http://tomoyo.osdn.jp/cgi-bin/lxr/source/mm/oom_kill.c?v=linux-4.6.7#L493

[  177.722853] a.out invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[  177.724956] a.out cpuset=/ mems_allowed=0
[  177.725735] CPU: 3 PID: 3962 Comm: a.out Not tainted 4.5.0-rc2-next-20160204 #291
(...snipped...)
[  177.802889] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
(...snipped...)
[  177.872248] [ 3941]  1000  3941    28880      124      14       3        0             0 bash
[  177.874279] [ 3962]  1000  3962   541717   395780     784       6        0             0 a.out
[  177.876274] [ 3963]  1000  3963     1078       21       7       3        0          1000 a.out
[  177.878261] [ 3964]  1000  3964     1078       21       7       3        0          1000 a.out
[  177.880194] [ 3965]  1000  3965     1078       21       7       3        0          1000 a.out
[  177.882262] Out of memory: Kill process 3963 (a.out) score 998 or sacrifice child
[  177.884129] Killed process 3963 (a.out) total-vm:4312kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  177.887100] oom_reaper: reaped process :3963 (a.out) anon-rss:0kB, file-rss:0kB, shmem-rss:0lB
[  179.638399] crond invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
[  179.647708] crond cpuset=/ mems_allowed=0
[  179.652996] CPU: 3 PID: 742 Comm: crond Not tainted 4.5.0-rc2-next-20160204 #291
(...snipped...)
[  179.771311] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
(...snipped...)
[  179.836221] [ 3941]  1000  3941    28880      124      14       3        0             0 bash
[  179.838278] [ 3962]  1000  3962   541717   396308     785       6        0             0 a.out
[  179.840328] [ 3963]  1000  3963     1078        0       7       3        0         -1000 a.out
[  179.842443] [ 3965]  1000  3965     1078        0       7       3        0          1000 a.out
[  179.844557] Out of memory: Kill process 3965 (a.out) score 998 or sacrifice child
[  179.846404] Killed process 3965 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm
  2018-10-09 10:00                 ` Tetsuo Handa
@ 2018-10-09 11:10                   ` Michal Hocko
  2018-10-09 12:52                     ` Tetsuo Handa
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2018-10-09 11:10 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: ytk.lee, linux-mm, linux-kernel, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Andrew Morton, Linus Torvalds

On Tue 09-10-18 19:00:44, Tetsuo Handa wrote:
> On 2018/10/09 16:50, Michal Hocko wrote:
[...]
> >                               Well, that is unfortunate indeed and it
> > breaks the OOM_SCORE_ADJ_MIN contract. There are basically two ways here
> > 1) do not care and encourage users to use a saner way to set
> > OOM_SCORE_ADJ_MIN because doing that externally is racy anyway e.g.
> > setting it before [v]fork & exec. Btw. do we know about an actual user
> > who would care?
> 
> I'm not talking about [v]fork & exec. Why are you talking about [v]fork & exec ?

Because that is the only raceless way to set your oom_score_adj.

> > 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
> > reap the mm in the rare case of the race.
> 
> That is no problem. The mistake we made in 4.6 was that we updated oom_score_adj
> to -1000 (and allowed unprivileged users to OOM-lockup the system).

I do not follow.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm
  2018-10-09  7:50               ` Michal Hocko
@ 2018-10-09 10:00                 ` Tetsuo Handa
  2018-10-09 11:10                   ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Tetsuo Handa @ 2018-10-09 10:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: ytk.lee, linux-mm, linux-kernel, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Andrew Morton, Linus Torvalds

On 2018/10/09 16:50, Michal Hocko wrote:
> On Tue 09-10-18 08:35:41, Michal Hocko wrote:
>> [I have only now noticed that the patch has been reposted]
>>
>> On Mon 08-10-18 18:27:39, Tetsuo Handa wrote:
>>> On 2018/10/08 17:38, Yong-Taek Lee wrote:
> [...]
>>>> Thank you for your suggestion. But i think it would be better to seperate to 2 issues. How about think these
>>>> issues separately because there are no dependency between race issue and my patch. As i already explained,
>>>> for_each_process path is meaningless if there is only one thread group with many threads(mm_users > 1 but 
>>>> no other thread group sharing same mm). Do you have any other idea to avoid meaningless loop ? 
>>>
>>> Yes. I suggest reverting commit 44a70adec910d692 ("mm, oom_adj: make sure processes
>>> sharing mm have same view of oom_score_adj") and commit 97fd49c2355ffded ("mm, oom:
>>> kill all tasks sharing the mm").
>>
>> This would require a lot of other work for something as border line as
>> weird threading model like this. I will think about something more
>> appropriate - e.g. we can take mmap_sem for read while doing this check
>> and that should prevent from races with [v]fork.
> 
> Not really. We do not even take the mmap_sem when CLONE_VM. So this is
> not the way. Doing a proper synchronization seems much harder. So let's
> consider what is the worst case scenario. We would basically hit a race
> window between copy_signal and copy_mm and the only relevant case would
> be OOM_SCORE_ADJ_MIN which wouldn't propagate to the new "thread".

The "between copy_signal() and copy_mm()" race window is merely whether we need
to run for_each_process() loop. The race window is much larger than that; it is
between "copy_signal() copies oom_score_adj/oom_score_adj_min" and "the created
thread becomes accessible from for_each_process() loop".

>                                                                    OOM
> killer could then pick up the "thread" and kill it along with the whole
> process group sharing the mm.

Just reverting commit 44a70adec910d692 and commit 97fd49c2355ffded is
sufficient.

>                               Well, that is unfortunate indeed and it
> breaks the OOM_SCORE_ADJ_MIN contract. There are basically two ways here
> 1) do not care and encourage users to use a saner way to set
> OOM_SCORE_ADJ_MIN because doing that externally is racy anyway e.g.
> setting it before [v]fork & exec. Btw. do we know about an actual user
> who would care?

I'm not talking about [v]fork & exec. Why are you talking about [v]fork & exec ?

> 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
> reap the mm in the rare case of the race.

That is no problem. The mistake we made in 4.6 was that we updated oom_score_adj
to -1000 (and allowed unprivileged users to OOM-lockup the system). Now that we set
MMF_OOM_SKIP, there is no need to worry about "oom_score_adj != -1000" thread group
and "oom_score_adj == -1000" thread group sharing the same mm. Since updating
oom_score_adj to -1000 is a privileged operation, it is administrator's wish if
such case happened; the kernel should respect the administrator's wish.

> 
> I would prefer the firs but if this race really has to be addressed then
> the 2 sounds more reasonable than the wholesale revert.
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm
  2018-10-09  6:35             ` Michal Hocko
@ 2018-10-09  7:50               ` Michal Hocko
  2018-10-09 10:00                 ` Tetsuo Handa
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2018-10-09  7:50 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: ytk.lee, linux-mm, linux-kernel, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Andrew Morton, Linus Torvalds

On Tue 09-10-18 08:35:41, Michal Hocko wrote:
> [I have only now noticed that the patch has been reposted]
> 
> On Mon 08-10-18 18:27:39, Tetsuo Handa wrote:
> > On 2018/10/08 17:38, Yong-Taek Lee wrote:
[...]
> > > Thank you for your suggestion. But i think it would be better to seperate to 2 issues. How about think these
> > > issues separately because there are no dependency between race issue and my patch. As i already explained,
> > > for_each_process path is meaningless if there is only one thread group with many threads(mm_users > 1 but 
> > > no other thread group sharing same mm). Do you have any other idea to avoid meaningless loop ? 
> > 
> > Yes. I suggest reverting commit 44a70adec910d692 ("mm, oom_adj: make sure processes
> > sharing mm have same view of oom_score_adj") and commit 97fd49c2355ffded ("mm, oom:
> > kill all tasks sharing the mm").
> 
> This would require a lot of other work for something as border line as
> weird threading model like this. I will think about something more
> appropriate - e.g. we can take mmap_sem for read while doing this check
> and that should prevent from races with [v]fork.

Not really. We do not even take the mmap_sem when CLONE_VM. So this is
not the way. Doing a proper synchronization seems much harder. So let's
consider what is the worst case scenario. We would basically hit a race
window between copy_signal and copy_mm and the only relevant case would
be OOM_SCORE_ADJ_MIN which wouldn't propagate to the new "thread". OOM
killer could then pick up the "thread" and kill it along with the whole
process group sharing the mm. Well, that is unfortunate indeed and it
breaks the OOM_SCORE_ADJ_MIN contract. There are basically two ways here
1) do not care and encourage users to use a saner way to set
OOM_SCORE_ADJ_MIN because doing that externally is racy anyway e.g.
setting it before [v]fork & exec. Btw. do we know about an actual user
who would care?
2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
reap the mm in the rare case of the race.

I would prefer the firs but if this race really has to be addressed then
the 2 sounds more reasonable than the wholesale revert.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm
  2018-10-08  9:27           ` Tetsuo Handa
@ 2018-10-09  6:35             ` Michal Hocko
  2018-10-09  7:50               ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2018-10-09  6:35 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: ytk.lee, linux-mm, linux-kernel, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Andrew Morton, Linus Torvalds

[I have only now noticed that the patch has been reposted]

On Mon 08-10-18 18:27:39, Tetsuo Handa wrote:
> On 2018/10/08 17:38, Yong-Taek Lee wrote:
> >>
> >> On 2018/10/08 15:14, Yong-Taek Lee wrote:
> >>>> On 2018/10/08 10:19, Yong-Taek Lee wrote:
> >>>>> @@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> >>>>>         struct mm_struct *mm = NULL;
> >>>>>         struct task_struct *task;
> >>>>>         int err = 0;
> >>>>> +       int mm_users = 0;
> >>>>>
> >>>>>         task = get_proc_task(file_inode(file));
> >>>>>         if (!task)
> >>>>> @@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> >>>>>                 struct task_struct *p = find_lock_task_mm(task);
> >>>>>
> >>>>>                 if (p) {
> >>>>> -                       if (atomic_read(&p->mm->mm_users) > 1) {
> >>>>> +                       mm_users = atomic_read(&p->mm->mm_users);
> >>>>> +                       if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {
> >>>>
> >>>> How can this work (even before this patch)? When clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND)
> >>>> is requested, copy_process() calls copy_signal() in order to copy sig->oom_score_adj and
> >>>> sig->oom_score_adj_min before calling copy_mm() in order to increment mm->mm_users, doesn't it?
> >>>> Then, we will get two different "struct signal_struct" with different oom_score_adj/oom_score_adj_min
> >>>> but one "struct mm_struct" shared by two thread groups.
> >>>>
> >>>
> >>> Are you talking about race between __set_oom_adj and copy_process?
> >>> If so, i agree with your opinion. It can not set oom_score_adj properly for copied process if __set_oom_adj
> >>> check mm_users before copy_process calls copy_mm after copy_signal. Please correct me if i misunderstood anything.
> >>
> >> You understand it correctly.
> >>
> >> Reversing copy_signal() and copy_mm() is not sufficient either. We need to use a read/write lock
> >> (read lock for copy_process() and write lock for __set_oom_adj()) in order to make sure that
> >> the thread created by clone() becomes reachable from for_each_process() path in __set_oom_adj().
> >>
> > 
> > Thank you for your suggestion. But i think it would be better to seperate to 2 issues. How about think these
> > issues separately because there are no dependency between race issue and my patch. As i already explained,
> > for_each_process path is meaningless if there is only one thread group with many threads(mm_users > 1 but 
> > no other thread group sharing same mm). Do you have any other idea to avoid meaningless loop ? 
> 
> Yes. I suggest reverting commit 44a70adec910d692 ("mm, oom_adj: make sure processes
> sharing mm have same view of oom_score_adj") and commit 97fd49c2355ffded ("mm, oom:
> kill all tasks sharing the mm").

This would require a lot of other work for something as border line as
weird threading model like this. I will think about something more
appropriate - e.g. we can take mmap_sem for read while doing this check
and that should prevent from races with [v]fork.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm
  2018-10-08  8:38         ` Yong-Taek Lee
@ 2018-10-08  9:27           ` Tetsuo Handa
  2018-10-09  6:35             ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Tetsuo Handa @ 2018-10-08  9:27 UTC (permalink / raw)
  To: ytk.lee, mhocko, mhocko
  Cc: linux-mm, linux-kernel, Oleg Nesterov, David Rientjes,
	Vladimir Davydov, Andrew Morton, Linus Torvalds

On 2018/10/08 17:38, Yong-Taek Lee wrote:
>>
>> On 2018/10/08 15:14, Yong-Taek Lee wrote:
>>>> On 2018/10/08 10:19, Yong-Taek Lee wrote:
>>>>> @@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>>>>>         struct mm_struct *mm = NULL;
>>>>>         struct task_struct *task;
>>>>>         int err = 0;
>>>>> +       int mm_users = 0;
>>>>>
>>>>>         task = get_proc_task(file_inode(file));
>>>>>         if (!task)
>>>>> @@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>>>>>                 struct task_struct *p = find_lock_task_mm(task);
>>>>>
>>>>>                 if (p) {
>>>>> -                       if (atomic_read(&p->mm->mm_users) > 1) {
>>>>> +                       mm_users = atomic_read(&p->mm->mm_users);
>>>>> +                       if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {
>>>>
>>>> How can this work (even before this patch)? When clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND)
>>>> is requested, copy_process() calls copy_signal() in order to copy sig->oom_score_adj and
>>>> sig->oom_score_adj_min before calling copy_mm() in order to increment mm->mm_users, doesn't it?
>>>> Then, we will get two different "struct signal_struct" with different oom_score_adj/oom_score_adj_min
>>>> but one "struct mm_struct" shared by two thread groups.
>>>>
>>>
>>> Are you talking about race between __set_oom_adj and copy_process?
>>> If so, i agree with your opinion. It can not set oom_score_adj properly for copied process if __set_oom_adj
>>> check mm_users before copy_process calls copy_mm after copy_signal. Please correct me if i misunderstood anything.
>>
>> You understand it correctly.
>>
>> Reversing copy_signal() and copy_mm() is not sufficient either. We need to use a read/write lock
>> (read lock for copy_process() and write lock for __set_oom_adj()) in order to make sure that
>> the thread created by clone() becomes reachable from for_each_process() path in __set_oom_adj().
>>
> 
> Thank you for your suggestion. But i think it would be better to seperate to 2 issues. How about think these
> issues separately because there are no dependency between race issue and my patch. As i already explained,
> for_each_process path is meaningless if there is only one thread group with many threads(mm_users > 1 but 
> no other thread group sharing same mm). Do you have any other idea to avoid meaningless loop ? 

Yes. I suggest reverting commit 44a70adec910d692 ("mm, oom_adj: make sure processes
sharing mm have same view of oom_score_adj") and commit 97fd49c2355ffded ("mm, oom:
kill all tasks sharing the mm").

> 
>>>
>>>>>                                 mm = p->mm;
>>>>>                                 atomic_inc(&mm->mm_count);
>>>>>                         }
>>>
>>
> 




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm
  2018-10-08  6:14     ` Yong-Taek Lee
@ 2018-10-08  6:22       ` Tetsuo Handa
       [not found]       ` <CGME20181008011931epcms1p82dd01b7e5c067ea99946418bc97de46a@epcms1p2>
  1 sibling, 0 replies; 17+ messages in thread
From: Tetsuo Handa @ 2018-10-08  6:22 UTC (permalink / raw)
  To: ytk.lee, mhocko, mhocko; +Cc: linux-mm, linux-kernel

On 2018/10/08 15:14, Yong-Taek Lee wrote:
>> On 2018/10/08 10:19, Yong-Taek Lee wrote:
>>> @@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>>>         struct mm_struct *mm = NULL;
>>>         struct task_struct *task;
>>>         int err = 0;
>>> +       int mm_users = 0;
>>>
>>>         task = get_proc_task(file_inode(file));
>>>         if (!task)
>>> @@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>>>                 struct task_struct *p = find_lock_task_mm(task);
>>>
>>>                 if (p) {
>>> -                       if (atomic_read(&p->mm->mm_users) > 1) {
>>> +                       mm_users = atomic_read(&p->mm->mm_users);
>>> +                       if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {
>>
>> How can this work (even before this patch)? When clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND)
>> is requested, copy_process() calls copy_signal() in order to copy sig->oom_score_adj and
>> sig->oom_score_adj_min before calling copy_mm() in order to increment mm->mm_users, doesn't it?
>> Then, we will get two different "struct signal_struct" with different oom_score_adj/oom_score_adj_min
>> but one "struct mm_struct" shared by two thread groups.
>>
> 
> Are you talking about race between __set_oom_adj and copy_process?
> If so, i agree with your opinion. It can not set oom_score_adj properly for copied process if __set_oom_adj
> check mm_users before copy_process calls copy_mm after copy_signal. Please correct me if i misunderstood anything.

You understand it correctly.

Reversing copy_signal() and copy_mm() is not sufficient either. We need to use a read/write lock
(read lock for copy_process() and write lock for __set_oom_adj()) in order to make sure that
the thread created by clone() becomes reachable from for_each_process() path in __set_oom_adj().

> 
>>>                                 mm = p->mm;
>>>                                 atomic_inc(&mm->mm_count);
>>>                         }
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm
  2018-10-08  1:19 ` Yong-Taek Lee
@ 2018-10-08  2:52   ` Tetsuo Handa
       [not found]   ` <CGME20181008011931epcms1p82dd01b7e5c067ea99946418bc97de46a@epcms1p5>
  1 sibling, 0 replies; 17+ messages in thread
From: Tetsuo Handa @ 2018-10-08  2:52 UTC (permalink / raw)
  To: ytk.lee, mhocko, mhocko; +Cc: linux-mm, linux-kernel

On 2018/10/08 10:19, Yong-Taek Lee wrote:
> @@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>         struct mm_struct *mm = NULL;
>         struct task_struct *task;
>         int err = 0;
> +       int mm_users = 0;
> 
>         task = get_proc_task(file_inode(file));
>         if (!task)
> @@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>                 struct task_struct *p = find_lock_task_mm(task);
> 
>                 if (p) {
> -                       if (atomic_read(&p->mm->mm_users) > 1) {
> +                       mm_users = atomic_read(&p->mm->mm_users);
> +                       if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {

How can this work (even before this patch)? When clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND)
is requested, copy_process() calls copy_signal() in order to copy sig->oom_score_adj and
sig->oom_score_adj_min before calling copy_mm() in order to increment mm->mm_users, doesn't it?
Then, we will get two different "struct signal_struct" with different oom_score_adj/oom_score_adj_min
but one "struct mm_struct" shared by two thread groups.

>                                 mm = p->mm;
>                                 atomic_inc(&mm->mm_count);
>                         }


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm
       [not found] <CGME20181008011931epcms1p82dd01b7e5c067ea99946418bc97de46a@epcms1p8>
@ 2018-10-08  1:19 ` Yong-Taek Lee
  2018-10-08  2:52   ` Tetsuo Handa
       [not found]   ` <CGME20181008011931epcms1p82dd01b7e5c067ea99946418bc97de46a@epcms1p5>
  0 siblings, 2 replies; 17+ messages in thread
From: Yong-Taek Lee @ 2018-10-08  1:19 UTC (permalink / raw)
  To: mhocko, mhocko; +Cc: Yong-Taek Lee, linux-mm, linux-kernel

It is introduced by commit 44a70adec910 ("mm, oom_adj: make sure
processes sharing mm have same view of oom_score_adj"). Most of
user process's mm_users is bigger than 1 but only one thread group.
In this case, for_each_process loop meaninglessly try to find processes
which sharing same mm even though there is only one thread group.

My idea is that target task's nr thread is smaller than mm_users if there
are more thread groups sharing the same mm. So we can skip loop
if mm_user and nr_thread are same. 

test result
while true; do count=0; time while [ $count -lt 10000 ]; do echo -1000 > /proc/1457/oom_score_adj; count=$((count+1)); done; done;

before patch
0m00.59s real     0m00.09s user     0m00.51s system
0m00.59s real     0m00.14s user     0m00.45s system
0m00.58s real     0m00.11s user     0m00.47s system
0m00.58s real     0m00.10s user     0m00.48s system
0m00.59s real     0m00.11s user     0m00.48s system

after patch
0m00.15s real     0m00.07s user     0m00.08s system
0m00.14s real     0m00.10s user     0m00.04s system
0m00.14s real     0m00.10s user     0m00.05s system
0m00.14s real     0m00.08s user     0m00.07s system
0m00.14s real     0m00.08s user     0m00.07s system

Signed-off-by: Lee YongTaek <ytk.lee@samsung.com>
---
 fs/proc/base.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index f9f72aee6d45..54b2fb5e9c51 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
        struct mm_struct *mm = NULL;
        struct task_struct *task;
        int err = 0;
+       int mm_users = 0;

        task = get_proc_task(file_inode(file));
        if (!task)
@@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
                struct task_struct *p = find_lock_task_mm(task);

                if (p) {
-                       if (atomic_read(&p->mm->mm_users) > 1) {
+                       mm_users = atomic_read(&p->mm->mm_users);
+                       if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {
                                mm = p->mm;
                                atomic_inc(&mm->mm_count);
                        }
--

^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-10-09 14:09 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20181005063208epcms1p22959cd2f771ad017996e2b18266791ea@epcms1p2>
2018-10-05  6:32 ` [PATCH] mm, oom_adj: avoid meaningless loop to find processes sharing mm Yong-Taek Lee
2018-10-09  6:23   ` Michal Hocko
2018-10-09  8:03     ` Michal Hocko
     [not found] <CGME20181008011931epcms1p82dd01b7e5c067ea99946418bc97de46a@epcms1p8>
2018-10-08  1:19 ` Yong-Taek Lee
2018-10-08  2:52   ` Tetsuo Handa
     [not found]   ` <CGME20181008011931epcms1p82dd01b7e5c067ea99946418bc97de46a@epcms1p5>
2018-10-08  6:14     ` Yong-Taek Lee
2018-10-08  6:22       ` Tetsuo Handa
     [not found]       ` <CGME20181008011931epcms1p82dd01b7e5c067ea99946418bc97de46a@epcms1p2>
2018-10-08  8:38         ` Yong-Taek Lee
2018-10-08  9:27           ` Tetsuo Handa
2018-10-09  6:35             ` Michal Hocko
2018-10-09  7:50               ` Michal Hocko
2018-10-09 10:00                 ` Tetsuo Handa
2018-10-09 11:10                   ` Michal Hocko
2018-10-09 12:52                     ` Tetsuo Handa
2018-10-09 12:58                       ` Michal Hocko
2018-10-09 13:14                         ` Tetsuo Handa
2018-10-09 13:26                           ` Michal Hocko
2018-10-09 13:51                             ` Tetsuo Handa
2018-10-09 14:09                               ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.