All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mm, oom: Tolerate processes sharing mm with different view of oom_score_adj.
@ 2019-01-16 10:55 Tetsuo Handa
  2019-01-16 11:09 ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Tetsuo Handa @ 2019-01-16 10:55 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Johannes Weiner, David Rientjes, linux-mm, Tetsuo Handa, Yong-Taek Lee

This patch reverts both commit 44a70adec910d692 ("mm, oom_adj: make sure
processes sharing mm have same view of oom_score_adj") and commit
97fd49c2355ffded ("mm, oom: kill all tasks sharing the mm") in order to
close a race and reduce the latency at __set_oom_adj(), and reduces the
warning at __oom_kill_process() in order to minimize the latency.

Commit 36324a990cf578b5 ("oom: clear TIF_MEMDIE after oom_reaper managed
to unmap the address space") introduced the worst case mentioned in
44a70adec910d692. But since the OOM killer skips mm with MMF_OOM_SKIP set,
only administrators can trigger the worst case.

Since 44a70adec910d692 did not take latency into account, we can hold RCU
for minutes and trigger RCU stall warnings by calling printk() on many
thousands of thread groups. Even without calling printk(), the latency is
mentioned by Yong-Taek Lee [1]. And I noticed that 44a70adec910d692 is
racy, and trying to fix the race will require a global lock which is too
costly for rare events.

If the worst case in 44a70adec910d692 happens, it is an administrator's
request. Therefore, tolerate the worst case and speed up __set_oom_adj().

[1] https://lkml.kernel.org/r/20181008011931epcms1p82dd01b7e5c067ea99946418bc97de46a@epcms1p8

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
---
 fs/proc/base.c     | 46 ----------------------------------------------
 include/linux/mm.h |  2 --
 mm/oom_kill.c      | 10 ++++++----
 3 files changed, 6 insertions(+), 52 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 633a634..41ece8f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1020,7 +1020,6 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 {
 	static DEFINE_MUTEX(oom_adj_mutex);
-	struct mm_struct *mm = NULL;
 	struct task_struct *task;
 	int err = 0;
 
@@ -1050,55 +1049,10 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 		}
 	}
 
-	/*
-	 * Make sure we will check other processes sharing the mm if this is
-	 * not vfrok which wants its own oom_score_adj.
-	 * pin the mm so it doesn't go away and get reused after task_unlock
-	 */
-	if (!task->vfork_done) {
-		struct task_struct *p = find_lock_task_mm(task);
-
-		if (p) {
-			if (atomic_read(&p->mm->mm_users) > 1) {
-				mm = p->mm;
-				mmgrab(mm);
-			}
-			task_unlock(p);
-		}
-	}
-
 	task->signal->oom_score_adj = oom_adj;
 	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
 		task->signal->oom_score_adj_min = (short)oom_adj;
 	trace_oom_score_adj_update(task);
-
-	if (mm) {
-		struct task_struct *p;
-
-		rcu_read_lock();
-		for_each_process(p) {
-			if (same_thread_group(task, p))
-				continue;
-
-			/* do not touch kernel threads or the global init */
-			if (p->flags & PF_KTHREAD || is_global_init(p))
-				continue;
-
-			task_lock(p);
-			if (!p->vfork_done && process_shares_mm(p, mm)) {
-				pr_info("updating oom_score_adj for %d (%s) from %d to %d because it shares mm with %d (%s). Report if this is unexpected.\n",
-						task_pid_nr(p), p->comm,
-						p->signal->oom_score_adj, oom_adj,
-						task_pid_nr(task), task->comm);
-				p->signal->oom_score_adj = oom_adj;
-				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
-					p->signal->oom_score_adj_min = (short)oom_adj;
-			}
-			task_unlock(p);
-		}
-		rcu_read_unlock();
-		mmdrop(mm);
-	}
 err_unlock:
 	mutex_unlock(&oom_adj_mutex);
 	put_task_struct(task);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80bb640..28879c1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2690,8 +2690,6 @@ static inline int in_gate_area(struct mm_struct *mm, unsigned long addr)
 }
 #endif	/* __HAVE_ARCH_GATE_AREA */
 
-extern bool process_shares_mm(struct task_struct *p, struct mm_struct *mm);
-
 #ifdef CONFIG_SYSCTL
 extern int sysctl_drop_caches;
 int drop_caches_sysctl_handler(struct ctl_table *, int,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f0e8cd9..c7005b1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -478,7 +478,7 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
  * task's threads: if one of those is using this mm then this task was also
  * using it.
  */
-bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
+static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 {
 	struct task_struct *t;
 
@@ -896,12 +896,14 @@ static void __oom_kill_process(struct task_struct *victim)
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (is_global_init(p)) {
+		if (is_global_init(p) ||
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
 			can_oom_reap = false;
-			set_bit(MMF_OOM_SKIP, &mm->flags);
-			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
+			if (!test_bit(MMF_OOM_SKIP, &mm->flags))
+				pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
 					task_pid_nr(victim), victim->comm,
 					task_pid_nr(p), p->comm);
+			set_bit(MMF_OOM_SKIP, &mm->flags);
 			continue;
 		}
 		/*
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm, oom: Tolerate processes sharing mm with different view of oom_score_adj.
  2019-01-16 10:55 [PATCH] mm, oom: Tolerate processes sharing mm with different view of oom_score_adj Tetsuo Handa
@ 2019-01-16 11:09 ` Michal Hocko
  2019-01-16 11:30   ` Tetsuo Handa
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2019-01-16 11:09 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Andrew Morton, Johannes Weiner, David Rientjes, linux-mm, Yong-Taek Lee

On Wed 16-01-19 19:55:21, Tetsuo Handa wrote:
> This patch reverts both commit 44a70adec910d692 ("mm, oom_adj: make sure
> processes sharing mm have same view of oom_score_adj") and commit
> 97fd49c2355ffded ("mm, oom: kill all tasks sharing the mm") in order to
> close a race and reduce the latency at __set_oom_adj(), and reduces the
> warning at __oom_kill_process() in order to minimize the latency.
> 
> Commit 36324a990cf578b5 ("oom: clear TIF_MEMDIE after oom_reaper managed
> to unmap the address space") introduced the worst case mentioned in
> 44a70adec910d692. But since the OOM killer skips mm with MMF_OOM_SKIP set,
> only administrators can trigger the worst case.
> 
> Since 44a70adec910d692 did not take latency into account, we can hold RCU
> for minutes and trigger RCU stall warnings by calling printk() on many
> thousands of thread groups. Even without calling printk(), the latency is
> mentioned by Yong-Taek Lee [1]. And I noticed that 44a70adec910d692 is
> racy, and trying to fix the race will require a global lock which is too
> costly for rare events.
> 
> If the worst case in 44a70adec910d692 happens, it is an administrator's
> request. Therefore, tolerate the worst case and speed up __set_oom_adj().

I really do not think we care about latency. I consider the overal API
sanity much more important. Besides that the original report you are
referring to was never exaplained/shown to represent real world usecase.
oom_score_adj is not really a an interface to be tweaked in hot paths.

I can be convinced otherwise but that really requires some _real_
usecase with an explanation why there is no other way. Until then

Nacked-by: Michal Hocko <mhocko@suse.com>

> 
> [1] https://lkml.kernel.org/r/20181008011931epcms1p82dd01b7e5c067ea99946418bc97de46a@epcms1p8
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
> ---
>  fs/proc/base.c     | 46 ----------------------------------------------
>  include/linux/mm.h |  2 --
>  mm/oom_kill.c      | 10 ++++++----
>  3 files changed, 6 insertions(+), 52 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 633a634..41ece8f 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1020,7 +1020,6 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
>  static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>  {
>  	static DEFINE_MUTEX(oom_adj_mutex);
> -	struct mm_struct *mm = NULL;
>  	struct task_struct *task;
>  	int err = 0;
>  
> @@ -1050,55 +1049,10 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>  		}
>  	}
>  
> -	/*
> -	 * Make sure we will check other processes sharing the mm if this is
> -	 * not vfrok which wants its own oom_score_adj.
> -	 * pin the mm so it doesn't go away and get reused after task_unlock
> -	 */
> -	if (!task->vfork_done) {
> -		struct task_struct *p = find_lock_task_mm(task);
> -
> -		if (p) {
> -			if (atomic_read(&p->mm->mm_users) > 1) {
> -				mm = p->mm;
> -				mmgrab(mm);
> -			}
> -			task_unlock(p);
> -		}
> -	}
> -
>  	task->signal->oom_score_adj = oom_adj;
>  	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
>  		task->signal->oom_score_adj_min = (short)oom_adj;
>  	trace_oom_score_adj_update(task);
> -
> -	if (mm) {
> -		struct task_struct *p;
> -
> -		rcu_read_lock();
> -		for_each_process(p) {
> -			if (same_thread_group(task, p))
> -				continue;
> -
> -			/* do not touch kernel threads or the global init */
> -			if (p->flags & PF_KTHREAD || is_global_init(p))
> -				continue;
> -
> -			task_lock(p);
> -			if (!p->vfork_done && process_shares_mm(p, mm)) {
> -				pr_info("updating oom_score_adj for %d (%s) from %d to %d because it shares mm with %d (%s). Report if this is unexpected.\n",
> -						task_pid_nr(p), p->comm,
> -						p->signal->oom_score_adj, oom_adj,
> -						task_pid_nr(task), task->comm);
> -				p->signal->oom_score_adj = oom_adj;
> -				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
> -					p->signal->oom_score_adj_min = (short)oom_adj;
> -			}
> -			task_unlock(p);
> -		}
> -		rcu_read_unlock();
> -		mmdrop(mm);
> -	}
>  err_unlock:
>  	mutex_unlock(&oom_adj_mutex);
>  	put_task_struct(task);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 80bb640..28879c1 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2690,8 +2690,6 @@ static inline int in_gate_area(struct mm_struct *mm, unsigned long addr)
>  }
>  #endif	/* __HAVE_ARCH_GATE_AREA */
>  
> -extern bool process_shares_mm(struct task_struct *p, struct mm_struct *mm);
> -
>  #ifdef CONFIG_SYSCTL
>  extern int sysctl_drop_caches;
>  int drop_caches_sysctl_handler(struct ctl_table *, int,
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index f0e8cd9..c7005b1 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -478,7 +478,7 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
>   * task's threads: if one of those is using this mm then this task was also
>   * using it.
>   */
> -bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
> +static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
>  {
>  	struct task_struct *t;
>  
> @@ -896,12 +896,14 @@ static void __oom_kill_process(struct task_struct *victim)
>  			continue;
>  		if (same_thread_group(p, victim))
>  			continue;
> -		if (is_global_init(p)) {
> +		if (is_global_init(p) ||
> +		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
>  			can_oom_reap = false;
> -			set_bit(MMF_OOM_SKIP, &mm->flags);
> -			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
> +			if (!test_bit(MMF_OOM_SKIP, &mm->flags))
> +				pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
>  					task_pid_nr(victim), victim->comm,
>  					task_pid_nr(p), p->comm);
> +			set_bit(MMF_OOM_SKIP, &mm->flags);
>  			continue;
>  		}
>  		/*
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm, oom: Tolerate processes sharing mm with different view of oom_score_adj.
  2019-01-16 11:09 ` Michal Hocko
@ 2019-01-16 11:30   ` Tetsuo Handa
  2019-01-16 12:19     ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Tetsuo Handa @ 2019-01-16 11:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, David Rientjes, linux-mm, Yong-Taek Lee

On 2019/01/16 20:09, Michal Hocko wrote:
> On Wed 16-01-19 19:55:21, Tetsuo Handa wrote:
>> This patch reverts both commit 44a70adec910d692 ("mm, oom_adj: make sure
>> processes sharing mm have same view of oom_score_adj") and commit
>> 97fd49c2355ffded ("mm, oom: kill all tasks sharing the mm") in order to
>> close a race and reduce the latency at __set_oom_adj(), and reduces the
>> warning at __oom_kill_process() in order to minimize the latency.
>>
>> Commit 36324a990cf578b5 ("oom: clear TIF_MEMDIE after oom_reaper managed
>> to unmap the address space") introduced the worst case mentioned in
>> 44a70adec910d692. But since the OOM killer skips mm with MMF_OOM_SKIP set,
>> only administrators can trigger the worst case.
>>
>> Since 44a70adec910d692 did not take latency into account, we can hold RCU
>> for minutes and trigger RCU stall warnings by calling printk() on many
>> thousands of thread groups. Even without calling printk(), the latency is
>> mentioned by Yong-Taek Lee [1]. And I noticed that 44a70adec910d692 is
>> racy, and trying to fix the race will require a global lock which is too
>> costly for rare events.
>>
>> If the worst case in 44a70adec910d692 happens, it is an administrator's
>> request. Therefore, tolerate the worst case and speed up __set_oom_adj().
> 
> I really do not think we care about latency. I consider the overal API
> sanity much more important. Besides that the original report you are
> referring to was never exaplained/shown to represent real world usecase.
> oom_score_adj is not really a an interface to be tweaked in hot paths.

I do care about the latency. Holding RCU for more than 2 minutes is insane.

----------
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <sys/mman.h>
#include <signal.h>

#define STACKSIZE 8192
static int child(void *unused)
{
        pause();
        return 0;
}
int main(int argc, char *argv[])
{
        int fd = open("/proc/self/oom_score_adj", O_WRONLY);
        int i;
        char *stack = mmap(NULL, STACKSIZE, PROT_WRITE | PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, EOF, 0);
        for (i = 0; i < 8192 * 4; i++)
                if (clone(child, stack + STACKSIZE, CLONE_VM, NULL) == -1)
                        break;
        write(fd, "0\n", 2);
        kill(0, SIGSEGV);
        return 0;
}
----------

> 
> I can be convinced otherwise but that really requires some _real_
> usecase with an explanation why there is no other way. Until then
> 
> Nacked-by: Michal Hocko <mhocko@suse.com>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm, oom: Tolerate processes sharing mm with different view of oom_score_adj.
  2019-01-16 11:30   ` Tetsuo Handa
@ 2019-01-16 12:19     ` Michal Hocko
  2019-01-16 13:32       ` Tetsuo Handa
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2019-01-16 12:19 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Andrew Morton, Johannes Weiner, David Rientjes, linux-mm, Yong-Taek Lee

On Wed 16-01-19 20:30:25, Tetsuo Handa wrote:
> On 2019/01/16 20:09, Michal Hocko wrote:
> > On Wed 16-01-19 19:55:21, Tetsuo Handa wrote:
> >> This patch reverts both commit 44a70adec910d692 ("mm, oom_adj: make sure
> >> processes sharing mm have same view of oom_score_adj") and commit
> >> 97fd49c2355ffded ("mm, oom: kill all tasks sharing the mm") in order to
> >> close a race and reduce the latency at __set_oom_adj(), and reduces the
> >> warning at __oom_kill_process() in order to minimize the latency.
> >>
> >> Commit 36324a990cf578b5 ("oom: clear TIF_MEMDIE after oom_reaper managed
> >> to unmap the address space") introduced the worst case mentioned in
> >> 44a70adec910d692. But since the OOM killer skips mm with MMF_OOM_SKIP set,
> >> only administrators can trigger the worst case.
> >>
> >> Since 44a70adec910d692 did not take latency into account, we can hold RCU
> >> for minutes and trigger RCU stall warnings by calling printk() on many
> >> thousands of thread groups. Even without calling printk(), the latency is
> >> mentioned by Yong-Taek Lee [1]. And I noticed that 44a70adec910d692 is
> >> racy, and trying to fix the race will require a global lock which is too
> >> costly for rare events.
> >>
> >> If the worst case in 44a70adec910d692 happens, it is an administrator's
> >> request. Therefore, tolerate the worst case and speed up __set_oom_adj().
> > 
> > I really do not think we care about latency. I consider the overal API
> > sanity much more important. Besides that the original report you are
> > referring to was never exaplained/shown to represent real world usecase.
> > oom_score_adj is not really a an interface to be tweaked in hot paths.
> 
> I do care about the latency. Holding RCU for more than 2 minutes is insane.

Creating 8k threads could be considered insane as well. But more
seriously. I absolutely do not insist on holding a single RCU section
for the whole operation. But that doesn't really mean that we want to
revert these changes. for_each_process is by far not only called from
this path.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm, oom: Tolerate processes sharing mm with different view of oom_score_adj.
  2019-01-16 12:19     ` Michal Hocko
@ 2019-01-16 13:32       ` Tetsuo Handa
  2019-01-16 13:41         ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Tetsuo Handa @ 2019-01-16 13:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, David Rientjes, linux-mm,
	Yong-Taek Lee, Paul McKenney, Linus Torvalds

On 2019/01/16 21:19, Michal Hocko wrote:
> On Wed 16-01-19 20:30:25, Tetsuo Handa wrote:
>> On 2019/01/16 20:09, Michal Hocko wrote:
>>> On Wed 16-01-19 19:55:21, Tetsuo Handa wrote:
>>>> This patch reverts both commit 44a70adec910d692 ("mm, oom_adj: make sure
>>>> processes sharing mm have same view of oom_score_adj") and commit
>>>> 97fd49c2355ffded ("mm, oom: kill all tasks sharing the mm") in order to
>>>> close a race and reduce the latency at __set_oom_adj(), and reduces the
>>>> warning at __oom_kill_process() in order to minimize the latency.
>>>>
>>>> Commit 36324a990cf578b5 ("oom: clear TIF_MEMDIE after oom_reaper managed
>>>> to unmap the address space") introduced the worst case mentioned in
>>>> 44a70adec910d692. But since the OOM killer skips mm with MMF_OOM_SKIP set,
>>>> only administrators can trigger the worst case.
>>>>
>>>> Since 44a70adec910d692 did not take latency into account, we can hold RCU
>>>> for minutes and trigger RCU stall warnings by calling printk() on many
>>>> thousands of thread groups. Even without calling printk(), the latency is
>>>> mentioned by Yong-Taek Lee [1]. And I noticed that 44a70adec910d692 is
>>>> racy, and trying to fix the race will require a global lock which is too
>>>> costly for rare events.
>>>>
>>>> If the worst case in 44a70adec910d692 happens, it is an administrator's
>>>> request. Therefore, tolerate the worst case and speed up __set_oom_adj().
>>>
>>> I really do not think we care about latency. I consider the overal API
>>> sanity much more important. Besides that the original report you are
>>> referring to was never exaplained/shown to represent real world usecase.
>>> oom_score_adj is not really a an interface to be tweaked in hot paths.
>>
>> I do care about the latency. Holding RCU for more than 2 minutes is insane.
> 
> Creating 8k threads could be considered insane as well. But more
> seriously. I absolutely do not insist on holding a single RCU section
> for the whole operation. But that doesn't really mean that we want to
> revert these changes. for_each_process is by far not only called from
> this path.

Unlike check_hung_uninterruptible_tasks() where failing to resume after
breaking RCU section is tolerable, failing to resume after breaking RCU
section for __set_oom_adj() is not tolerable; it leaves the possibility
of different oom_score_adj. Unless it is inevitable (e.g. SysRq-t), I think
that calling printk() on each thread from RCU section is a poor choice.

What if thousands of threads concurrently called __set_oom_adj() when
each __set_oom_adj() call involves printk() on thousands of threads
which can take more than 2 minutes? How long will it take to complete?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm, oom: Tolerate processes sharing mm with different view of oom_score_adj.
  2019-01-16 13:32       ` Tetsuo Handa
@ 2019-01-16 13:41         ` Michal Hocko
  2019-01-17 10:40           ` Tetsuo Handa
  2019-01-17 15:51           ` Michal Hocko
  0 siblings, 2 replies; 14+ messages in thread
From: Michal Hocko @ 2019-01-16 13:41 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Andrew Morton, Johannes Weiner, David Rientjes, linux-mm,
	Yong-Taek Lee, Paul McKenney, Linus Torvalds

On Wed 16-01-19 22:32:50, Tetsuo Handa wrote:
> On 2019/01/16 21:19, Michal Hocko wrote:
> > On Wed 16-01-19 20:30:25, Tetsuo Handa wrote:
> >> On 2019/01/16 20:09, Michal Hocko wrote:
> >>> On Wed 16-01-19 19:55:21, Tetsuo Handa wrote:
> >>>> This patch reverts both commit 44a70adec910d692 ("mm, oom_adj: make sure
> >>>> processes sharing mm have same view of oom_score_adj") and commit
> >>>> 97fd49c2355ffded ("mm, oom: kill all tasks sharing the mm") in order to
> >>>> close a race and reduce the latency at __set_oom_adj(), and reduces the
> >>>> warning at __oom_kill_process() in order to minimize the latency.
> >>>>
> >>>> Commit 36324a990cf578b5 ("oom: clear TIF_MEMDIE after oom_reaper managed
> >>>> to unmap the address space") introduced the worst case mentioned in
> >>>> 44a70adec910d692. But since the OOM killer skips mm with MMF_OOM_SKIP set,
> >>>> only administrators can trigger the worst case.
> >>>>
> >>>> Since 44a70adec910d692 did not take latency into account, we can hold RCU
> >>>> for minutes and trigger RCU stall warnings by calling printk() on many
> >>>> thousands of thread groups. Even without calling printk(), the latency is
> >>>> mentioned by Yong-Taek Lee [1]. And I noticed that 44a70adec910d692 is
> >>>> racy, and trying to fix the race will require a global lock which is too
> >>>> costly for rare events.
> >>>>
> >>>> If the worst case in 44a70adec910d692 happens, it is an administrator's
> >>>> request. Therefore, tolerate the worst case and speed up __set_oom_adj().
> >>>
> >>> I really do not think we care about latency. I consider the overal API
> >>> sanity much more important. Besides that the original report you are
> >>> referring to was never exaplained/shown to represent real world usecase.
> >>> oom_score_adj is not really a an interface to be tweaked in hot paths.
> >>
> >> I do care about the latency. Holding RCU for more than 2 minutes is insane.
> > 
> > Creating 8k threads could be considered insane as well. But more
> > seriously. I absolutely do not insist on holding a single RCU section
> > for the whole operation. But that doesn't really mean that we want to
> > revert these changes. for_each_process is by far not only called from
> > this path.
> 
> Unlike check_hung_uninterruptible_tasks() where failing to resume after
> breaking RCU section is tolerable, failing to resume after breaking RCU
> section for __set_oom_adj() is not tolerable; it leaves the possibility
> of different oom_score_adj.

Then make sure that no threads are really missed. Really I fail to see
what you are actually arguing about. for_each_process is expensive. No
question about that. If you can replace it for this specific and odd
usecase then go ahead. But there is absolutely zero reason to have a
broken oom_score_adj semantic just because somebody might have thousands
of threads and want to update the score faster.

> Unless it is inevitable (e.g. SysRq-t), I think
> that calling printk() on each thread from RCU section is a poor choice.
> 
> What if thousands of threads concurrently called __set_oom_adj() when
> each __set_oom_adj() call involves printk() on thousands of threads
> which can take more than 2 minutes? How long will it take to complete?

I really do not mind removing printk if that is what really bothers
users. The primary purpose of this printk was to catch users who
wouldn't expect this change. There were exactly zero.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm, oom: Tolerate processes sharing mm with different view of oom_score_adj.
  2019-01-16 13:41         ` Michal Hocko
@ 2019-01-17 10:40           ` Tetsuo Handa
  2019-01-17 15:51           ` Michal Hocko
  1 sibling, 0 replies; 14+ messages in thread
From: Tetsuo Handa @ 2019-01-17 10:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, David Rientjes, linux-mm,
	Yong-Taek Lee, Paul McKenney, Linus Torvalds

On 2019/01/16 22:41, Michal Hocko wrote:
>>>> I do care about the latency. Holding RCU for more than 2 minutes is insane.
>>>
>>> Creating 8k threads could be considered insane as well. But more
>>> seriously. I absolutely do not insist on holding a single RCU section
>>> for the whole operation. But that doesn't really mean that we want to
>>> revert these changes. for_each_process is by far not only called from
>>> this path.
>>
>> Unlike check_hung_uninterruptible_tasks() where failing to resume after
>> breaking RCU section is tolerable, failing to resume after breaking RCU
>> section for __set_oom_adj() is not tolerable; it leaves the possibility
>> of different oom_score_adj.
> 
> Then make sure that no threads are really missed. Really I fail to see
> what you are actually arguing about.

Impossible unless we hold the global rw_semaphore for read during
copy_process()/do_exit() while hold the global rw_semaphore for write
during __set_oom_adj(). We won't accept such giant lock in order to close
the __set_oom_adj() race.

>                                      for_each_process is expensive. No
> question about that.

I'm saying that printk() is far more expensive. Current __set_oom_adj() code
allows wasting CPU by printing pointless message

  [ 1270.265958][ T8549] updating oom_score_adj for 30876 (a.out) from 0 to 0 because it shares mm with 8549 (a.out). Report if this is unexpected.
  [ 1270.265959][ T8549] updating oom_score_adj for 30877 (a.out) from 0 to 0 because it shares mm with 8549 (a.out). Report if this is unexpected.
  [ 1270.265961][ T8549] updating oom_score_adj for 30878 (a.out) from 0 to 0 because it shares mm with 8549 (a.out). Report if this is unexpected.
  [ 1270.265964][ T8549] updating oom_score_adj for 30879 (a.out) from 0 to 0 because it shares mm with 8549 (a.out). Report if this is unexpected.
  [ 1270.389516][ T8549] updating oom_score_adj for 30880 (a.out) from 0 to 0 because it shares mm with 8549 (a.out). Report if this is unexpected.
  [ 1270.395223][ T8549] updating oom_score_adj for 30881 (a.out) from 0 to 0 because it shares mm with 8549 (a.out). Report if this is unexpected.
  [ 1270.400871][ T8549] updating oom_score_adj for 30882 (a.out) from 0 to 0 because it shares mm with 8549 (a.out). Report if this is unexpected.
  [ 1270.406757][ T8549] updating oom_score_adj for 30883 (a.out) from 0 to 0 because it shares mm with 8549 (a.out). Report if this is unexpected.
  [ 1270.412509][ T8549] updating oom_score_adj for 30884 (a.out) from 0 to 0 because it shares mm with 8549 (a.out). Report if this is unexpected.

for _longer than one month_ ('2 minutes for one __set_oom_adj() call' x '32000
thread groups concurrently do "echo 0 > /proc/self/oom_score_adj"' = 44 days
to complete). This is nothing but a DoS attack vector.

>                      If you can replace it for this specific and odd
> usecase then go ahead. But there is absolutely zero reason to have a
> broken oom_score_adj semantic just because somebody might have thousands
> of threads and want to update the score faster.
> 
>> Unless it is inevitable (e.g. SysRq-t), I think
>> that calling printk() on each thread from RCU section is a poor choice.
>>
>> What if thousands of threads concurrently called __set_oom_adj() when
>> each __set_oom_adj() call involves printk() on thousands of threads
>> which can take more than 2 minutes? How long will it take to complete?
> 
> I really do not mind removing printk if that is what really bothers
> users. The primary purpose of this printk was to catch users who
> wouldn't expect this change. There were exactly zero.
> 

This printk() is pointless. There is no need to flood like above. Once is enough.
What is bad, it says "Report if this is unexpected." rather than "Report if you saw
this message.". If the user thinks that 'Oh, what a nifty caretaker. I need to do
"echo 0 > /proc/self/oom_score_adj" for only once.', that user won't report it.

And I estimate that we will need to wait for several more years to make sure that
all users upgrade their kernels to Linux 4.8+ which has __set_oom_adj() code. So far
"exactly zero" does not mean "changing oom_score_adj semantics is allowable". (But
so far "exactly zero" might suggest that there is absolutely no "CLONE_VM without
CLONE_SIGNAHD" user at all and thus preserving __oom_score_adj() code makes no sense.)



Given that said, I think that querying "CLONE_VM without CLONE_SIGNAHD" users at
copy_process() for the reason of that combination can improve the code.

--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1732,6 +1732,15 @@ static __latent_entropy struct task_struct *copy_process(
 	}
 
 	/*
+	 * Shared VM without signal handlers leads to complicated OOM-killer
+	 * handling. Let's ask such users why they want such combination.
+	 */
+	if ((clone_flags & CLONE_VM) && !(clone_flags & CLONE_SIGHAND))
+		pr_warn_once("***** %s(%d) is trying to create a thread sharing memory without signal handlers. Please be sure to report to linux-mm@kvack.org the reason why you want to use this combination. Otherwise, this combination will be forbidden in future kernels in order to simplify OOM-killer handling. *****\n",
+			     current->comm, task_pid_nr(current));
+
+
+	/*
 	 * Force any signals received before this point to be delivered
 	 * before the fork happens.  Collect up signals sent to multiple
 	 * processes that happen during the fork and delay them so that

If we waited enough period and there is no user, we can forbid that combination
and eliminate OOM handling code for "CLONE_VM without CLONE_SIGNAHD" which is
forcing current __set_oom_adj() code.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm, oom: Tolerate processes sharing mm with different view of oom_score_adj.
  2019-01-16 13:41         ` Michal Hocko
  2019-01-17 10:40           ` Tetsuo Handa
@ 2019-01-17 15:51           ` Michal Hocko
  2019-01-30 22:49             ` [PATCH v2] " Tetsuo Handa
  1 sibling, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2019-01-17 15:51 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Andrew Morton, Johannes Weiner, David Rientjes, linux-mm,
	Yong-Taek Lee, Paul McKenney, Linus Torvalds

[I am mostly offline for the rest of the week]

On Wed 16-01-19 14:41:31, Michal Hocko wrote:
> On Wed 16-01-19 22:32:50, Tetsuo Handa wrote:
> > On 2019/01/16 21:19, Michal Hocko wrote:
> > > On Wed 16-01-19 20:30:25, Tetsuo Handa wrote:
> > >> On 2019/01/16 20:09, Michal Hocko wrote:
> > >>> On Wed 16-01-19 19:55:21, Tetsuo Handa wrote:
> > >>>> This patch reverts both commit 44a70adec910d692 ("mm, oom_adj: make sure
> > >>>> processes sharing mm have same view of oom_score_adj") and commit
> > >>>> 97fd49c2355ffded ("mm, oom: kill all tasks sharing the mm") in order to
> > >>>> close a race and reduce the latency at __set_oom_adj(), and reduces the
> > >>>> warning at __oom_kill_process() in order to minimize the latency.
> > >>>>
> > >>>> Commit 36324a990cf578b5 ("oom: clear TIF_MEMDIE after oom_reaper managed
> > >>>> to unmap the address space") introduced the worst case mentioned in
> > >>>> 44a70adec910d692. But since the OOM killer skips mm with MMF_OOM_SKIP set,
> > >>>> only administrators can trigger the worst case.
> > >>>>
> > >>>> Since 44a70adec910d692 did not take latency into account, we can hold RCU
> > >>>> for minutes and trigger RCU stall warnings by calling printk() on many
> > >>>> thousands of thread groups. Even without calling printk(), the latency is
> > >>>> mentioned by Yong-Taek Lee [1]. And I noticed that 44a70adec910d692 is
> > >>>> racy, and trying to fix the race will require a global lock which is too
> > >>>> costly for rare events.
> > >>>>
> > >>>> If the worst case in 44a70adec910d692 happens, it is an administrator's
> > >>>> request. Therefore, tolerate the worst case and speed up __set_oom_adj().
> > >>>
> > >>> I really do not think we care about latency. I consider the overal API
> > >>> sanity much more important. Besides that the original report you are
> > >>> referring to was never exaplained/shown to represent real world usecase.
> > >>> oom_score_adj is not really a an interface to be tweaked in hot paths.
> > >>
> > >> I do care about the latency. Holding RCU for more than 2 minutes is insane.
> > > 
> > > Creating 8k threads could be considered insane as well. But more
> > > seriously. I absolutely do not insist on holding a single RCU section
> > > for the whole operation. But that doesn't really mean that we want to
> > > revert these changes. for_each_process is by far not only called from
> > > this path.
> > 
> > Unlike check_hung_uninterruptible_tasks() where failing to resume after
> > breaking RCU section is tolerable, failing to resume after breaking RCU
> > section for __set_oom_adj() is not tolerable; it leaves the possibility
> > of different oom_score_adj.
> 
> Then make sure that no threads are really missed. Really I fail to see
> what you are actually arguing about. for_each_process is expensive. No
> question about that. If you can replace it for this specific and odd
> usecase then go ahead. But there is absolutely zero reason to have a
> broken oom_score_adj semantic just because somebody might have thousands
> of threads and want to update the score faster.

Btw. the current implementation annoyance is caused by the fact
that the oom_score_adj is per signal_struct rather than mm_struct. The
reason is that we really need:
	if (!vfork()) {
		set_oom_score_adj()
		exec()
	}
to work properly. One way around that is to special case oom_score_adj
for tasks in vfork and store their shadow value into the task_struct.
The shadow value would get transfered over to the mm struct once a new
one is allocated. So something very coarsly like

short tsk_get_oom_score_adj(struct task_struct *tsk)
{
	if (tsk->oom_score_adj != OOM_SCORE_ADJ_INVALID)
		return tsk->oom_score_adj;

	return tsk->signal->oom_score_adj;
}

use this helper instead of direct oom_score_adj usage. Then we need to
special case the setting in __set_oom_adj and dup_mm to copy the value
over instead of copy_signal.

I think this is doable.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2] mm, oom: Tolerate processes sharing mm with different view of oom_score_adj.
  2019-01-17 15:51           ` Michal Hocko
@ 2019-01-30 22:49             ` Tetsuo Handa
  2019-01-31  7:11               ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Tetsuo Handa @ 2019-01-30 22:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, David Rientjes, linux-mm,
	Yong-Taek Lee, Paul McKenney, Linus Torvalds, LKML

This patch reverts both commit 44a70adec910d692 ("mm, oom_adj: make sure
processes sharing mm have same view of oom_score_adj") and commit
97fd49c2355ffded ("mm, oom: kill all tasks sharing the mm") in order to
close a race and reduce the latency at __set_oom_adj(), and reduces the
warning at __oom_kill_process() in order to minimize the latency.

Commit 36324a990cf578b5 ("oom: clear TIF_MEMDIE after oom_reaper managed
to unmap the address space") introduced the worst case mentioned in
44a70adec910d692. But since the OOM killer skips mm with MMF_OOM_SKIP set,
only administrators can trigger the worst case.

Since 44a70adec910d692 did not take latency into account, we can "hold RCU
for minutes and trigger RCU stall warnings" by calling printk() on many
thousands of thread groups. Also, current code becomes a DoS attack vector
which will allow "stalling for more than one month in unkillable state"
simply printk()ing same messages when many thousands of thread groups
tried to iterate __set_oom_adj() on each other.

I also noticed that 44a70adec910d692 is racy [1], and trying to fix the
race will require a global lock which is too costly for rare events. And
Michal Hocko is thinking to change the oom_score_adj implementation to per
mm_struct (with shadowed score stored in per task_struct in order to
support vfork() => __set_oom_adj() => execve() sequence) so that we don't
need the global lock.

If the worst case in 44a70adec910d692 happened, it is an administrator's
request. Therefore, before changing the oom_score_adj implementation,
let's eliminate the DoS attack vector first.

[1] https://lkml.kernel.org/r/20181008011931epcms1p82dd01b7e5c067ea99946418bc97de46a@epcms1p8

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
Nacked-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c     | 46 ----------------------------------------------
 include/linux/mm.h |  2 --
 mm/oom_kill.c      | 10 ++++++----
 3 files changed, 6 insertions(+), 52 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 633a634..41ece8f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1020,7 +1020,6 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
 static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 {
 	static DEFINE_MUTEX(oom_adj_mutex);
-	struct mm_struct *mm = NULL;
 	struct task_struct *task;
 	int err = 0;
 
@@ -1050,55 +1049,10 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 		}
 	}
 
-	/*
-	 * Make sure we will check other processes sharing the mm if this is
-	 * not vfrok which wants its own oom_score_adj.
-	 * pin the mm so it doesn't go away and get reused after task_unlock
-	 */
-	if (!task->vfork_done) {
-		struct task_struct *p = find_lock_task_mm(task);
-
-		if (p) {
-			if (atomic_read(&p->mm->mm_users) > 1) {
-				mm = p->mm;
-				mmgrab(mm);
-			}
-			task_unlock(p);
-		}
-	}
-
 	task->signal->oom_score_adj = oom_adj;
 	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
 		task->signal->oom_score_adj_min = (short)oom_adj;
 	trace_oom_score_adj_update(task);
-
-	if (mm) {
-		struct task_struct *p;
-
-		rcu_read_lock();
-		for_each_process(p) {
-			if (same_thread_group(task, p))
-				continue;
-
-			/* do not touch kernel threads or the global init */
-			if (p->flags & PF_KTHREAD || is_global_init(p))
-				continue;
-
-			task_lock(p);
-			if (!p->vfork_done && process_shares_mm(p, mm)) {
-				pr_info("updating oom_score_adj for %d (%s) from %d to %d because it shares mm with %d (%s). Report if this is unexpected.\n",
-						task_pid_nr(p), p->comm,
-						p->signal->oom_score_adj, oom_adj,
-						task_pid_nr(task), task->comm);
-				p->signal->oom_score_adj = oom_adj;
-				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
-					p->signal->oom_score_adj_min = (short)oom_adj;
-			}
-			task_unlock(p);
-		}
-		rcu_read_unlock();
-		mmdrop(mm);
-	}
 err_unlock:
 	mutex_unlock(&oom_adj_mutex);
 	put_task_struct(task);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80bb640..28879c1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2690,8 +2690,6 @@ static inline int in_gate_area(struct mm_struct *mm, unsigned long addr)
 }
 #endif	/* __HAVE_ARCH_GATE_AREA */
 
-extern bool process_shares_mm(struct task_struct *p, struct mm_struct *mm);
-
 #ifdef CONFIG_SYSCTL
 extern int sysctl_drop_caches;
 int drop_caches_sysctl_handler(struct ctl_table *, int,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f0e8cd9..c7005b1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -478,7 +478,7 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
  * task's threads: if one of those is using this mm then this task was also
  * using it.
  */
-bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
+static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 {
 	struct task_struct *t;
 
@@ -896,12 +896,14 @@ static void __oom_kill_process(struct task_struct *victim)
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (is_global_init(p)) {
+		if (is_global_init(p) ||
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
 			can_oom_reap = false;
-			set_bit(MMF_OOM_SKIP, &mm->flags);
-			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
+			if (!test_bit(MMF_OOM_SKIP, &mm->flags))
+				pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
 					task_pid_nr(victim), victim->comm,
 					task_pid_nr(p), p->comm);
+			set_bit(MMF_OOM_SKIP, &mm->flags);
 			continue;
 		}
 		/*
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2] mm, oom: Tolerate processes sharing mm with different view of oom_score_adj.
  2019-01-30 22:49             ` [PATCH v2] " Tetsuo Handa
@ 2019-01-31  7:11               ` Michal Hocko
  2019-01-31 20:59                 ` Tetsuo Handa
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2019-01-31  7:11 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Andrew Morton, Johannes Weiner, David Rientjes, linux-mm,
	Yong-Taek Lee, Paul McKenney, Linus Torvalds, LKML

On Thu 31-01-19 07:49:35, Tetsuo Handa wrote:
> This patch reverts both commit 44a70adec910d692 ("mm, oom_adj: make sure
> processes sharing mm have same view of oom_score_adj") and commit
> 97fd49c2355ffded ("mm, oom: kill all tasks sharing the mm") in order to
> close a race and reduce the latency at __set_oom_adj(), and reduces the
> warning at __oom_kill_process() in order to minimize the latency.
> 
> Commit 36324a990cf578b5 ("oom: clear TIF_MEMDIE after oom_reaper managed
> to unmap the address space") introduced the worst case mentioned in
> 44a70adec910d692. But since the OOM killer skips mm with MMF_OOM_SKIP set,
> only administrators can trigger the worst case.
> 
> Since 44a70adec910d692 did not take latency into account, we can "hold RCU
> for minutes and trigger RCU stall warnings" by calling printk() on many
> thousands of thread groups. Also, current code becomes a DoS attack vector
> which will allow "stalling for more than one month in unkillable state"
> simply printk()ing same messages when many thousands of thread groups
> tried to iterate __set_oom_adj() on each other.
> 
> I also noticed that 44a70adec910d692 is racy [1], and trying to fix the
> race will require a global lock which is too costly for rare events. And
> Michal Hocko is thinking to change the oom_score_adj implementation to per
> mm_struct (with shadowed score stored in per task_struct in order to
> support vfork() => __set_oom_adj() => execve() sequence) so that we don't
> need the global lock.
> 
> If the worst case in 44a70adec910d692 happened, it is an administrator's
> request. Therefore, before changing the oom_score_adj implementation,
> let's eliminate the DoS attack vector first.

This is really ridiculous. I have already nacked the previous version
and provided two ways around. The simplest one is to drop the printk.
The second one is to move oom_score_adj to the mm struct. Could you
explain why do you still push for this?

> [1] https://lkml.kernel.org/r/20181008011931epcms1p82dd01b7e5c067ea99946418bc97de46a@epcms1p8
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
> Nacked-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/proc/base.c     | 46 ----------------------------------------------
>  include/linux/mm.h |  2 --
>  mm/oom_kill.c      | 10 ++++++----
>  3 files changed, 6 insertions(+), 52 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 633a634..41ece8f 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1020,7 +1020,6 @@ static ssize_t oom_adj_read(struct file *file, char __user *buf, size_t count,
>  static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>  {
>  	static DEFINE_MUTEX(oom_adj_mutex);
> -	struct mm_struct *mm = NULL;
>  	struct task_struct *task;
>  	int err = 0;
>  
> @@ -1050,55 +1049,10 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>  		}
>  	}
>  
> -	/*
> -	 * Make sure we will check other processes sharing the mm if this is
> -	 * not vfrok which wants its own oom_score_adj.
> -	 * pin the mm so it doesn't go away and get reused after task_unlock
> -	 */
> -	if (!task->vfork_done) {
> -		struct task_struct *p = find_lock_task_mm(task);
> -
> -		if (p) {
> -			if (atomic_read(&p->mm->mm_users) > 1) {
> -				mm = p->mm;
> -				mmgrab(mm);
> -			}
> -			task_unlock(p);
> -		}
> -	}
> -
>  	task->signal->oom_score_adj = oom_adj;
>  	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
>  		task->signal->oom_score_adj_min = (short)oom_adj;
>  	trace_oom_score_adj_update(task);
> -
> -	if (mm) {
> -		struct task_struct *p;
> -
> -		rcu_read_lock();
> -		for_each_process(p) {
> -			if (same_thread_group(task, p))
> -				continue;
> -
> -			/* do not touch kernel threads or the global init */
> -			if (p->flags & PF_KTHREAD || is_global_init(p))
> -				continue;
> -
> -			task_lock(p);
> -			if (!p->vfork_done && process_shares_mm(p, mm)) {
> -				pr_info("updating oom_score_adj for %d (%s) from %d to %d because it shares mm with %d (%s). Report if this is unexpected.\n",
> -						task_pid_nr(p), p->comm,
> -						p->signal->oom_score_adj, oom_adj,
> -						task_pid_nr(task), task->comm);
> -				p->signal->oom_score_adj = oom_adj;
> -				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
> -					p->signal->oom_score_adj_min = (short)oom_adj;
> -			}
> -			task_unlock(p);
> -		}
> -		rcu_read_unlock();
> -		mmdrop(mm);
> -	}
>  err_unlock:
>  	mutex_unlock(&oom_adj_mutex);
>  	put_task_struct(task);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 80bb640..28879c1 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2690,8 +2690,6 @@ static inline int in_gate_area(struct mm_struct *mm, unsigned long addr)
>  }
>  #endif	/* __HAVE_ARCH_GATE_AREA */
>  
> -extern bool process_shares_mm(struct task_struct *p, struct mm_struct *mm);
> -
>  #ifdef CONFIG_SYSCTL
>  extern int sysctl_drop_caches;
>  int drop_caches_sysctl_handler(struct ctl_table *, int,
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index f0e8cd9..c7005b1 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -478,7 +478,7 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
>   * task's threads: if one of those is using this mm then this task was also
>   * using it.
>   */
> -bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
> +static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
>  {
>  	struct task_struct *t;
>  
> @@ -896,12 +896,14 @@ static void __oom_kill_process(struct task_struct *victim)
>  			continue;
>  		if (same_thread_group(p, victim))
>  			continue;
> -		if (is_global_init(p)) {
> +		if (is_global_init(p) ||
> +		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
>  			can_oom_reap = false;
> -			set_bit(MMF_OOM_SKIP, &mm->flags);
> -			pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
> +			if (!test_bit(MMF_OOM_SKIP, &mm->flags))
> +				pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
>  					task_pid_nr(victim), victim->comm,
>  					task_pid_nr(p), p->comm);
> +			set_bit(MMF_OOM_SKIP, &mm->flags);
>  			continue;
>  		}
>  		/*
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2] mm, oom: Tolerate processes sharing mm with different view of oom_score_adj.
  2019-01-31  7:11               ` Michal Hocko
@ 2019-01-31 20:59                 ` Tetsuo Handa
  2019-02-01  9:14                   ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Tetsuo Handa @ 2019-01-31 20:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, David Rientjes, linux-mm,
	Yong-Taek Lee, Paul McKenney, Linus Torvalds, LKML

On 2019/01/31 16:11, Michal Hocko wrote:
> On Thu 31-01-19 07:49:35, Tetsuo Handa wrote:
>> This patch reverts both commit 44a70adec910d692 ("mm, oom_adj: make sure
>> processes sharing mm have same view of oom_score_adj") and commit
>> 97fd49c2355ffded ("mm, oom: kill all tasks sharing the mm") in order to
>> close a race and reduce the latency at __set_oom_adj(), and reduces the
>> warning at __oom_kill_process() in order to minimize the latency.
>>
>> Commit 36324a990cf578b5 ("oom: clear TIF_MEMDIE after oom_reaper managed
>> to unmap the address space") introduced the worst case mentioned in
>> 44a70adec910d692. But since the OOM killer skips mm with MMF_OOM_SKIP set,
>> only administrators can trigger the worst case.
>>
>> Since 44a70adec910d692 did not take latency into account, we can "hold RCU
>> for minutes and trigger RCU stall warnings" by calling printk() on many
>> thousands of thread groups. Also, current code becomes a DoS attack vector
>> which will allow "stalling for more than one month in unkillable state"
>> simply printk()ing same messages when many thousands of thread groups
>> tried to iterate __set_oom_adj() on each other.
>>
>> I also noticed that 44a70adec910d692 is racy [1], and trying to fix the
>> race will require a global lock which is too costly for rare events. And
>> Michal Hocko is thinking to change the oom_score_adj implementation to per
>> mm_struct (with shadowed score stored in per task_struct in order to
>> support vfork() => __set_oom_adj() => execve() sequence) so that we don't
>> need the global lock.
>>
>> If the worst case in 44a70adec910d692 happened, it is an administrator's
>> request. Therefore, before changing the oom_score_adj implementation,
>> let's eliminate the DoS attack vector first.
> 
> This is really ridiculous. I have already nacked the previous version
> and provided two ways around. The simplest one is to drop the printk.
> The second one is to move oom_score_adj to the mm struct. Could you
> explain why do you still push for this?

Dropping printk() does not close the race.
You must propose an alternative patch if you dislike this patch.

> 
>> [1] https://lkml.kernel.org/r/20181008011931epcms1p82dd01b7e5c067ea99946418bc97de46a@epcms1p8
>>
>> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>> Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
>> Nacked-by: Michal Hocko <mhocko@suse.com>
>> ---
>>  fs/proc/base.c     | 46 ----------------------------------------------
>>  include/linux/mm.h |  2 --
>>  mm/oom_kill.c      | 10 ++++++----
>>  3 files changed, 6 insertions(+), 52 deletions(-)
>>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2] mm, oom: Tolerate processes sharing mm with different view of oom_score_adj.
  2019-01-31 20:59                 ` Tetsuo Handa
@ 2019-02-01  9:14                   ` Michal Hocko
  2019-02-02 11:06                     ` Tetsuo Handa
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2019-02-01  9:14 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Andrew Morton, Johannes Weiner, David Rientjes, linux-mm,
	Yong-Taek Lee, Paul McKenney, Linus Torvalds, LKML

On Fri 01-02-19 05:59:55, Tetsuo Handa wrote:
> On 2019/01/31 16:11, Michal Hocko wrote:
> > On Thu 31-01-19 07:49:35, Tetsuo Handa wrote:
> >> This patch reverts both commit 44a70adec910d692 ("mm, oom_adj: make sure
> >> processes sharing mm have same view of oom_score_adj") and commit
> >> 97fd49c2355ffded ("mm, oom: kill all tasks sharing the mm") in order to
> >> close a race and reduce the latency at __set_oom_adj(), and reduces the
> >> warning at __oom_kill_process() in order to minimize the latency.
> >>
> >> Commit 36324a990cf578b5 ("oom: clear TIF_MEMDIE after oom_reaper managed
> >> to unmap the address space") introduced the worst case mentioned in
> >> 44a70adec910d692. But since the OOM killer skips mm with MMF_OOM_SKIP set,
> >> only administrators can trigger the worst case.
> >>
> >> Since 44a70adec910d692 did not take latency into account, we can "hold RCU
> >> for minutes and trigger RCU stall warnings" by calling printk() on many
> >> thousands of thread groups. Also, current code becomes a DoS attack vector
> >> which will allow "stalling for more than one month in unkillable state"
> >> simply printk()ing same messages when many thousands of thread groups
> >> tried to iterate __set_oom_adj() on each other.
> >>
> >> I also noticed that 44a70adec910d692 is racy [1], and trying to fix the
> >> race will require a global lock which is too costly for rare events. And
> >> Michal Hocko is thinking to change the oom_score_adj implementation to per
> >> mm_struct (with shadowed score stored in per task_struct in order to
> >> support vfork() => __set_oom_adj() => execve() sequence) so that we don't
> >> need the global lock.
> >>
> >> If the worst case in 44a70adec910d692 happened, it is an administrator's
> >> request. Therefore, before changing the oom_score_adj implementation,
> >> let's eliminate the DoS attack vector first.
> > 
> > This is really ridiculous. I have already nacked the previous version
> > and provided two ways around. The simplest one is to drop the printk.
> > The second one is to move oom_score_adj to the mm struct. Could you
> > explain why do you still push for this?
> 
> Dropping printk() does not close the race.

But it does remove the source of a long operation from the RCU context.
If you are not willing to post such a trivial patch I will do so.

> You must propose an alternative patch if you dislike this patch.

I will eventually get there.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2] mm, oom: Tolerate processes sharing mm with different view of oom_score_adj.
  2019-02-01  9:14                   ` Michal Hocko
@ 2019-02-02 11:06                     ` Tetsuo Handa
  2019-02-11 15:07                       ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Tetsuo Handa @ 2019-02-02 11:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, David Rientjes, linux-mm,
	Yong-Taek Lee, Paul McKenney, Linus Torvalds, LKML

On 2019/02/01 18:14, Michal Hocko wrote:
> On Fri 01-02-19 05:59:55, Tetsuo Handa wrote:
>> On 2019/01/31 16:11, Michal Hocko wrote:
>>> This is really ridiculous. I have already nacked the previous version
>>> and provided two ways around. The simplest one is to drop the printk.
>>> The second one is to move oom_score_adj to the mm struct. Could you
>>> explain why do you still push for this?
>>
>> Dropping printk() does not close the race.
> 
> But it does remove the source of a long operation from the RCU context.
> If you are not willing to post such a trivial patch I will do so.
> 
>> You must propose an alternative patch if you dislike this patch.
> 
> I will eventually get there.
> 

This is really ridiculous. "eventually" cannot be justified as a reason for
rejecting this patch. I want a patch which can be easily backported _now_ .

If vfork() => __set_oom_adj() => execve() sequence is permitted, someone can
try vfork() => clone() => __set_oom_adj() => execve() sequence. And below
program demonstrates that task->vfork_done based exemption in __set_oom_adj()
is broken. It is not always the task_struct who called vfork() that will call
execve().

----------------------------------------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>

static int thread1(void *unused)
{
	char *args[3] = { "/bin/true", "true", NULL };
	int fd = open("/proc/self/oom_score_adj", O_WRONLY);
	write(fd, "1000", 4);
	close(fd);
	execve(args[0], args, NULL);
	return 0;
}
int main(int argc, char *argv[])
{
	printf("PID=%d\n", getpid());
	if (vfork() == 0) {
		clone(thread1, malloc(8192) + 8192,
		      CLONE_VM | CLONE_FS | CLONE_FILES, NULL);
		sleep(1);
		_exit(0);
	}
	return 0;
}
----------------------------------------

  PID=8802
  [ 1138.425255] updating oom_score_adj for 8802 (a.out) from 0 to 1000 because it shares mm with 8804 (a.out). Report if this is unexpected.

Current loop to enforce same oom_score_adj is 99%+ ending in vain.
And even your "eventually" will remove this loop.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2] mm, oom: Tolerate processes sharing mm with different view of oom_score_adj.
  2019-02-02 11:06                     ` Tetsuo Handa
@ 2019-02-11 15:07                       ` Michal Hocko
  0 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2019-02-11 15:07 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Andrew Morton, Johannes Weiner, David Rientjes, linux-mm,
	Yong-Taek Lee, Paul McKenney, Linus Torvalds, LKML

On Sat 02-02-19 20:06:07, Tetsuo Handa wrote:
> int main(int argc, char *argv[])
> {
> 	printf("PID=%d\n", getpid());
> 	if (vfork() == 0) {
> 		clone(thread1, malloc(8192) + 8192,
> 		      CLONE_VM | CLONE_FS | CLONE_FILES, NULL);
> 		sleep(1);
> 		_exit(0);
> 	}
> 	return 0;
> }

This program is not correct AFAIU:
   Standard description
       (From POSIX.1) The vfork() function has the same effect as
       fork(2), except that the behavior is undefined if the process
       created by vfork() either modifies any data other than a variable
       of type pid_t used to store the return value from vfork(), or
       returns from the function in which vfork() was called, or calls
       any other function before successfully calling _exit(2) or one of
       the exec(3) family of functions.
> 
>   PID=8802
>   [ 1138.425255] updating oom_score_adj for 8802 (a.out) from 0 to 1000 because it shares mm with 8804 (a.out). Report if this is unexpected.
> 
> Current loop to enforce same oom_score_adj is 99%+ ending in vain.
> And even your "eventually" will remove this loop.

But it keeps the semantic of the mm shared processes share the same
oomd_score_adj so that we do not have to add kludges to the OOM code to
handle with potential corner cases.

Really, this nagging is both unproductive and annoying. You are right
that the printk is overzealous and it can be dropped. The printk is
more than two years old and we haven't heard anybody to care. So the
first and the most obvious thing to do is to remove it. The patch is
trivial and if I was not buried in the backlog I would have posted it
already.  Regarding a potentially expensive for_each_process. This is
unfortunate but the thing we have to pay for in other paths as well
(e.g. exit path) so closing this only here just doesn't really help
much if you are concerned about security and potential stalls will
explode the machine scenarios.. So even though this sucks it is not
earth shattering. CLONE_VM withtout CLONE_SIGHAND simply sucks and
nobody should be using this threading model.

So, please calm down, try to be more productive and try to understand
what people try to tell you rather than shout around "i want my pony".
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-02-11 15:07 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-16 10:55 [PATCH] mm, oom: Tolerate processes sharing mm with different view of oom_score_adj Tetsuo Handa
2019-01-16 11:09 ` Michal Hocko
2019-01-16 11:30   ` Tetsuo Handa
2019-01-16 12:19     ` Michal Hocko
2019-01-16 13:32       ` Tetsuo Handa
2019-01-16 13:41         ` Michal Hocko
2019-01-17 10:40           ` Tetsuo Handa
2019-01-17 15:51           ` Michal Hocko
2019-01-30 22:49             ` [PATCH v2] " Tetsuo Handa
2019-01-31  7:11               ` Michal Hocko
2019-01-31 20:59                 ` Tetsuo Handa
2019-02-01  9:14                   ` Michal Hocko
2019-02-02 11:06                     ` Tetsuo Handa
2019-02-11 15:07                       ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.