linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC 0/2] opportunistic memory reclaim of a killed process
@ 2019-04-11  1:43 Suren Baghdasaryan
  2019-04-11  1:43 ` [RFC 1/2] mm: oom: expose expedite_reclaim to use oom_reaper outside of oom_kill.c Suren Baghdasaryan
                   ` (3 more replies)
  0 siblings, 4 replies; 43+ messages in thread
From: Suren Baghdasaryan @ 2019-04-11  1:43 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, rientjes, willy, yuzhoujian, jrdr.linux, guro, hannes,
	penguin-kernel, ebiederm, shakeelb, christian, minchan,
	timmurray, dancol, joel, jannh, surenb, linux-mm, lsf-pc,
	linux-kernel, kernel-team

The time to kill a process and free its memory can be critical when the
killing was done to prevent memory shortages affecting system
responsiveness.

In the case of Android, where processes can be restarted easily, killing a
less important background process is preferred to delaying or throttling
an interactive foreground process. At the same time unnecessary kills
should be avoided as they cause delays when the killed process is needed
again. This requires a balanced decision from the system software about
how long a kill can be postponed in the hope that memory usage will
decrease without such drastic measures.

As killing a process and reclaiming its memory is not an instant operation,
a margin of free memory has to be maintained to prevent system performance
deterioration while memory of the killed process is being reclaimed. The
size of this margin depends on the minimum reclaim rate to cover the
worst-case scenario and this minimum rate should be deterministic.

Note that on asymmetric architectures like ARM big.LITTLE the reclaim rate
can vary dramatically depending on which core it’s performed at (see test
results). It’s a usual scenario that a non-essential victim process is
being restricted to a less performant or throttled CPU for power saving
purposes. This makes the worst-case reclaim rate scenario very probable.

The cases when victim’s memory reclaim can be delayed further due to
process being blocked in an uninterruptible sleep or when it performs a
time-consuming operation makes the reclaim time even more unpredictable.

Increasing memory reclaim rate and making it more deterministic would
allow for a smaller free memory margin and would lead to more opportunities
to avoid killing a process.

Note that while other strategies like throttling memory allocations are
viable and can be employed for other non-essential processes they would
affect user experience if applied towards an interactive process.

Proposed solution uses existing oom-reaper thread to increase memory
reclaim rate of a killed process and to make this rate more deterministic.
By no means the proposed solution is considered the best and was chosen
because it was simple to implement and allowed for test data collection.
The downside of this solution is that it requires additional “expedite”
hint for something which has to be fast in all cases. Would be great to
find a way that does not require additional hints.

Other possible approaches include:
- Implementing a dedicated syscall to perform opportunistic reclaim in the
context of the process waiting for the victim’s death. A natural boost
bonus occurs if the waiting process has high or RT priority and is not
limited by cpuset cgroup in its CPU choices.
- Implement a mechanism that would perform opportunistic reclaim if it’s
possible unconditionally (similar to checks in task_will_free_mem()).
- Implement opportunistic reclaim that uses shrinker interface, PSI or
other memory pressure indications as a hint to engage.

Test details:
Tests are performed on a Qualcomm® Snapdragon™ 845 8-core ARM big.LITTLE
system with 4 little cores (0.3-1.6GHz) and 4 big cores (0.8-2.5GHz)
running Android.
Memory reclaim speed was measured using signal/signal_generate,
kmem/rss_stat and sched/sched_process_exit traces.

Test results:
powersave governor, min freq
                        normal kills      expedited kills
        little          856 MB/sec        3236 MB/sec
        big             5084 MB/sec       6144 MB/sec

performance governor, max freq
                        normal kills      expedited kills
        little          5602 MB/sec       8144 MB/sec
        big             14656 MB/sec      12398 MB/sec

schedutil governor (default)
                        normal kills      expedited kills
        little          2386 MB/sec       3908 MB/sec
        big             7282 MB/sec       6820-16386 MB/sec
=================================================================
min reclaim speed:      856 MB/sec        3236 MB/sec

The patches are based on 5.1-rc1

Suren Baghdasaryan (2):
  mm: oom: expose expedite_reclaim to use oom_reaper outside of
    oom_kill.c
  signal: extend pidfd_send_signal() to allow expedited process killing

 include/linux/oom.h          |  1 +
 include/linux/sched/signal.h |  3 ++-
 include/linux/signal.h       | 11 ++++++++++-
 ipc/mqueue.c                 |  2 +-
 kernel/signal.c              | 37 ++++++++++++++++++++++++++++--------
 kernel/time/itimer.c         |  2 +-
 mm/oom_kill.c                | 15 +++++++++++++++
 7 files changed, 59 insertions(+), 12 deletions(-)

-- 
2.21.0.392.gf8f6787159e-goog


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC 1/2] mm: oom: expose expedite_reclaim to use oom_reaper outside of oom_kill.c
  2019-04-11  1:43 [RFC 0/2] opportunistic memory reclaim of a killed process Suren Baghdasaryan
@ 2019-04-11  1:43 ` Suren Baghdasaryan
  2019-04-25 21:12   ` Tetsuo Handa
  2019-04-11  1:43 ` [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing Suren Baghdasaryan
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 43+ messages in thread
From: Suren Baghdasaryan @ 2019-04-11  1:43 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, rientjes, willy, yuzhoujian, jrdr.linux, guro, hannes,
	penguin-kernel, ebiederm, shakeelb, christian, minchan,
	timmurray, dancol, joel, jannh, surenb, linux-mm, lsf-pc,
	linux-kernel, kernel-team

Create an API to allow users outside of oom_kill.c to mark a victim and
wake up oom_reaper thread for expedited memory reclaim of the process being
killed.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/oom.h |  1 +
 mm/oom_kill.c       | 15 +++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index d07992009265..6c043c7518c1 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -112,6 +112,7 @@ extern unsigned long oom_badness(struct task_struct *p,
 		unsigned long totalpages);
 
 extern bool out_of_memory(struct oom_control *oc);
+extern bool expedite_reclaim(struct task_struct *task);
 
 extern void exit_oom_victim(void);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3a2484884cfd..6449710c8a06 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1102,6 +1102,21 @@ bool out_of_memory(struct oom_control *oc)
 	return !!oc->chosen;
 }
 
+bool expedite_reclaim(struct task_struct *task)
+{
+	bool res = false;
+
+	task_lock(task);
+	if (task_will_free_mem(task)) {
+		mark_oom_victim(task);
+		wake_oom_reaper(task);
+		res = true;
+	}
+	task_unlock(task);
+
+	return res;
+}
+
 /*
  * The pagefault handler calls here because it is out of memory, so kill a
  * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
-- 
2.21.0.392.gf8f6787159e-goog


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-11  1:43 [RFC 0/2] opportunistic memory reclaim of a killed process Suren Baghdasaryan
  2019-04-11  1:43 ` [RFC 1/2] mm: oom: expose expedite_reclaim to use oom_reaper outside of oom_kill.c Suren Baghdasaryan
@ 2019-04-11  1:43 ` Suren Baghdasaryan
  2019-04-11 10:30   ` Christian Brauner
  2019-04-11 15:33   ` Matthew Wilcox
  2019-04-11 10:51 ` [RFC 0/2] opportunistic memory reclaim of a killed process Michal Hocko
  2019-04-11 11:51 ` [Lsf-pc] " Rik van Riel
  3 siblings, 2 replies; 43+ messages in thread
From: Suren Baghdasaryan @ 2019-04-11  1:43 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, rientjes, willy, yuzhoujian, jrdr.linux, guro, hannes,
	penguin-kernel, ebiederm, shakeelb, christian, minchan,
	timmurray, dancol, joel, jannh, surenb, linux-mm, lsf-pc,
	linux-kernel, kernel-team

Add new SS_EXPEDITE flag to be used when sending SIGKILL via
pidfd_send_signal() syscall to allow expedited memory reclaim of the
victim process. The usage of this flag is currently limited to SIGKILL
signal and only to privileged users.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/sched/signal.h |  3 ++-
 include/linux/signal.h       | 11 ++++++++++-
 ipc/mqueue.c                 |  2 +-
 kernel/signal.c              | 37 ++++++++++++++++++++++++++++--------
 kernel/time/itimer.c         |  2 +-
 5 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index e412c092c1e8..8a227633a058 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -327,7 +327,8 @@ extern int send_sig_info(int, struct kernel_siginfo *, struct task_struct *);
 extern void force_sigsegv(int sig, struct task_struct *p);
 extern int force_sig_info(int, struct kernel_siginfo *, struct task_struct *);
 extern int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp);
-extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid);
+extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid,
+				bool expedite);
 extern int kill_pid_info_as_cred(int, struct kernel_siginfo *, struct pid *,
 				const struct cred *);
 extern int kill_pgrp(struct pid *pid, int sig, int priv);
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 9702016734b1..34b7852aa4a0 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -446,8 +446,17 @@ int __save_altstack(stack_t __user *, unsigned long);
 } while (0);
 
 #ifdef CONFIG_PROC_FS
+
+/*
+ * SS_FLAGS values used in pidfd_send_signal:
+ *
+ * SS_EXPEDITE indicates desire to expedite the operation.
+ */
+#define SS_EXPEDITE	0x00000001
+
 struct seq_file;
 extern void render_sigset_t(struct seq_file *, const char *, sigset_t *);
-#endif
+
+#endif /* CONFIG_PROC_FS */
 
 #endif /* _LINUX_SIGNAL_H */
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index aea30530c472..27c66296e08e 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -720,7 +720,7 @@ static void __do_notify(struct mqueue_inode_info *info)
 			rcu_read_unlock();
 
 			kill_pid_info(info->notify.sigev_signo,
-				      &sig_i, info->notify_owner);
+				      &sig_i, info->notify_owner, false);
 			break;
 		case SIGEV_THREAD:
 			set_cookie(info->notify_cookie, NOTIFY_WOKENUP);
diff --git a/kernel/signal.c b/kernel/signal.c
index f98448cf2def..02ed4332d17c 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -43,6 +43,7 @@
 #include <linux/compiler.h>
 #include <linux/posix-timers.h>
 #include <linux/livepatch.h>
+#include <linux/oom.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/signal.h>
@@ -1394,7 +1395,8 @@ int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp)
 	return success ? 0 : retval;
 }
 
-int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid)
+int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid,
+				  bool expedite)
 {
 	int error = -ESRCH;
 	struct task_struct *p;
@@ -1402,8 +1404,17 @@ int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid)
 	for (;;) {
 		rcu_read_lock();
 		p = pid_task(pid, PIDTYPE_PID);
-		if (p)
+		if (p) {
 			error = group_send_sig_info(sig, info, p, PIDTYPE_TGID);
+
+			/*
+			 * Ignore expedite_reclaim return value, it is best
+			 * effort only.
+			 */
+			if (!error && expedite)
+				expedite_reclaim(p);
+		}
+
 		rcu_read_unlock();
 		if (likely(!p || error != -ESRCH))
 			return error;
@@ -1420,7 +1431,7 @@ static int kill_proc_info(int sig, struct kernel_siginfo *info, pid_t pid)
 {
 	int error;
 	rcu_read_lock();
-	error = kill_pid_info(sig, info, find_vpid(pid));
+	error = kill_pid_info(sig, info, find_vpid(pid), false);
 	rcu_read_unlock();
 	return error;
 }
@@ -1487,7 +1498,7 @@ static int kill_something_info(int sig, struct kernel_siginfo *info, pid_t pid)
 
 	if (pid > 0) {
 		rcu_read_lock();
-		ret = kill_pid_info(sig, info, find_vpid(pid));
+		ret = kill_pid_info(sig, info, find_vpid(pid), false);
 		rcu_read_unlock();
 		return ret;
 	}
@@ -1704,7 +1715,7 @@ EXPORT_SYMBOL(kill_pgrp);
 
 int kill_pid(struct pid *pid, int sig, int priv)
 {
-	return kill_pid_info(sig, __si_special(priv), pid);
+	return kill_pid_info(sig, __si_special(priv), pid, false);
 }
 EXPORT_SYMBOL(kill_pid);
 
@@ -3577,10 +3588,20 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
 	struct pid *pid;
 	kernel_siginfo_t kinfo;
 
-	/* Enforce flags be set to 0 until we add an extension. */
-	if (flags)
+	/* Enforce no unknown flags. */
+	if (flags & ~SS_EXPEDITE)
 		return -EINVAL;
 
+	if (flags & SS_EXPEDITE) {
+		/* Enforce SS_EXPEDITE to be used with SIGKILL only. */
+		if (sig != SIGKILL)
+			return -EINVAL;
+
+		/* Limit expedited killing to privileged users only. */
+		if (!capable(CAP_SYS_NICE))
+			return -EPERM;
+	}
+
 	f = fdget_raw(pidfd);
 	if (!f.file)
 		return -EBADF;
@@ -3614,7 +3635,7 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
 		prepare_kill_siginfo(sig, &kinfo);
 	}
 
-	ret = kill_pid_info(sig, &kinfo, pid);
+	ret = kill_pid_info(sig, &kinfo, pid, (flags & SS_EXPEDITE) != 0);
 
 err:
 	fdput(f);
diff --git a/kernel/time/itimer.c b/kernel/time/itimer.c
index 02068b2d5862..c926483cdb53 100644
--- a/kernel/time/itimer.c
+++ b/kernel/time/itimer.c
@@ -140,7 +140,7 @@ enum hrtimer_restart it_real_fn(struct hrtimer *timer)
 	struct pid *leader_pid = sig->pids[PIDTYPE_TGID];
 
 	trace_itimer_expire(ITIMER_REAL, leader_pid, 0);
-	kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid);
+	kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid, false);
 
 	return HRTIMER_NORESTART;
 }
-- 
2.21.0.392.gf8f6787159e-goog


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-11  1:43 ` [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing Suren Baghdasaryan
@ 2019-04-11 10:30   ` Christian Brauner
  2019-04-11 10:34     ` Christian Brauner
  2019-04-11 15:18     ` Suren Baghdasaryan
  2019-04-11 15:33   ` Matthew Wilcox
  1 sibling, 2 replies; 43+ messages in thread
From: Christian Brauner @ 2019-04-11 10:30 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, mhocko, rientjes, willy, yuzhoujian, jrdr.linux, guro,
	hannes, penguin-kernel, ebiederm, shakeelb, minchan, timmurray,
	dancol, joel, jannh, linux-mm, lsf-pc, linux-kernel, kernel-team

On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> pidfd_send_signal() syscall to allow expedited memory reclaim of the
> victim process. The usage of this flag is currently limited to SIGKILL
> signal and only to privileged users.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  include/linux/sched/signal.h |  3 ++-
>  include/linux/signal.h       | 11 ++++++++++-
>  ipc/mqueue.c                 |  2 +-
>  kernel/signal.c              | 37 ++++++++++++++++++++++++++++--------
>  kernel/time/itimer.c         |  2 +-
>  5 files changed, 43 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index e412c092c1e8..8a227633a058 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -327,7 +327,8 @@ extern int send_sig_info(int, struct kernel_siginfo *, struct task_struct *);
>  extern void force_sigsegv(int sig, struct task_struct *p);
>  extern int force_sig_info(int, struct kernel_siginfo *, struct task_struct *);
>  extern int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp);
> -extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid);
> +extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid,
> +				bool expedite);
>  extern int kill_pid_info_as_cred(int, struct kernel_siginfo *, struct pid *,
>  				const struct cred *);
>  extern int kill_pgrp(struct pid *pid, int sig, int priv);
> diff --git a/include/linux/signal.h b/include/linux/signal.h
> index 9702016734b1..34b7852aa4a0 100644
> --- a/include/linux/signal.h
> +++ b/include/linux/signal.h
> @@ -446,8 +446,17 @@ int __save_altstack(stack_t __user *, unsigned long);
>  } while (0);
>  
>  #ifdef CONFIG_PROC_FS
> +
> +/*
> + * SS_FLAGS values used in pidfd_send_signal:
> + *
> + * SS_EXPEDITE indicates desire to expedite the operation.
> + */
> +#define SS_EXPEDITE	0x00000001

Does this make sense as an SS_* flag?
How does this relate to the signal stack?
Is there any intention to ever use this flag with stack_t?

New flags should be PIDFD_SIGNAL_*. (E.g. the thread flag will be
PIDFD_SIGNAL_THREAD.)
And since this is exposed to userspace in contrast to the mm internal
naming it should be something more easily understandable like
PIDFD_SIGNAL_MM_RECLAIM{_FASTER} or something.

> +
>  struct seq_file;
>  extern void render_sigset_t(struct seq_file *, const char *, sigset_t *);
> -#endif
> +
> +#endif /* CONFIG_PROC_FS */
>  
>  #endif /* _LINUX_SIGNAL_H */
> diff --git a/ipc/mqueue.c b/ipc/mqueue.c
> index aea30530c472..27c66296e08e 100644
> --- a/ipc/mqueue.c
> +++ b/ipc/mqueue.c
> @@ -720,7 +720,7 @@ static void __do_notify(struct mqueue_inode_info *info)
>  			rcu_read_unlock();
>  
>  			kill_pid_info(info->notify.sigev_signo,
> -				      &sig_i, info->notify_owner);
> +				      &sig_i, info->notify_owner, false);
>  			break;
>  		case SIGEV_THREAD:
>  			set_cookie(info->notify_cookie, NOTIFY_WOKENUP);
> diff --git a/kernel/signal.c b/kernel/signal.c
> index f98448cf2def..02ed4332d17c 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -43,6 +43,7 @@
>  #include <linux/compiler.h>
>  #include <linux/posix-timers.h>
>  #include <linux/livepatch.h>
> +#include <linux/oom.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/signal.h>
> @@ -1394,7 +1395,8 @@ int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp)
>  	return success ? 0 : retval;
>  }
>  
> -int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid)
> +int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid,
> +				  bool expedite)
>  {
>  	int error = -ESRCH;
>  	struct task_struct *p;
> @@ -1402,8 +1404,17 @@ int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid)
>  	for (;;) {
>  		rcu_read_lock();
>  		p = pid_task(pid, PIDTYPE_PID);
> -		if (p)
> +		if (p) {
>  			error = group_send_sig_info(sig, info, p, PIDTYPE_TGID);
> +
> +			/*
> +			 * Ignore expedite_reclaim return value, it is best
> +			 * effort only.
> +			 */
> +			if (!error && expedite)
> +				expedite_reclaim(p);

SIGKILL will take the whole thread group down so the reclaim should make
sense here.

> +		}
> +
>  		rcu_read_unlock();
>  		if (likely(!p || error != -ESRCH))
>  			return error;
> @@ -1420,7 +1431,7 @@ static int kill_proc_info(int sig, struct kernel_siginfo *info, pid_t pid)
>  {
>  	int error;
>  	rcu_read_lock();
> -	error = kill_pid_info(sig, info, find_vpid(pid));
> +	error = kill_pid_info(sig, info, find_vpid(pid), false);
>  	rcu_read_unlock();
>  	return error;
>  }
> @@ -1487,7 +1498,7 @@ static int kill_something_info(int sig, struct kernel_siginfo *info, pid_t pid)
>  
>  	if (pid > 0) {
>  		rcu_read_lock();
> -		ret = kill_pid_info(sig, info, find_vpid(pid));
> +		ret = kill_pid_info(sig, info, find_vpid(pid), false);
>  		rcu_read_unlock();
>  		return ret;
>  	}
> @@ -1704,7 +1715,7 @@ EXPORT_SYMBOL(kill_pgrp);
>  
>  int kill_pid(struct pid *pid, int sig, int priv)
>  {
> -	return kill_pid_info(sig, __si_special(priv), pid);
> +	return kill_pid_info(sig, __si_special(priv), pid, false);
>  }
>  EXPORT_SYMBOL(kill_pid);
>  
> @@ -3577,10 +3588,20 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
>  	struct pid *pid;
>  	kernel_siginfo_t kinfo;
>  
> -	/* Enforce flags be set to 0 until we add an extension. */
> -	if (flags)
> +	/* Enforce no unknown flags. */
> +	if (flags & ~SS_EXPEDITE)
>  		return -EINVAL;
>  
> +	if (flags & SS_EXPEDITE) {
> +		/* Enforce SS_EXPEDITE to be used with SIGKILL only. */
> +		if (sig != SIGKILL)
> +			return -EINVAL;

Not super fond of this being a SIGKILL specific flag but I get why.

> +
> +		/* Limit expedited killing to privileged users only. */
> +		if (!capable(CAP_SYS_NICE))
> +			return -EPERM;

Do you have a specific (DOS or other) attack vector in mind that renders
ns_capable unsuitable?

> +	}
> +
>  	f = fdget_raw(pidfd);
>  	if (!f.file)
>  		return -EBADF;
> @@ -3614,7 +3635,7 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
>  		prepare_kill_siginfo(sig, &kinfo);
>  	}
>  
> -	ret = kill_pid_info(sig, &kinfo, pid);
> +	ret = kill_pid_info(sig, &kinfo, pid, (flags & SS_EXPEDITE) != 0);
>  
>  err:
>  	fdput(f);
> diff --git a/kernel/time/itimer.c b/kernel/time/itimer.c
> index 02068b2d5862..c926483cdb53 100644
> --- a/kernel/time/itimer.c
> +++ b/kernel/time/itimer.c
> @@ -140,7 +140,7 @@ enum hrtimer_restart it_real_fn(struct hrtimer *timer)
>  	struct pid *leader_pid = sig->pids[PIDTYPE_TGID];
>  
>  	trace_itimer_expire(ITIMER_REAL, leader_pid, 0);
> -	kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid);
> +	kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid, false);
>  
>  	return HRTIMER_NORESTART;
>  }
> -- 
> 2.21.0.392.gf8f6787159e-goog
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-11 10:30   ` Christian Brauner
@ 2019-04-11 10:34     ` Christian Brauner
  2019-04-11 15:18     ` Suren Baghdasaryan
  1 sibling, 0 replies; 43+ messages in thread
From: Christian Brauner @ 2019-04-11 10:34 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, mhocko, rientjes, willy, yuzhoujian, jrdr.linux, guro,
	hannes, penguin-kernel, ebiederm, shakeelb, minchan, timmurray,
	dancol, joel, jannh, linux-mm, lsf-pc, linux-kernel, kernel-team,
	oleg

Cc: Oleg too

On Thu, Apr 11, 2019 at 12:30:18PM +0200, Christian Brauner wrote:
> On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > victim process. The usage of this flag is currently limited to SIGKILL
> > signal and only to privileged users.
> > 
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  include/linux/sched/signal.h |  3 ++-
> >  include/linux/signal.h       | 11 ++++++++++-
> >  ipc/mqueue.c                 |  2 +-
> >  kernel/signal.c              | 37 ++++++++++++++++++++++++++++--------
> >  kernel/time/itimer.c         |  2 +-
> >  5 files changed, 43 insertions(+), 12 deletions(-)
> > 
> > diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> > index e412c092c1e8..8a227633a058 100644
> > --- a/include/linux/sched/signal.h
> > +++ b/include/linux/sched/signal.h
> > @@ -327,7 +327,8 @@ extern int send_sig_info(int, struct kernel_siginfo *, struct task_struct *);
> >  extern void force_sigsegv(int sig, struct task_struct *p);
> >  extern int force_sig_info(int, struct kernel_siginfo *, struct task_struct *);
> >  extern int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp);
> > -extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid);
> > +extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid,
> > +				bool expedite);
> >  extern int kill_pid_info_as_cred(int, struct kernel_siginfo *, struct pid *,
> >  				const struct cred *);
> >  extern int kill_pgrp(struct pid *pid, int sig, int priv);
> > diff --git a/include/linux/signal.h b/include/linux/signal.h
> > index 9702016734b1..34b7852aa4a0 100644
> > --- a/include/linux/signal.h
> > +++ b/include/linux/signal.h
> > @@ -446,8 +446,17 @@ int __save_altstack(stack_t __user *, unsigned long);
> >  } while (0);
> >  
> >  #ifdef CONFIG_PROC_FS
> > +
> > +/*
> > + * SS_FLAGS values used in pidfd_send_signal:
> > + *
> > + * SS_EXPEDITE indicates desire to expedite the operation.
> > + */
> > +#define SS_EXPEDITE	0x00000001
> 
> Does this make sense as an SS_* flag?
> How does this relate to the signal stack?
> Is there any intention to ever use this flag with stack_t?
> 
> New flags should be PIDFD_SIGNAL_*. (E.g. the thread flag will be
> PIDFD_SIGNAL_THREAD.)
> And since this is exposed to userspace in contrast to the mm internal
> naming it should be something more easily understandable like
> PIDFD_SIGNAL_MM_RECLAIM{_FASTER} or something.
> 
> > +
> >  struct seq_file;
> >  extern void render_sigset_t(struct seq_file *, const char *, sigset_t *);
> > -#endif
> > +
> > +#endif /* CONFIG_PROC_FS */
> >  
> >  #endif /* _LINUX_SIGNAL_H */
> > diff --git a/ipc/mqueue.c b/ipc/mqueue.c
> > index aea30530c472..27c66296e08e 100644
> > --- a/ipc/mqueue.c
> > +++ b/ipc/mqueue.c
> > @@ -720,7 +720,7 @@ static void __do_notify(struct mqueue_inode_info *info)
> >  			rcu_read_unlock();
> >  
> >  			kill_pid_info(info->notify.sigev_signo,
> > -				      &sig_i, info->notify_owner);
> > +				      &sig_i, info->notify_owner, false);
> >  			break;
> >  		case SIGEV_THREAD:
> >  			set_cookie(info->notify_cookie, NOTIFY_WOKENUP);
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index f98448cf2def..02ed4332d17c 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -43,6 +43,7 @@
> >  #include <linux/compiler.h>
> >  #include <linux/posix-timers.h>
> >  #include <linux/livepatch.h>
> > +#include <linux/oom.h>
> >  
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/signal.h>
> > @@ -1394,7 +1395,8 @@ int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp)
> >  	return success ? 0 : retval;
> >  }
> >  
> > -int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid)
> > +int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid,
> > +				  bool expedite)
> >  {
> >  	int error = -ESRCH;
> >  	struct task_struct *p;
> > @@ -1402,8 +1404,17 @@ int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid)
> >  	for (;;) {
> >  		rcu_read_lock();
> >  		p = pid_task(pid, PIDTYPE_PID);
> > -		if (p)
> > +		if (p) {
> >  			error = group_send_sig_info(sig, info, p, PIDTYPE_TGID);
> > +
> > +			/*
> > +			 * Ignore expedite_reclaim return value, it is best
> > +			 * effort only.
> > +			 */
> > +			if (!error && expedite)
> > +				expedite_reclaim(p);
> 
> SIGKILL will take the whole thread group down so the reclaim should make
> sense here.
> 
> > +		}
> > +
> >  		rcu_read_unlock();
> >  		if (likely(!p || error != -ESRCH))
> >  			return error;
> > @@ -1420,7 +1431,7 @@ static int kill_proc_info(int sig, struct kernel_siginfo *info, pid_t pid)
> >  {
> >  	int error;
> >  	rcu_read_lock();
> > -	error = kill_pid_info(sig, info, find_vpid(pid));
> > +	error = kill_pid_info(sig, info, find_vpid(pid), false);
> >  	rcu_read_unlock();
> >  	return error;
> >  }
> > @@ -1487,7 +1498,7 @@ static int kill_something_info(int sig, struct kernel_siginfo *info, pid_t pid)
> >  
> >  	if (pid > 0) {
> >  		rcu_read_lock();
> > -		ret = kill_pid_info(sig, info, find_vpid(pid));
> > +		ret = kill_pid_info(sig, info, find_vpid(pid), false);
> >  		rcu_read_unlock();
> >  		return ret;
> >  	}
> > @@ -1704,7 +1715,7 @@ EXPORT_SYMBOL(kill_pgrp);
> >  
> >  int kill_pid(struct pid *pid, int sig, int priv)
> >  {
> > -	return kill_pid_info(sig, __si_special(priv), pid);
> > +	return kill_pid_info(sig, __si_special(priv), pid, false);
> >  }
> >  EXPORT_SYMBOL(kill_pid);
> >  
> > @@ -3577,10 +3588,20 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> >  	struct pid *pid;
> >  	kernel_siginfo_t kinfo;
> >  
> > -	/* Enforce flags be set to 0 until we add an extension. */
> > -	if (flags)
> > +	/* Enforce no unknown flags. */
> > +	if (flags & ~SS_EXPEDITE)
> >  		return -EINVAL;
> >  
> > +	if (flags & SS_EXPEDITE) {
> > +		/* Enforce SS_EXPEDITE to be used with SIGKILL only. */
> > +		if (sig != SIGKILL)
> > +			return -EINVAL;
> 
> Not super fond of this being a SIGKILL specific flag but I get why.
> 
> > +
> > +		/* Limit expedited killing to privileged users only. */
> > +		if (!capable(CAP_SYS_NICE))
> > +			return -EPERM;
> 
> Do you have a specific (DOS or other) attack vector in mind that renders
> ns_capable unsuitable?
> 
> > +	}
> > +
> >  	f = fdget_raw(pidfd);
> >  	if (!f.file)
> >  		return -EBADF;
> > @@ -3614,7 +3635,7 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> >  		prepare_kill_siginfo(sig, &kinfo);
> >  	}
> >  
> > -	ret = kill_pid_info(sig, &kinfo, pid);
> > +	ret = kill_pid_info(sig, &kinfo, pid, (flags & SS_EXPEDITE) != 0);
> >  
> >  err:
> >  	fdput(f);
> > diff --git a/kernel/time/itimer.c b/kernel/time/itimer.c
> > index 02068b2d5862..c926483cdb53 100644
> > --- a/kernel/time/itimer.c
> > +++ b/kernel/time/itimer.c
> > @@ -140,7 +140,7 @@ enum hrtimer_restart it_real_fn(struct hrtimer *timer)
> >  	struct pid *leader_pid = sig->pids[PIDTYPE_TGID];
> >  
> >  	trace_itimer_expire(ITIMER_REAL, leader_pid, 0);
> > -	kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid);
> > +	kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid, false);
> >  
> >  	return HRTIMER_NORESTART;
> >  }
> > -- 
> > 2.21.0.392.gf8f6787159e-goog
> > 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 0/2] opportunistic memory reclaim of a killed process
  2019-04-11  1:43 [RFC 0/2] opportunistic memory reclaim of a killed process Suren Baghdasaryan
  2019-04-11  1:43 ` [RFC 1/2] mm: oom: expose expedite_reclaim to use oom_reaper outside of oom_kill.c Suren Baghdasaryan
  2019-04-11  1:43 ` [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing Suren Baghdasaryan
@ 2019-04-11 10:51 ` Michal Hocko
  2019-04-11 16:18   ` Joel Fernandes
                     ` (3 more replies)
  2019-04-11 11:51 ` [Lsf-pc] " Rik van Riel
  3 siblings, 4 replies; 43+ messages in thread
From: Michal Hocko @ 2019-04-11 10:51 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, rientjes, willy, yuzhoujian, jrdr.linux, guro, hannes,
	penguin-kernel, ebiederm, shakeelb, christian, minchan,
	timmurray, dancol, joel, jannh, linux-mm, lsf-pc, linux-kernel,
	kernel-team

On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
[...]
> Proposed solution uses existing oom-reaper thread to increase memory
> reclaim rate of a killed process and to make this rate more deterministic.
> By no means the proposed solution is considered the best and was chosen
> because it was simple to implement and allowed for test data collection.
> The downside of this solution is that it requires additional “expedite”
> hint for something which has to be fast in all cases. Would be great to
> find a way that does not require additional hints.

I have to say I do not like this much. It is abusing an implementation
detail of the OOM implementation and makes it an official API. Also
there are some non trivial assumptions to be fullfilled to use the
current oom_reaper. First of all all the process groups that share the
address space have to be killed. How do you want to guarantee/implement
that with a simply kill to a thread/process group?

> Other possible approaches include:
> - Implementing a dedicated syscall to perform opportunistic reclaim in the
> context of the process waiting for the victim’s death. A natural boost
> bonus occurs if the waiting process has high or RT priority and is not
> limited by cpuset cgroup in its CPU choices.
> - Implement a mechanism that would perform opportunistic reclaim if it’s
> possible unconditionally (similar to checks in task_will_free_mem()).
> - Implement opportunistic reclaim that uses shrinker interface, PSI or
> other memory pressure indications as a hint to engage.

I would question whether we really need this at all? Relying on the exit
speed sounds like a fundamental design problem of anything that relies
on it. Sure task exit might be slow, but async mm tear down is just a
mere optimization this is not guaranteed to really help in speading
things up. OOM killer uses it as a guarantee for a forward progress in a
finite time rather than as soon as possible.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf-pc] [RFC 0/2] opportunistic memory reclaim of a killed process
  2019-04-11  1:43 [RFC 0/2] opportunistic memory reclaim of a killed process Suren Baghdasaryan
                   ` (2 preceding siblings ...)
  2019-04-11 10:51 ` [RFC 0/2] opportunistic memory reclaim of a killed process Michal Hocko
@ 2019-04-11 11:51 ` Rik van Riel
  2019-04-11 12:16   ` Michal Hocko
  3 siblings, 1 reply; 43+ messages in thread
From: Rik van Riel @ 2019-04-11 11:51 UTC (permalink / raw)
  To: Suren Baghdasaryan, akpm
  Cc: dancol, mhocko, jannh, minchan, penguin-kernel, kernel-team,
	rientjes, linux-kernel, willy, linux-mm, hannes, shakeelb,
	jrdr.linux, yuzhoujian, joel, timmurray, lsf-pc, guro, christian,
	ebiederm

[-- Attachment #1: Type: text/plain, Size: 561 bytes --]

On Wed, 2019-04-10 at 18:43 -0700, Suren Baghdasaryan via Lsf-pc wrote:
> The time to kill a process and free its memory can be critical when
> the
> killing was done to prevent memory shortages affecting system
> responsiveness.

The OOM killer is fickle, and often takes a fairly
long time to trigger. Speeding up what happens after
that seems like the wrong thing to optimize.

Have you considered using something like oomd to
proactively kill tasks when memory gets low, so
you do not have to wait for an OOM kill?

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf-pc] [RFC 0/2] opportunistic memory reclaim of a killed process
  2019-04-11 11:51 ` [Lsf-pc] " Rik van Riel
@ 2019-04-11 12:16   ` Michal Hocko
  2019-04-11 16:54     ` Suren Baghdasaryan
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2019-04-11 12:16 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Suren Baghdasaryan, akpm, dancol, jannh, minchan, penguin-kernel,
	kernel-team, rientjes, linux-kernel, willy, linux-mm, hannes,
	shakeelb, jrdr.linux, yuzhoujian, joel, timmurray, lsf-pc, guro,
	christian, ebiederm

On Thu 11-04-19 07:51:21, Rik van Riel wrote:
> On Wed, 2019-04-10 at 18:43 -0700, Suren Baghdasaryan via Lsf-pc wrote:
> > The time to kill a process and free its memory can be critical when
> > the
> > killing was done to prevent memory shortages affecting system
> > responsiveness.
> 
> The OOM killer is fickle, and often takes a fairly
> long time to trigger. Speeding up what happens after
> that seems like the wrong thing to optimize.
> 
> Have you considered using something like oomd to
> proactively kill tasks when memory gets low, so
> you do not have to wait for an OOM kill?

AFAIU, this is the point here. They probably have a user space OOM
killer implementation and want to achieve killing to be as swift as
possible.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-11 10:30   ` Christian Brauner
  2019-04-11 10:34     ` Christian Brauner
@ 2019-04-11 15:18     ` Suren Baghdasaryan
  2019-04-11 15:23       ` Suren Baghdasaryan
  1 sibling, 1 reply; 43+ messages in thread
From: Suren Baghdasaryan @ 2019-04-11 15:18 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andrew Morton, mhocko, David Rientjes, Matthew Wilcox,
	yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner,
	Tetsuo Handa, ebiederm, Shakeel Butt, Minchan Kim, Tim Murray,
	Daniel Colascione, Joel Fernandes, Jann Horn, linux-mm, lsf-pc,
	LKML, kernel-team, Oleg Nesterov

Thanks for the feedback!
Just to be clear, this implementation is used in this RFC as a
reference to explain the intent. To be honest I don't think it will be
adopted as is even if the idea survives scrutiny.

On Thu, Apr 11, 2019 at 3:30 AM Christian Brauner <christian@brauner.io> wrote:
>
> On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > victim process. The usage of this flag is currently limited to SIGKILL
> > signal and only to privileged users.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  include/linux/sched/signal.h |  3 ++-
> >  include/linux/signal.h       | 11 ++++++++++-
> >  ipc/mqueue.c                 |  2 +-
> >  kernel/signal.c              | 37 ++++++++++++++++++++++++++++--------
> >  kernel/time/itimer.c         |  2 +-
> >  5 files changed, 43 insertions(+), 12 deletions(-)
> >
> > diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> > index e412c092c1e8..8a227633a058 100644
> > --- a/include/linux/sched/signal.h
> > +++ b/include/linux/sched/signal.h
> > @@ -327,7 +327,8 @@ extern int send_sig_info(int, struct kernel_siginfo *, struct task_struct *);
> >  extern void force_sigsegv(int sig, struct task_struct *p);
> >  extern int force_sig_info(int, struct kernel_siginfo *, struct task_struct *);
> >  extern int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp);
> > -extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid);
> > +extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid,
> > +                             bool expedite);
> >  extern int kill_pid_info_as_cred(int, struct kernel_siginfo *, struct pid *,
> >                               const struct cred *);
> >  extern int kill_pgrp(struct pid *pid, int sig, int priv);
> > diff --git a/include/linux/signal.h b/include/linux/signal.h
> > index 9702016734b1..34b7852aa4a0 100644
> > --- a/include/linux/signal.h
> > +++ b/include/linux/signal.h
> > @@ -446,8 +446,17 @@ int __save_altstack(stack_t __user *, unsigned long);
> >  } while (0);
> >
> >  #ifdef CONFIG_PROC_FS
> > +
> > +/*
> > + * SS_FLAGS values used in pidfd_send_signal:
> > + *
> > + * SS_EXPEDITE indicates desire to expedite the operation.
> > + */
> > +#define SS_EXPEDITE  0x00000001
>
> Does this make sense as an SS_* flag?
> How does this relate to the signal stack?

It doesn't, so I agree that the name should be changed.
PIDFD_SIGNAL_EXPEDITE_MM_RECLAIM would seem appropriate.

> Is there any intention to ever use this flag with stack_t?
>
> New flags should be PIDFD_SIGNAL_*. (E.g. the thread flag will be
> PIDFD_SIGNAL_THREAD.)
> And since this is exposed to userspace in contrast to the mm internal
> naming it should be something more easily understandable like
> PIDFD_SIGNAL_MM_RECLAIM{_FASTER} or something.
>
> > +
> >  struct seq_file;
> >  extern void render_sigset_t(struct seq_file *, const char *, sigset_t *);
> > -#endif
> > +
> > +#endif /* CONFIG_PROC_FS */
> >
> >  #endif /* _LINUX_SIGNAL_H */
> > diff --git a/ipc/mqueue.c b/ipc/mqueue.c
> > index aea30530c472..27c66296e08e 100644
> > --- a/ipc/mqueue.c
> > +++ b/ipc/mqueue.c
> > @@ -720,7 +720,7 @@ static void __do_notify(struct mqueue_inode_info *info)
> >                       rcu_read_unlock();
> >
> >                       kill_pid_info(info->notify.sigev_signo,
> > -                                   &sig_i, info->notify_owner);
> > +                                   &sig_i, info->notify_owner, false);
> >                       break;
> >               case SIGEV_THREAD:
> >                       set_cookie(info->notify_cookie, NOTIFY_WOKENUP);
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index f98448cf2def..02ed4332d17c 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -43,6 +43,7 @@
> >  #include <linux/compiler.h>
> >  #include <linux/posix-timers.h>
> >  #include <linux/livepatch.h>
> > +#include <linux/oom.h>
> >
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/signal.h>
> > @@ -1394,7 +1395,8 @@ int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp)
> >       return success ? 0 : retval;
> >  }
> >
> > -int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid)
> > +int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid,
> > +                               bool expedite)
> >  {
> >       int error = -ESRCH;
> >       struct task_struct *p;
> > @@ -1402,8 +1404,17 @@ int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid)
> >       for (;;) {
> >               rcu_read_lock();
> >               p = pid_task(pid, PIDTYPE_PID);
> > -             if (p)
> > +             if (p) {
> >                       error = group_send_sig_info(sig, info, p, PIDTYPE_TGID);
> > +
> > +                     /*
> > +                      * Ignore expedite_reclaim return value, it is best
> > +                      * effort only.
> > +                      */
> > +                     if (!error && expedite)
> > +                             expedite_reclaim(p);
>
> SIGKILL will take the whole thread group down so the reclaim should make
> sense here.
>

This sounds like confirmation. I hope I'm not missing some flaw that
you are trying to point out.

> > +             }
> > +
> >               rcu_read_unlock();
> >               if (likely(!p || error != -ESRCH))
> >                       return error;
> > @@ -1420,7 +1431,7 @@ static int kill_proc_info(int sig, struct kernel_siginfo *info, pid_t pid)
> >  {
> >       int error;
> >       rcu_read_lock();
> > -     error = kill_pid_info(sig, info, find_vpid(pid));
> > +     error = kill_pid_info(sig, info, find_vpid(pid), false);
> >       rcu_read_unlock();
> >       return error;
> >  }
> > @@ -1487,7 +1498,7 @@ static int kill_something_info(int sig, struct kernel_siginfo *info, pid_t pid)
> >
> >       if (pid > 0) {
> >               rcu_read_lock();
> > -             ret = kill_pid_info(sig, info, find_vpid(pid));
> > +             ret = kill_pid_info(sig, info, find_vpid(pid), false);
> >               rcu_read_unlock();
> >               return ret;
> >       }
> > @@ -1704,7 +1715,7 @@ EXPORT_SYMBOL(kill_pgrp);
> >
> >  int kill_pid(struct pid *pid, int sig, int priv)
> >  {
> > -     return kill_pid_info(sig, __si_special(priv), pid);
> > +     return kill_pid_info(sig, __si_special(priv), pid, false);
> >  }
> >  EXPORT_SYMBOL(kill_pid);
> >
> > @@ -3577,10 +3588,20 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> >       struct pid *pid;
> >       kernel_siginfo_t kinfo;
> >
> > -     /* Enforce flags be set to 0 until we add an extension. */
> > -     if (flags)
> > +     /* Enforce no unknown flags. */
> > +     if (flags & ~SS_EXPEDITE)
> >               return -EINVAL;
> >
> > +     if (flags & SS_EXPEDITE) {
> > +             /* Enforce SS_EXPEDITE to be used with SIGKILL only. */
> > +             if (sig != SIGKILL)
> > +                     return -EINVAL;
>
> Not super fond of this being a SIGKILL specific flag but I get why.

Understood. I was thinking that EXPEDITE flag might make sense for
other signals in the future but from internal feedback sounds like if
we go this way the flag name should be more specific.

> > +
> > +             /* Limit expedited killing to privileged users only. */
> > +             if (!capable(CAP_SYS_NICE))
> > +                     return -EPERM;
>
> Do you have a specific (DOS or other) attack vector in mind that renders
> ns_capable unsuitable?
>
> > +     }
> > +
> >       f = fdget_raw(pidfd);
> >       if (!f.file)
> >               return -EBADF;
> > @@ -3614,7 +3635,7 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> >               prepare_kill_siginfo(sig, &kinfo);
> >       }
> >
> > -     ret = kill_pid_info(sig, &kinfo, pid);
> > +     ret = kill_pid_info(sig, &kinfo, pid, (flags & SS_EXPEDITE) != 0);
> >
> >  err:
> >       fdput(f);
> > diff --git a/kernel/time/itimer.c b/kernel/time/itimer.c
> > index 02068b2d5862..c926483cdb53 100644
> > --- a/kernel/time/itimer.c
> > +++ b/kernel/time/itimer.c
> > @@ -140,7 +140,7 @@ enum hrtimer_restart it_real_fn(struct hrtimer *timer)
> >       struct pid *leader_pid = sig->pids[PIDTYPE_TGID];
> >
> >       trace_itimer_expire(ITIMER_REAL, leader_pid, 0);
> > -     kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid);
> > +     kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid, false);
> >
> >       return HRTIMER_NORESTART;
> >  }
> > --
> > 2.21.0.392.gf8f6787159e-goog
> >

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-11 15:18     ` Suren Baghdasaryan
@ 2019-04-11 15:23       ` Suren Baghdasaryan
  2019-04-11 16:25         ` Daniel Colascione
  0 siblings, 1 reply; 43+ messages in thread
From: Suren Baghdasaryan @ 2019-04-11 15:23 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andrew Morton, mhocko, David Rientjes, Matthew Wilcox,
	yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner,
	Tetsuo Handa, ebiederm, Shakeel Butt, Minchan Kim, Tim Murray,
	Daniel Colascione, Joel Fernandes, Jann Horn, linux-mm, lsf-pc,
	LKML, kernel-team, Oleg Nesterov

On Thu, Apr 11, 2019 at 8:18 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Thanks for the feedback!
> Just to be clear, this implementation is used in this RFC as a
> reference to explain the intent. To be honest I don't think it will be
> adopted as is even if the idea survives scrutiny.
>
> On Thu, Apr 11, 2019 at 3:30 AM Christian Brauner <christian@brauner.io> wrote:
> >
> > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > victim process. The usage of this flag is currently limited to SIGKILL
> > > signal and only to privileged users.
> > >
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > >  include/linux/sched/signal.h |  3 ++-
> > >  include/linux/signal.h       | 11 ++++++++++-
> > >  ipc/mqueue.c                 |  2 +-
> > >  kernel/signal.c              | 37 ++++++++++++++++++++++++++++--------
> > >  kernel/time/itimer.c         |  2 +-
> > >  5 files changed, 43 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> > > index e412c092c1e8..8a227633a058 100644
> > > --- a/include/linux/sched/signal.h
> > > +++ b/include/linux/sched/signal.h
> > > @@ -327,7 +327,8 @@ extern int send_sig_info(int, struct kernel_siginfo *, struct task_struct *);
> > >  extern void force_sigsegv(int sig, struct task_struct *p);
> > >  extern int force_sig_info(int, struct kernel_siginfo *, struct task_struct *);
> > >  extern int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp);
> > > -extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid);
> > > +extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid,
> > > +                             bool expedite);
> > >  extern int kill_pid_info_as_cred(int, struct kernel_siginfo *, struct pid *,
> > >                               const struct cred *);
> > >  extern int kill_pgrp(struct pid *pid, int sig, int priv);
> > > diff --git a/include/linux/signal.h b/include/linux/signal.h
> > > index 9702016734b1..34b7852aa4a0 100644
> > > --- a/include/linux/signal.h
> > > +++ b/include/linux/signal.h
> > > @@ -446,8 +446,17 @@ int __save_altstack(stack_t __user *, unsigned long);
> > >  } while (0);
> > >
> > >  #ifdef CONFIG_PROC_FS
> > > +
> > > +/*
> > > + * SS_FLAGS values used in pidfd_send_signal:
> > > + *
> > > + * SS_EXPEDITE indicates desire to expedite the operation.
> > > + */
> > > +#define SS_EXPEDITE  0x00000001
> >
> > Does this make sense as an SS_* flag?
> > How does this relate to the signal stack?
>
> It doesn't, so I agree that the name should be changed.
> PIDFD_SIGNAL_EXPEDITE_MM_RECLAIM would seem appropriate.
>
> > Is there any intention to ever use this flag with stack_t?
> >
> > New flags should be PIDFD_SIGNAL_*. (E.g. the thread flag will be
> > PIDFD_SIGNAL_THREAD.)
> > And since this is exposed to userspace in contrast to the mm internal
> > naming it should be something more easily understandable like
> > PIDFD_SIGNAL_MM_RECLAIM{_FASTER} or something.
> >
> > > +
> > >  struct seq_file;
> > >  extern void render_sigset_t(struct seq_file *, const char *, sigset_t *);
> > > -#endif
> > > +
> > > +#endif /* CONFIG_PROC_FS */
> > >
> > >  #endif /* _LINUX_SIGNAL_H */
> > > diff --git a/ipc/mqueue.c b/ipc/mqueue.c
> > > index aea30530c472..27c66296e08e 100644
> > > --- a/ipc/mqueue.c
> > > +++ b/ipc/mqueue.c
> > > @@ -720,7 +720,7 @@ static void __do_notify(struct mqueue_inode_info *info)
> > >                       rcu_read_unlock();
> > >
> > >                       kill_pid_info(info->notify.sigev_signo,
> > > -                                   &sig_i, info->notify_owner);
> > > +                                   &sig_i, info->notify_owner, false);
> > >                       break;
> > >               case SIGEV_THREAD:
> > >                       set_cookie(info->notify_cookie, NOTIFY_WOKENUP);
> > > diff --git a/kernel/signal.c b/kernel/signal.c
> > > index f98448cf2def..02ed4332d17c 100644
> > > --- a/kernel/signal.c
> > > +++ b/kernel/signal.c
> > > @@ -43,6 +43,7 @@
> > >  #include <linux/compiler.h>
> > >  #include <linux/posix-timers.h>
> > >  #include <linux/livepatch.h>
> > > +#include <linux/oom.h>
> > >
> > >  #define CREATE_TRACE_POINTS
> > >  #include <trace/events/signal.h>
> > > @@ -1394,7 +1395,8 @@ int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp)
> > >       return success ? 0 : retval;
> > >  }
> > >
> > > -int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid)
> > > +int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid,
> > > +                               bool expedite)
> > >  {
> > >       int error = -ESRCH;
> > >       struct task_struct *p;
> > > @@ -1402,8 +1404,17 @@ int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid)
> > >       for (;;) {
> > >               rcu_read_lock();
> > >               p = pid_task(pid, PIDTYPE_PID);
> > > -             if (p)
> > > +             if (p) {
> > >                       error = group_send_sig_info(sig, info, p, PIDTYPE_TGID);
> > > +
> > > +                     /*
> > > +                      * Ignore expedite_reclaim return value, it is best
> > > +                      * effort only.
> > > +                      */
> > > +                     if (!error && expedite)
> > > +                             expedite_reclaim(p);
> >
> > SIGKILL will take the whole thread group down so the reclaim should make
> > sense here.
> >
>
> This sounds like confirmation. I hope I'm not missing some flaw that
> you are trying to point out.
>
> > > +             }
> > > +
> > >               rcu_read_unlock();
> > >               if (likely(!p || error != -ESRCH))
> > >                       return error;
> > > @@ -1420,7 +1431,7 @@ static int kill_proc_info(int sig, struct kernel_siginfo *info, pid_t pid)
> > >  {
> > >       int error;
> > >       rcu_read_lock();
> > > -     error = kill_pid_info(sig, info, find_vpid(pid));
> > > +     error = kill_pid_info(sig, info, find_vpid(pid), false);
> > >       rcu_read_unlock();
> > >       return error;
> > >  }
> > > @@ -1487,7 +1498,7 @@ static int kill_something_info(int sig, struct kernel_siginfo *info, pid_t pid)
> > >
> > >       if (pid > 0) {
> > >               rcu_read_lock();
> > > -             ret = kill_pid_info(sig, info, find_vpid(pid));
> > > +             ret = kill_pid_info(sig, info, find_vpid(pid), false);
> > >               rcu_read_unlock();
> > >               return ret;
> > >       }
> > > @@ -1704,7 +1715,7 @@ EXPORT_SYMBOL(kill_pgrp);
> > >
> > >  int kill_pid(struct pid *pid, int sig, int priv)
> > >  {
> > > -     return kill_pid_info(sig, __si_special(priv), pid);
> > > +     return kill_pid_info(sig, __si_special(priv), pid, false);
> > >  }
> > >  EXPORT_SYMBOL(kill_pid);
> > >
> > > @@ -3577,10 +3588,20 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> > >       struct pid *pid;
> > >       kernel_siginfo_t kinfo;
> > >
> > > -     /* Enforce flags be set to 0 until we add an extension. */
> > > -     if (flags)
> > > +     /* Enforce no unknown flags. */
> > > +     if (flags & ~SS_EXPEDITE)
> > >               return -EINVAL;
> > >
> > > +     if (flags & SS_EXPEDITE) {
> > > +             /* Enforce SS_EXPEDITE to be used with SIGKILL only. */
> > > +             if (sig != SIGKILL)
> > > +                     return -EINVAL;
> >
> > Not super fond of this being a SIGKILL specific flag but I get why.
>
> Understood. I was thinking that EXPEDITE flag might make sense for
> other signals in the future but from internal feedback sounds like if
> we go this way the flag name should be more specific.
>
> > > +
> > > +             /* Limit expedited killing to privileged users only. */
> > > +             if (!capable(CAP_SYS_NICE))
> > > +                     return -EPERM;
> >
> > Do you have a specific (DOS or other) attack vector in mind that renders
> > ns_capable unsuitable?
> >

Missed this one. I was thinking of oom-reaper thread as a limited
system resource (one thread which maintains a kill list and reaps
process mms one at a time) and therefore should be protected from
abuse.

> > > +     }
> > > +
> > >       f = fdget_raw(pidfd);
> > >       if (!f.file)
> > >               return -EBADF;
> > > @@ -3614,7 +3635,7 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig,
> > >               prepare_kill_siginfo(sig, &kinfo);
> > >       }
> > >
> > > -     ret = kill_pid_info(sig, &kinfo, pid);
> > > +     ret = kill_pid_info(sig, &kinfo, pid, (flags & SS_EXPEDITE) != 0);
> > >
> > >  err:
> > >       fdput(f);
> > > diff --git a/kernel/time/itimer.c b/kernel/time/itimer.c
> > > index 02068b2d5862..c926483cdb53 100644
> > > --- a/kernel/time/itimer.c
> > > +++ b/kernel/time/itimer.c
> > > @@ -140,7 +140,7 @@ enum hrtimer_restart it_real_fn(struct hrtimer *timer)
> > >       struct pid *leader_pid = sig->pids[PIDTYPE_TGID];
> > >
> > >       trace_itimer_expire(ITIMER_REAL, leader_pid, 0);
> > > -     kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid);
> > > +     kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid, false);
> > >
> > >       return HRTIMER_NORESTART;
> > >  }
> > > --
> > > 2.21.0.392.gf8f6787159e-goog
> > >

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-11  1:43 ` [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing Suren Baghdasaryan
  2019-04-11 10:30   ` Christian Brauner
@ 2019-04-11 15:33   ` Matthew Wilcox
  2019-04-11 17:05     ` Johannes Weiner
                       ` (2 more replies)
  1 sibling, 3 replies; 43+ messages in thread
From: Matthew Wilcox @ 2019-04-11 15:33 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, mhocko, rientjes, yuzhoujian, jrdr.linux, guro, hannes,
	penguin-kernel, ebiederm, shakeelb, christian, minchan,
	timmurray, dancol, joel, jannh, linux-mm, lsf-pc, linux-kernel,
	kernel-team

On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> pidfd_send_signal() syscall to allow expedited memory reclaim of the
> victim process. The usage of this flag is currently limited to SIGKILL
> signal and only to privileged users.

What is the downside of doing expedited memory reclaim?  ie why not do it
every time a process is going to die?


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 0/2] opportunistic memory reclaim of a killed process
  2019-04-11 10:51 ` [RFC 0/2] opportunistic memory reclaim of a killed process Michal Hocko
@ 2019-04-11 16:18   ` Joel Fernandes
  2019-04-11 18:12     ` Michal Hocko
  2019-04-11 16:20   ` Sandeep Patil
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 43+ messages in thread
From: Joel Fernandes @ 2019-04-11 16:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Suren Baghdasaryan, Andrew Morton, David Rientjes,
	Matthew Wilcox, yuzhoujian, jrdr.linux, guro, Johannes Weiner,
	penguin-kernel, ebiederm, shakeelb, Christian Brauner,
	Minchan Kim, Tim Murray, Daniel Colascione,
	Joel Fernandes (Google),
	Jann Horn, open list:MEMORY MANAGEMENT, lsf-pc, LKML,
	Cc: Android Kernel

On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> [...]
> > Proposed solution uses existing oom-reaper thread to increase memory
> > reclaim rate of a killed process and to make this rate more deterministic.
> > By no means the proposed solution is considered the best and was chosen
> > because it was simple to implement and allowed for test data collection.
> > The downside of this solution is that it requires additional “expedite”
> > hint for something which has to be fast in all cases. Would be great to
> > find a way that does not require additional hints.
>
> I have to say I do not like this much. It is abusing an implementation
> detail of the OOM implementation and makes it an official API. Also
> there are some non trivial assumptions to be fullfilled to use the
> current oom_reaper. First of all all the process groups that share the
> address space have to be killed. How do you want to guarantee/implement
> that with a simply kill to a thread/process group?

Will task_will_free_mem() not bail out in such cases because of
process_shares_mm() returning true? AFAIU, Suren's patch calls that.
Also, if I understand correctly, this patch is opportunistic and knows
what it may not be possible to reap in advance this way in all cases.
        /*
         * Make sure that all tasks which share the mm with the given tasks
         * are dying as well to make sure that a) nobody pins its mm and
         * b) the task is also reapable by the oom reaper.
         */
        rcu_read_lock();
        for_each_process(p) {
                if (!process_shares_mm(p, mm))

> > Other possible approaches include:
> > - Implementing a dedicated syscall to perform opportunistic reclaim in the
> > context of the process waiting for the victim’s death. A natural boost
> > bonus occurs if the waiting process has high or RT priority and is not
> > limited by cpuset cgroup in its CPU choices.
> > - Implement a mechanism that would perform opportunistic reclaim if it’s
> > possible unconditionally (similar to checks in task_will_free_mem()).
> > - Implement opportunistic reclaim that uses shrinker interface, PSI or
> > other memory pressure indications as a hint to engage.
>
> I would question whether we really need this at all? Relying on the exit
> speed sounds like a fundamental design problem of anything that relies
> on it. Sure task exit might be slow, but async mm tear down is just a
> mere optimization this is not guaranteed to really help in speading
> things up. OOM killer uses it as a guarantee for a forward progress in a
> finite time rather than as soon as possible.

Per the data collected by Suren, it does speed things up. It would be
nice if we can reuse this mechanism, or come up with a similar
mechanism.

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 0/2] opportunistic memory reclaim of a killed process
  2019-04-11 10:51 ` [RFC 0/2] opportunistic memory reclaim of a killed process Michal Hocko
  2019-04-11 16:18   ` Joel Fernandes
@ 2019-04-11 16:20   ` Sandeep Patil
  2019-04-11 16:47   ` Suren Baghdasaryan
  2019-04-11 17:19   ` Johannes Weiner
  3 siblings, 0 replies; 43+ messages in thread
From: Sandeep Patil @ 2019-04-11 16:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Suren Baghdasaryan, akpm, rientjes, willy, yuzhoujian,
	jrdr.linux, guro, hannes, penguin-kernel, ebiederm, shakeelb,
	christian, minchan, timmurray, dancol, joel, jannh, linux-mm,
	lsf-pc, linux-kernel, kernel-team

On Thu, Apr 11, 2019 at 12:51:11PM +0200, Michal Hocko wrote:
> On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> [...]
> > Proposed solution uses existing oom-reaper thread to increase memory
> > reclaim rate of a killed process and to make this rate more deterministic.
> > By no means the proposed solution is considered the best and was chosen
> > because it was simple to implement and allowed for test data collection.
> > The downside of this solution is that it requires additional “expedite”
> > hint for something which has to be fast in all cases. Would be great to
> > find a way that does not require additional hints.
> 
> I have to say I do not like this much. It is abusing an implementation
> detail of the OOM implementation and makes it an official API. Also
> there are some non trivial assumptions to be fullfilled to use the
> current oom_reaper. First of all all the process groups that share the
> address space have to be killed. How do you want to guarantee/implement
> that with a simply kill to a thread/process group?
> 
> > Other possible approaches include:
> > - Implementing a dedicated syscall to perform opportunistic reclaim in the
> > context of the process waiting for the victim’s death. A natural boost
> > bonus occurs if the waiting process has high or RT priority and is not
> > limited by cpuset cgroup in its CPU choices.
> > - Implement a mechanism that would perform opportunistic reclaim if it’s
> > possible unconditionally (similar to checks in task_will_free_mem()).
> > - Implement opportunistic reclaim that uses shrinker interface, PSI or
> > other memory pressure indications as a hint to engage.
> 
> I would question whether we really need this at all? Relying on the exit
> speed sounds like a fundamental design problem of anything that relies
> on it.

OTOH, we want to keep as many processes around as possible for recency. In which
case, the exit path (particularly the memory reclaim) becomes critical to
maintain interactivity for phones.

Android keeps processes around because cold starting applications is much
slower than simply bringing them up from background. This obviously presents
the problem when a background application _is_ killed, it is almost always to
address sudden spike in memory needs by something else much more important
and user visible. e.g. a foreground application or critical system process.

> Sure task exit might be slow, but async mm tear down is just a
> mere optimization this is not guaranteed to really help in speading
> things up. OOM killer uses it as a guarantee for a forward progress in a
> finite time rather than as soon as possible.

With OOM killer, things are already really bad. When lmkd[1] kills processes,
it is doing so to serve the immediate needs of the system while trying to
avoid the OOM killer.


- ssp

1] https://android.googlesource.com/platform/system/core/+/refs/heads/master/lmkd/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-11 15:23       ` Suren Baghdasaryan
@ 2019-04-11 16:25         ` Daniel Colascione
  0 siblings, 0 replies; 43+ messages in thread
From: Daniel Colascione @ 2019-04-11 16:25 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Christian Brauner, Andrew Morton, Michal Hocko, David Rientjes,
	Matthew Wilcox, yuzhoujian, Souptick Joarder, Roman Gushchin,
	Johannes Weiner, Tetsuo Handa, Eric W. Biederman, Shakeel Butt,
	Minchan Kim, Tim Murray, Daniel Colascione, Joel Fernandes,
	Jann Horn, linux-mm, lsf-pc, LKML, kernel-team, Oleg Nesterov

On Thu, Apr 11, 2019 at 8:23 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > > victim process. The usage of this flag is currently limited to SIGKILL
> > > > signal and only to privileged users.

FWIW, I like Suren's general idea, but I was thinking of a different
way of exposing the same general functionality to userspace. The way I
look at it, it's very useful for an auto-balancing memory system like
Android (or, I presume, something that uses oomd) to recover memory
*immediately* after a SIGKILL instead of waiting for the process to
kill itself: a process's death can be delayed for a long time due to
factors like scheduling and being blocked in various uninterruptible
kernel-side operations. Suren's proposal is basically about pulling
forward in time page reclaimation that would happen anyway.

What if we let userspace control exactly when this reclaimation
happens? I'm imagining a new* kernel facility that basically looks
like this. It lets lmkd determine for itself how much work the system
should expend on reclaiming memory from dying processes.

size_t try_reap_dying_process(
  int pidfd,
  int flags /* must be zero */,
  size_t maximum_bytes_to_reap);

Precondition: process is pending group-exit (someone already sent it SIGKILL)
Postcondition: some memory reclaimed from dying process
Invariant: doesn't sleep; stops reaping after MAXIMUM_BYTES_TO_REAP

-> success: return number of bytes reaped
-> failure: (size_t)-1

EBUSY: couldn't get mmap_sem
EINVAL: PIDFD isn't a pidfd or otherwise invalid arguments
EPERM: process hasn't been send SIGKILL: try_reap_dying_process on a
process that isn't dying is illegal

Kernel-side, try_reap_dying_process would try-acquire mmap_sem and
just fail if it couldn't get it. Once acquired, it would release
"easy" pages (using the same check the oom reaper uses) until it
either ran out of pages or hit the MAXIMUM_BYTES_TO_REAP cap. The
purpose of MAXIMUM_BYTES_TO_REAP is to let userspace bound-above the
amount of time we spend reclaiming pages. It'd be up to userspace to
set policy on retries, the actual value of the reap cap, the priority
at which we run TRY_REAP_DYING_PROCESS, and so on. We return the
number of bytes we managed to free this way so that lmkd can make an
informed decision about what to do next, e.g., kill something else or
wait a little while.

Personally, I like th approach a bit more that recruiting the oom
reaper through because it doesn't affect any kind of  emergency memory
reserve permission and because it frees us from having to think about
whether the oom reaper's thread priority is right for this particular
job.

It also occurred to me that try_reap_dying_process might make a decent
shrinker callback. Shrinkers are there, AIUI, to reclaim memory that's
easy to free and that's not essential for correct kernel operation.
Usually, it's some kind of cache that meets these criteria. But the
private pages of a dying process also meet the criteria, don't they?
I'm imagining the shrinker just picking an arbitrary doomed (dying but
not yet dead) process and freeing some of its pages. I know there are
concerns about slow shrinkers causing havoc throughout the system, but
since this shrinker would be bounded above on CPU time and would never
block, I feel like it'd be pretty safe.

* insert standard missive about system calls being cheap, but we can
talk about the way in which we expose this functionality after we
agree that it's a good idea generally

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 0/2] opportunistic memory reclaim of a killed process
  2019-04-11 10:51 ` [RFC 0/2] opportunistic memory reclaim of a killed process Michal Hocko
  2019-04-11 16:18   ` Joel Fernandes
  2019-04-11 16:20   ` Sandeep Patil
@ 2019-04-11 16:47   ` Suren Baghdasaryan
  2019-04-11 18:19     ` Michal Hocko
  2019-04-11 17:19   ` Johannes Weiner
  3 siblings, 1 reply; 43+ messages in thread
From: Suren Baghdasaryan @ 2019-04-11 16:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Rientjes, Matthew Wilcox, yuzhoujian,
	Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa,
	ebiederm, Shakeel Butt, Christian Brauner, Minchan Kim,
	Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn,
	linux-mm, lsf-pc, LKML, kernel-team

Thanks for the feedback!

On Thu, Apr 11, 2019 at 3:51 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> [...]
> > Proposed solution uses existing oom-reaper thread to increase memory
> > reclaim rate of a killed process and to make this rate more deterministic.
> > By no means the proposed solution is considered the best and was chosen
> > because it was simple to implement and allowed for test data collection.
> > The downside of this solution is that it requires additional “expedite”
> > hint for something which has to be fast in all cases. Would be great to
> > find a way that does not require additional hints.
>
> I have to say I do not like this much. It is abusing an implementation
> detail of the OOM implementation and makes it an official API.

I agree with you that this particular implementation is abusing oom
internal machinery and I don't think it is acceptable as is (hence
this is sent as an RFC). I would like to discuss the viability of the
idea of reaping kill victim's mm asynchronously. If we agree that this
is worth our time, only then I would love to get into more details on
how to implement it. The implementation in this RFC is a convenient
way to illustrate the idea and to collect test data.

> Also
> there are some non trivial assumptions to be fullfilled to use the
> current oom_reaper. First of all all the process groups that share the
> address space have to be killed. How do you want to guarantee/implement
> that with a simply kill to a thread/process group?
>

I'm not sure I understood this correctly but if you are asking how do
we know that the mm we are reaping is not shared with processes that
are not being killed then I think your task_will_free_mem() checks for
that. Or have I misunderstood your question?

> > Other possible approaches include:
> > - Implementing a dedicated syscall to perform opportunistic reclaim in the
> > context of the process waiting for the victim’s death. A natural boost
> > bonus occurs if the waiting process has high or RT priority and is not
> > limited by cpuset cgroup in its CPU choices.
> > - Implement a mechanism that would perform opportunistic reclaim if it’s
> > possible unconditionally (similar to checks in task_will_free_mem()).
> > - Implement opportunistic reclaim that uses shrinker interface, PSI or
> > other memory pressure indications as a hint to engage.
>
> I would question whether we really need this at all? Relying on the exit
> speed sounds like a fundamental design problem of anything that relies
> on it.

Relying on it is wrong, I agree. There are protections like allocation
throttling that we can fall back to stop memory depletion. However
having a way to free up resources that are not needed by a dying
process quickly would help to avoid throttling which hurts user
experience.
I agree that this is an optimization which is beneficial in a specific
case - when we kill to free up resources. However this is an important
optimization for systems with low memory resources like embedded
systems, phones, etc. The only way to prevent being cornered into
throttling is to increase the free memory margin that system needs to
maintain (I describe this in my cover letter). And with limited
overall memory resources memory space is at a premium, so we try to
decrease that margin.
I think the other and arguably even more important issue than the
speed of memory reclaim is that this speed depends on what the victim
is doing at the time of a kill. This introduces non-determinism in how
fast we can free up resource and at this point we don't even know how
much safety margin we need.

> Sure task exit might be slow, but async mm tear down is just a
> mere optimization this is not guaranteed to really help in speading
> things up. OOM killer uses it as a guarantee for a forward progress in a
> finite time rather than as soon as possible.
>
> --
> Michal Hocko
> SUSE Labs
>
> --
> You received this message because you are subscribed to the Google Groups "kernel-team" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [Lsf-pc] [RFC 0/2] opportunistic memory reclaim of a killed process
  2019-04-11 12:16   ` Michal Hocko
@ 2019-04-11 16:54     ` Suren Baghdasaryan
  0 siblings, 0 replies; 43+ messages in thread
From: Suren Baghdasaryan @ 2019-04-11 16:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rik van Riel, Suren Baghdasaryan, Andrew Morton,
	Daniel Colascione, Jann Horn, Minchan Kim, Tetsuo Handa,
	kernel-team, David Rientjes, LKML, Matthew Wilcox, linux-mm,
	Johannes Weiner, Shakeel Butt, Souptick Joarder, yuzhoujian,
	Joel Fernandes, Tim Murray, lsf-pc, Roman Gushchin,
	Christian Brauner, ebiederm

On Thu, Apr 11, 2019 at 5:16 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 11-04-19 07:51:21, Rik van Riel wrote:
> > On Wed, 2019-04-10 at 18:43 -0700, Suren Baghdasaryan via Lsf-pc wrote:
> > > The time to kill a process and free its memory can be critical when
> > > the
> > > killing was done to prevent memory shortages affecting system
> > > responsiveness.
> >
> > The OOM killer is fickle, and often takes a fairly
> > long time to trigger. Speeding up what happens after
> > that seems like the wrong thing to optimize.
> >
> > Have you considered using something like oomd to
> > proactively kill tasks when memory gets low, so
> > you do not have to wait for an OOM kill?
>
> AFAIU, this is the point here. They probably have a user space OOM
> killer implementation and want to achieve killing to be as swift as
> possible.

That is correct. Android has a userspace daemon called lmkd (low
memory killer daemon) to respond to memory pressure before things get
bad enough for kernel oom-killer to get involved. So this asynchronous
reclaim optimization would allow lmkd do its job more efficiently.

> --
> Michal Hocko
> SUSE Labs
>
> --
> You received this message because you are subscribed to the Google Groups "kernel-team" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-11 15:33   ` Matthew Wilcox
@ 2019-04-11 17:05     ` Johannes Weiner
  2019-04-11 17:09     ` Suren Baghdasaryan
  2019-04-12  6:53     ` Michal Hocko
  2 siblings, 0 replies; 43+ messages in thread
From: Johannes Weiner @ 2019-04-11 17:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Suren Baghdasaryan, akpm, mhocko, rientjes, yuzhoujian,
	jrdr.linux, guro, penguin-kernel, ebiederm, shakeelb, christian,
	minchan, timmurray, dancol, joel, jannh, linux-mm, lsf-pc,
	linux-kernel, kernel-team

On Thu, Apr 11, 2019 at 08:33:13AM -0700, Matthew Wilcox wrote:
> On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > victim process. The usage of this flag is currently limited to SIGKILL
> > signal and only to privileged users.
> 
> What is the downside of doing expedited memory reclaim?  ie why not do it
> every time a process is going to die?

I agree with this. The oom reaper mostly does work the exiting task
would have to do anyway, so this shouldn't be measurably more
expensive to do it per default. I wouldn't want to add a new interface
unless we really do need that type of control.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-11 15:33   ` Matthew Wilcox
  2019-04-11 17:05     ` Johannes Weiner
@ 2019-04-11 17:09     ` Suren Baghdasaryan
  2019-04-11 17:33       ` Daniel Colascione
  2019-04-11 21:45       ` Roman Gushchin
  2019-04-12  6:53     ` Michal Hocko
  2 siblings, 2 replies; 43+ messages in thread
From: Suren Baghdasaryan @ 2019-04-11 17:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, mhocko, David Rientjes, yuzhoujian,
	Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa,
	ebiederm, Shakeel Butt, Christian Brauner, Minchan Kim,
	Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn,
	linux-mm, lsf-pc, LKML, kernel-team

On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > victim process. The usage of this flag is currently limited to SIGKILL
> > signal and only to privileged users.
>
> What is the downside of doing expedited memory reclaim?  ie why not do it
> every time a process is going to die?

I think with an implementation that does not use/abuse oom-reaper
thread this could be done for any kill. As I mentioned oom-reaper is a
limited resource which has access to memory reserves and should not be
abused in the way I do in this reference implementation.
While there might be downsides that I don't know of, I'm not sure it's
required to hurry every kill's memory reclaim. I think there are cases
when resource deallocation is critical, for example when we kill to
relieve resource shortage and there are kills when reclaim speed is
not essential. It would be great if we can identify urgent cases
without userspace hints, so I'm open to suggestions that do not
involve additional flags.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 0/2] opportunistic memory reclaim of a killed process
  2019-04-11 10:51 ` [RFC 0/2] opportunistic memory reclaim of a killed process Michal Hocko
                     ` (2 preceding siblings ...)
  2019-04-11 16:47   ` Suren Baghdasaryan
@ 2019-04-11 17:19   ` Johannes Weiner
  3 siblings, 0 replies; 43+ messages in thread
From: Johannes Weiner @ 2019-04-11 17:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Suren Baghdasaryan, akpm, rientjes, willy, yuzhoujian,
	jrdr.linux, guro, penguin-kernel, ebiederm, shakeelb, christian,
	minchan, timmurray, dancol, joel, jannh, linux-mm, lsf-pc,
	linux-kernel, kernel-team

On Thu, Apr 11, 2019 at 12:51:11PM +0200, Michal Hocko wrote:
> I would question whether we really need this at all? Relying on the exit
> speed sounds like a fundamental design problem of anything that relies
> on it. Sure task exit might be slow, but async mm tear down is just a
> mere optimization this is not guaranteed to really help in speading
> things up. OOM killer uses it as a guarantee for a forward progress in a
> finite time rather than as soon as possible.

I don't think it's flawed, it's just optimizing the user experience as
best as it can. You don't want to kill things prematurely, but once
there is pressure you want to rectify it quickly. That's valid.

We have a tool that does this, side effect or not, so I think it's
fair to try to make use of it when oom killing from userspace (which
we explictily support with oom_control in cgroup1 and memory.high in
cgroup2, and it's not just an Android thing).

The question is how explicit a contract we want to make with
userspace, and I would much prefer to not overpromise on a best-effort
thing like this, or even making the oom reaper ABI.

If unconditionally reaping killed tasks is too expensive, I'd much
prefer a simple kill hint over an explicit task reclaim interface.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-11 17:09     ` Suren Baghdasaryan
@ 2019-04-11 17:33       ` Daniel Colascione
  2019-04-11 17:36         ` Matthew Wilcox
  2019-04-11 21:45       ` Roman Gushchin
  1 sibling, 1 reply; 43+ messages in thread
From: Daniel Colascione @ 2019-04-11 17:33 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Matthew Wilcox, Andrew Morton, Michal Hocko, David Rientjes,
	yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner,
	Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Christian Brauner,
	Minchan Kim, Tim Murray, Daniel Colascione, Joel Fernandes,
	Jann Horn, linux-mm, lsf-pc, LKML, kernel-team

On Thu, Apr 11, 2019 at 10:09 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > victim process. The usage of this flag is currently limited to SIGKILL
> > > signal and only to privileged users.
> >
> > What is the downside of doing expedited memory reclaim?  ie why not do it
> > every time a process is going to die?
>
> I think with an implementation that does not use/abuse oom-reaper
> thread this could be done for any kill. As I mentioned oom-reaper is a
> limited resource which has access to memory reserves and should not be
> abused in the way I do in this reference implementation.
> While there might be downsides that I don't know of, I'm not sure it's
> required to hurry every kill's memory reclaim. I think there are cases
> when resource deallocation is critical, for example when we kill to
> relieve resource shortage and there are kills when reclaim speed is
> not essential. It would be great if we can identify urgent cases
> without userspace hints, so I'm open to suggestions that do not
> involve additional flags.

I was imagining a PI-ish approach where we'd reap in case an RT
process was waiting on the death of some other process. I'd still
prefer the API I proposed in the other message because it gets the
kernel out of the business of deciding what the right signal is. I'm a
huge believer in "mechanism, not policy".

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-11 17:33       ` Daniel Colascione
@ 2019-04-11 17:36         ` Matthew Wilcox
  2019-04-11 17:47           ` Daniel Colascione
  2019-04-11 17:52           ` Suren Baghdasaryan
  0 siblings, 2 replies; 43+ messages in thread
From: Matthew Wilcox @ 2019-04-11 17:36 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Suren Baghdasaryan, Andrew Morton, Michal Hocko, David Rientjes,
	yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner,
	Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Christian Brauner,
	Minchan Kim, Tim Murray, Joel Fernandes, Jann Horn, linux-mm,
	lsf-pc, LKML, kernel-team

On Thu, Apr 11, 2019 at 10:33:32AM -0700, Daniel Colascione wrote:
> On Thu, Apr 11, 2019 at 10:09 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > > victim process. The usage of this flag is currently limited to SIGKILL
> > > > signal and only to privileged users.
> > >
> > > What is the downside of doing expedited memory reclaim?  ie why not do it
> > > every time a process is going to die?
> >
> > I think with an implementation that does not use/abuse oom-reaper
> > thread this could be done for any kill. As I mentioned oom-reaper is a
> > limited resource which has access to memory reserves and should not be
> > abused in the way I do in this reference implementation.
> > While there might be downsides that I don't know of, I'm not sure it's
> > required to hurry every kill's memory reclaim. I think there are cases
> > when resource deallocation is critical, for example when we kill to
> > relieve resource shortage and there are kills when reclaim speed is
> > not essential. It would be great if we can identify urgent cases
> > without userspace hints, so I'm open to suggestions that do not
> > involve additional flags.
> 
> I was imagining a PI-ish approach where we'd reap in case an RT
> process was waiting on the death of some other process. I'd still
> prefer the API I proposed in the other message because it gets the
> kernel out of the business of deciding what the right signal is. I'm a
> huge believer in "mechanism, not policy".

It's not a question of the kernel deciding what the right signal is.
The kernel knows whether a signal is fatal to a particular process or not.
The question is whether the killing process should do the work of reaping
the dying process's resources sometimes, always or never.  Currently,
that is never (the process reaps its own resources); Suren is suggesting
sometimes, and I'm asking "Why not always?"

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-11 17:36         ` Matthew Wilcox
@ 2019-04-11 17:47           ` Daniel Colascione
  2019-04-12  6:49             ` Michal Hocko
  2019-04-12 21:03             ` Matthew Wilcox
  2019-04-11 17:52           ` Suren Baghdasaryan
  1 sibling, 2 replies; 43+ messages in thread
From: Daniel Colascione @ 2019-04-11 17:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Suren Baghdasaryan, Andrew Morton, Michal Hocko, David Rientjes,
	yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner,
	Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Christian Brauner,
	Minchan Kim, Tim Murray, Joel Fernandes, Jann Horn, linux-mm,
	lsf-pc, LKML, kernel-team

On Thu, Apr 11, 2019 at 10:36 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Apr 11, 2019 at 10:33:32AM -0700, Daniel Colascione wrote:
> > On Thu, Apr 11, 2019 at 10:09 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > > > victim process. The usage of this flag is currently limited to SIGKILL
> > > > > signal and only to privileged users.
> > > >
> > > > What is the downside of doing expedited memory reclaim?  ie why not do it
> > > > every time a process is going to die?
> > >
> > > I think with an implementation that does not use/abuse oom-reaper
> > > thread this could be done for any kill. As I mentioned oom-reaper is a
> > > limited resource which has access to memory reserves and should not be
> > > abused in the way I do in this reference implementation.
> > > While there might be downsides that I don't know of, I'm not sure it's
> > > required to hurry every kill's memory reclaim. I think there are cases
> > > when resource deallocation is critical, for example when we kill to
> > > relieve resource shortage and there are kills when reclaim speed is
> > > not essential. It would be great if we can identify urgent cases
> > > without userspace hints, so I'm open to suggestions that do not
> > > involve additional flags.
> >
> > I was imagining a PI-ish approach where we'd reap in case an RT
> > process was waiting on the death of some other process. I'd still
> > prefer the API I proposed in the other message because it gets the
> > kernel out of the business of deciding what the right signal is. I'm a
> > huge believer in "mechanism, not policy".
>
> It's not a question of the kernel deciding what the right signal is.
> The kernel knows whether a signal is fatal to a particular process or not.
> The question is whether the killing process should do the work of reaping
> the dying process's resources sometimes, always or never.  Currently,
> that is never (the process reaps its own resources); Suren is suggesting
> sometimes, and I'm asking "Why not always?"

FWIW, Suren's initial proposal is that the oom_reaper kthread do the
reaping, not the process sending the kill. Are you suggesting that
sending SIGKILL should spend a while in signal delivery reaping pages
before returning? I thought about just doing it this way, but I didn't
like the idea: it'd slow down mass-killing programs like killall(1).
Programs expect sending SIGKILL to be a fast operation that returns
immediately.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-11 17:36         ` Matthew Wilcox
  2019-04-11 17:47           ` Daniel Colascione
@ 2019-04-11 17:52           ` Suren Baghdasaryan
  1 sibling, 0 replies; 43+ messages in thread
From: Suren Baghdasaryan @ 2019-04-11 17:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Daniel Colascione, Andrew Morton, Michal Hocko, David Rientjes,
	yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner,
	Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Christian Brauner,
	Minchan Kim, Tim Murray, Joel Fernandes, Jann Horn, linux-mm,
	lsf-pc, LKML, kernel-team

On Thu, Apr 11, 2019 at 10:36 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Apr 11, 2019 at 10:33:32AM -0700, Daniel Colascione wrote:
> > On Thu, Apr 11, 2019 at 10:09 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > > > victim process. The usage of this flag is currently limited to SIGKILL
> > > > > signal and only to privileged users.
> > > >
> > > > What is the downside of doing expedited memory reclaim?  ie why not do it
> > > > every time a process is going to die?
> > >
> > > I think with an implementation that does not use/abuse oom-reaper
> > > thread this could be done for any kill. As I mentioned oom-reaper is a
> > > limited resource which has access to memory reserves and should not be
> > > abused in the way I do in this reference implementation.
> > > While there might be downsides that I don't know of, I'm not sure it's
> > > required to hurry every kill's memory reclaim. I think there are cases
> > > when resource deallocation is critical, for example when we kill to
> > > relieve resource shortage and there are kills when reclaim speed is
> > > not essential. It would be great if we can identify urgent cases
> > > without userspace hints, so I'm open to suggestions that do not
> > > involve additional flags.
> >
> > I was imagining a PI-ish approach where we'd reap in case an RT
> > process was waiting on the death of some other process. I'd still
> > prefer the API I proposed in the other message because it gets the
> > kernel out of the business of deciding what the right signal is. I'm a
> > huge believer in "mechanism, not policy".
>
> It's not a question of the kernel deciding what the right signal is.
> The kernel knows whether a signal is fatal to a particular process or not.
> The question is whether the killing process should do the work of reaping
> the dying process's resources sometimes, always or never.  Currently,
> that is never (the process reaps its own resources); Suren is suggesting
> sometimes, and I'm asking "Why not always?"

If there are no downsides of doing this always (like using some
resources that can be utilized in a better way) then by all means,
let's do it unconditionally. My current implementation is not one of
such cases :)

I think with implementation when killing process is doing the reaping
of the victim's mm this can be done unconditionally because we don't
use resources which might otherwise be used in a better way. Overall I
like Daniel's idea of the process that requested killing and is
waiting for the victim to die helping in reaping its memory. It kind
of naturally elevates the priority of the reaping if the priority of
the waiting process is higher than victim's priority (a kind of
priority inheritance).

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 0/2] opportunistic memory reclaim of a killed process
  2019-04-11 16:18   ` Joel Fernandes
@ 2019-04-11 18:12     ` Michal Hocko
  2019-04-11 19:14       ` Joel Fernandes
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2019-04-11 18:12 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Suren Baghdasaryan, Andrew Morton, David Rientjes,
	Matthew Wilcox, yuzhoujian, jrdr.linux, guro, Johannes Weiner,
	penguin-kernel, ebiederm, shakeelb, Christian Brauner,
	Minchan Kim, Tim Murray, Daniel Colascione,
	Joel Fernandes (Google),
	Jann Horn, open list:MEMORY MANAGEMENT, lsf-pc, LKML,
	Cc: Android Kernel

On Thu 11-04-19 12:18:33, Joel Fernandes wrote:
> On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> > [...]
> > > Proposed solution uses existing oom-reaper thread to increase memory
> > > reclaim rate of a killed process and to make this rate more deterministic.
> > > By no means the proposed solution is considered the best and was chosen
> > > because it was simple to implement and allowed for test data collection.
> > > The downside of this solution is that it requires additional “expedite”
> > > hint for something which has to be fast in all cases. Would be great to
> > > find a way that does not require additional hints.
> >
> > I have to say I do not like this much. It is abusing an implementation
> > detail of the OOM implementation and makes it an official API. Also
> > there are some non trivial assumptions to be fullfilled to use the
> > current oom_reaper. First of all all the process groups that share the
> > address space have to be killed. How do you want to guarantee/implement
> > that with a simply kill to a thread/process group?
> 
> Will task_will_free_mem() not bail out in such cases because of
> process_shares_mm() returning true?

I am not really sure I understand your question. task_will_free_mem is
just a shortcut to not kill anything if the current process or a victim
is already dying and likely to free memory without killing or spamming
the log. My concern is that this patch allows to invoke the reaper
without guaranteeing the same. So it can only be an optimistic attempt
and then I am wondering how reasonable of an interface this really is.
Userspace send the signal and has no way to find out whether the async
reaping has been scheduled or not.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 0/2] opportunistic memory reclaim of a killed process
  2019-04-11 16:47   ` Suren Baghdasaryan
@ 2019-04-11 18:19     ` Michal Hocko
  2019-04-11 19:56       ` Suren Baghdasaryan
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2019-04-11 18:19 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, David Rientjes, Matthew Wilcox, yuzhoujian,
	Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa,
	ebiederm, Shakeel Butt, Christian Brauner, Minchan Kim,
	Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn,
	linux-mm, lsf-pc, LKML, kernel-team

On Thu 11-04-19 09:47:31, Suren Baghdasaryan wrote:
[...]
> > I would question whether we really need this at all? Relying on the exit
> > speed sounds like a fundamental design problem of anything that relies
> > on it.
> 
> Relying on it is wrong, I agree. There are protections like allocation
> throttling that we can fall back to stop memory depletion. However
> having a way to free up resources that are not needed by a dying
> process quickly would help to avoid throttling which hurts user
> experience.

I am not opposing speeding up the exit time in general. That is a good
thing. Especially for a very large processes (e.g. a DB). But I do not
really think we want to expose an API to control this specific aspect.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 0/2] opportunistic memory reclaim of a killed process
  2019-04-11 18:12     ` Michal Hocko
@ 2019-04-11 19:14       ` Joel Fernandes
  2019-04-11 20:11         ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Joel Fernandes @ 2019-04-11 19:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Suren Baghdasaryan, Andrew Morton, David Rientjes,
	Matthew Wilcox, yuzhoujian, jrdr.linux, guro, Johannes Weiner,
	penguin-kernel, ebiederm, shakeelb, Christian Brauner,
	Minchan Kim, Tim Murray, Daniel Colascione, Jann Horn,
	open list:MEMORY MANAGEMENT, lsf-pc, LKML, Cc: Android Kernel

On Thu, Apr 11, 2019 at 08:12:43PM +0200, Michal Hocko wrote:
> On Thu 11-04-19 12:18:33, Joel Fernandes wrote:
> > On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> > > [...]
> > > > Proposed solution uses existing oom-reaper thread to increase memory
> > > > reclaim rate of a killed process and to make this rate more deterministic.
> > > > By no means the proposed solution is considered the best and was chosen
> > > > because it was simple to implement and allowed for test data collection.
> > > > The downside of this solution is that it requires additional “expedite”
> > > > hint for something which has to be fast in all cases. Would be great to
> > > > find a way that does not require additional hints.
> > >
> > > I have to say I do not like this much. It is abusing an implementation
> > > detail of the OOM implementation and makes it an official API. Also
> > > there are some non trivial assumptions to be fullfilled to use the
> > > current oom_reaper. First of all all the process groups that share the
> > > address space have to be killed. How do you want to guarantee/implement
> > > that with a simply kill to a thread/process group?
> > 
> > Will task_will_free_mem() not bail out in such cases because of
> > process_shares_mm() returning true?
> 
> I am not really sure I understand your question. task_will_free_mem is
> just a shortcut to not kill anything if the current process or a victim
> is already dying and likely to free memory without killing or spamming
> the log. My concern is that this patch allows to invoke the reaper

Got it.

> without guaranteeing the same. So it can only be an optimistic attempt
> and then I am wondering how reasonable of an interface this really is.
> Userspace send the signal and has no way to find out whether the async
> reaping has been scheduled or not.

Could you clarify more what you're asking to guarantee? I cannot picture it.
If you mean guaranteeing that "a task is dying anyway and will free its
memory on its own", we are calling task_will_free_mem() to check that before
invoking the oom reaper.

Could you clarify what is the draback if OOM reaper is invoked in parallel to
an exiting task which will free its memory soon? It looks like the OOM reaper
is taking all the locks necessary (mmap_sem) in particular and is unmapping
pages. It seemed to me to be safe, but I am missing what are the main draw
backs of this - other than the intereference with core dump. One could be
presumably scalability since the since OOM reaper could be bottlenecked by
freeing memory on behalf of potentially several dying tasks.

IIRC this patch is just Ok with being opportunistic and it need not be hidden
behind an API necessarily or need any guarantees. It is just providing a hint
that the OOM reaper could be woken up to expedite things. If a task is going
to be taking a long time to be scheduled and free its memory, the oom reaper
gives a headstart.  Many of the times, background tasks can be killed but
they may not have necessarily sufficient scheduler priority / cpuset (being
in the background) and may be holding onto a lot of memory that needs to be
reclaimed.

I am not saying this the right way to do it, but I also wanted us to
understand the drawbacks so that we can go back to the drawing board and come
up with something better.

Thanks!

 - Joel





^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 0/2] opportunistic memory reclaim of a killed process
  2019-04-11 18:19     ` Michal Hocko
@ 2019-04-11 19:56       ` Suren Baghdasaryan
  2019-04-11 20:17         ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Suren Baghdasaryan @ 2019-04-11 19:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, David Rientjes, Matthew Wilcox, yuzhoujian,
	Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa,
	Eric W. Biederman, Shakeel Butt, Christian Brauner, Minchan Kim,
	Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn,
	linux-mm, lsf-pc, LKML, kernel-team

On Thu, Apr 11, 2019 at 11:19 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 11-04-19 09:47:31, Suren Baghdasaryan wrote:
> [...]
> > > I would question whether we really need this at all? Relying on the exit
> > > speed sounds like a fundamental design problem of anything that relies
> > > on it.
> >
> > Relying on it is wrong, I agree. There are protections like allocation
> > throttling that we can fall back to stop memory depletion. However
> > having a way to free up resources that are not needed by a dying
> > process quickly would help to avoid throttling which hurts user
> > experience.
>
> I am not opposing speeding up the exit time in general. That is a good
> thing. Especially for a very large processes (e.g. a DB). But I do not
> really think we want to expose an API to control this specific aspect.

Great! Thanks for confirming that the intent is not worthless.
There were a number of ideas floating both internally and in the 2/2
of this patchset. I would like to get some input on which
implementation would be preferable. From your answer sounds like you
think it should be a generic feature, should not require any new APIs
or hints from the userspace and should be conducted for all kills
unconditionally (irrespective of memory pressure, who is waiting for
victim's death, etc.). Do I understand correctly that this would be
the preferred solution?

> --
> Michal Hocko
> SUSE Labs
>
> --
> You received this message because you are subscribed to the Google Groups "kernel-team" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 0/2] opportunistic memory reclaim of a killed process
  2019-04-11 19:14       ` Joel Fernandes
@ 2019-04-11 20:11         ` Michal Hocko
  2019-04-11 21:11           ` Joel Fernandes
  0 siblings, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2019-04-11 20:11 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Suren Baghdasaryan, Andrew Morton, David Rientjes,
	Matthew Wilcox, yuzhoujian, jrdr.linux, guro, Johannes Weiner,
	penguin-kernel, ebiederm, shakeelb, Christian Brauner,
	Minchan Kim, Tim Murray, Daniel Colascione, Jann Horn,
	open list:MEMORY MANAGEMENT, lsf-pc, LKML, Cc: Android Kernel

On Thu 11-04-19 15:14:30, Joel Fernandes wrote:
> On Thu, Apr 11, 2019 at 08:12:43PM +0200, Michal Hocko wrote:
> > On Thu 11-04-19 12:18:33, Joel Fernandes wrote:
> > > On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> > > > [...]
> > > > > Proposed solution uses existing oom-reaper thread to increase memory
> > > > > reclaim rate of a killed process and to make this rate more deterministic.
> > > > > By no means the proposed solution is considered the best and was chosen
> > > > > because it was simple to implement and allowed for test data collection.
> > > > > The downside of this solution is that it requires additional “expedite”
> > > > > hint for something which has to be fast in all cases. Would be great to
> > > > > find a way that does not require additional hints.
> > > >
> > > > I have to say I do not like this much. It is abusing an implementation
> > > > detail of the OOM implementation and makes it an official API. Also
> > > > there are some non trivial assumptions to be fullfilled to use the
> > > > current oom_reaper. First of all all the process groups that share the
> > > > address space have to be killed. How do you want to guarantee/implement
> > > > that with a simply kill to a thread/process group?
> > > 
> > > Will task_will_free_mem() not bail out in such cases because of
> > > process_shares_mm() returning true?
> > 
> > I am not really sure I understand your question. task_will_free_mem is
> > just a shortcut to not kill anything if the current process or a victim
> > is already dying and likely to free memory without killing or spamming
> > the log. My concern is that this patch allows to invoke the reaper
> 
> Got it.
> 
> > without guaranteeing the same. So it can only be an optimistic attempt
> > and then I am wondering how reasonable of an interface this really is.
> > Userspace send the signal and has no way to find out whether the async
> > reaping has been scheduled or not.
> 
> Could you clarify more what you're asking to guarantee? I cannot picture it.
> If you mean guaranteeing that "a task is dying anyway and will free its
> memory on its own", we are calling task_will_free_mem() to check that before
> invoking the oom reaper.

No, I am talking about the API aspect. Say you kall kill with the flag
to make the async address space tear down. Now you cannot really
guarantee that this is safe to do because the target task might
clone(CLONE_VM) at any time. So this will be known only once the signal
is sent, but the calling process has no way to find out. So the caller
has no way to know what is the actual result of the requested operation.
That is a poor API in my book.

> Could you clarify what is the draback if OOM reaper is invoked in parallel to
> an exiting task which will free its memory soon? It looks like the OOM reaper
> is taking all the locks necessary (mmap_sem) in particular and is unmapping
> pages. It seemed to me to be safe, but I am missing what are the main draw
> backs of this - other than the intereference with core dump. One could be
> presumably scalability since the since OOM reaper could be bottlenecked by
> freeing memory on behalf of potentially several dying tasks.

oom_reaper or any other kernel thread doing the same is a mere
implementation detail I think. The oom killer doesn't really need the
oom_reaper to act swiftly because it is there to act as a last resort if
the oom victim cannot terminate on its own. If you want to offer an
user space API then you can assume users will like to use it and expect
a certain behavior but what that is? E.g. what if there are thousands of
tasks killed this way? Do we care that some of them will not get the
async treatment? If yes why do we need an API to control that at all?

Am I more clear now?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 0/2] opportunistic memory reclaim of a killed process
  2019-04-11 19:56       ` Suren Baghdasaryan
@ 2019-04-11 20:17         ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2019-04-11 20:17 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, David Rientjes, Matthew Wilcox, yuzhoujian,
	Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa,
	Eric W. Biederman, Shakeel Butt, Christian Brauner, Minchan Kim,
	Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn,
	linux-mm, lsf-pc, LKML, kernel-team

On Thu 11-04-19 12:56:32, Suren Baghdasaryan wrote:
> On Thu, Apr 11, 2019 at 11:19 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 11-04-19 09:47:31, Suren Baghdasaryan wrote:
> > [...]
> > > > I would question whether we really need this at all? Relying on the exit
> > > > speed sounds like a fundamental design problem of anything that relies
> > > > on it.
> > >
> > > Relying on it is wrong, I agree. There are protections like allocation
> > > throttling that we can fall back to stop memory depletion. However
> > > having a way to free up resources that are not needed by a dying
> > > process quickly would help to avoid throttling which hurts user
> > > experience.
> >
> > I am not opposing speeding up the exit time in general. That is a good
> > thing. Especially for a very large processes (e.g. a DB). But I do not
> > really think we want to expose an API to control this specific aspect.
> 
> Great! Thanks for confirming that the intent is not worthless.
> There were a number of ideas floating both internally and in the 2/2
> of this patchset. I would like to get some input on which
> implementation would be preferable. From your answer sounds like you
> think it should be a generic feature, should not require any new APIs
> or hints from the userspace and should be conducted for all kills
> unconditionally (irrespective of memory pressure, who is waiting for
> victim's death, etc.). Do I understand correctly that this would be
> the preferred solution?

Yes, I think the general tear down solution is much more preferable than
a questionable API. How that solution should look like is an open
question. I am not sure myself to be honest.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 0/2] opportunistic memory reclaim of a killed process
  2019-04-11 20:11         ` Michal Hocko
@ 2019-04-11 21:11           ` Joel Fernandes
  0 siblings, 0 replies; 43+ messages in thread
From: Joel Fernandes @ 2019-04-11 21:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Suren Baghdasaryan, Andrew Morton, David Rientjes,
	Matthew Wilcox, yuzhoujian, jrdr.linux, guro, Johannes Weiner,
	penguin-kernel, ebiederm, shakeelb, Christian Brauner,
	Minchan Kim, Tim Murray, Daniel Colascione, Jann Horn,
	open list:MEMORY MANAGEMENT, lsf-pc, LKML, Cc: Android Kernel

On Thu, Apr 11, 2019 at 10:11:51PM +0200, Michal Hocko wrote:
> On Thu 11-04-19 15:14:30, Joel Fernandes wrote:
> > On Thu, Apr 11, 2019 at 08:12:43PM +0200, Michal Hocko wrote:
> > > On Thu 11-04-19 12:18:33, Joel Fernandes wrote:
> > > > On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > >
> > > > > On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> > > > > [...]
> > > > > > Proposed solution uses existing oom-reaper thread to increase memory
> > > > > > reclaim rate of a killed process and to make this rate more deterministic.
> > > > > > By no means the proposed solution is considered the best and was chosen
> > > > > > because it was simple to implement and allowed for test data collection.
> > > > > > The downside of this solution is that it requires additional “expedite”
> > > > > > hint for something which has to be fast in all cases. Would be great to
> > > > > > find a way that does not require additional hints.
> > > > >
> > > > > I have to say I do not like this much. It is abusing an implementation
> > > > > detail of the OOM implementation and makes it an official API. Also
> > > > > there are some non trivial assumptions to be fullfilled to use the
> > > > > current oom_reaper. First of all all the process groups that share the
> > > > > address space have to be killed. How do you want to guarantee/implement
> > > > > that with a simply kill to a thread/process group?
> > > > 
> > > > Will task_will_free_mem() not bail out in such cases because of
> > > > process_shares_mm() returning true?
> > > 
> > > I am not really sure I understand your question. task_will_free_mem is
> > > just a shortcut to not kill anything if the current process or a victim
> > > is already dying and likely to free memory without killing or spamming
> > > the log. My concern is that this patch allows to invoke the reaper
> > 
> > Got it.
> > 
> > > without guaranteeing the same. So it can only be an optimistic attempt
> > > and then I am wondering how reasonable of an interface this really is.
> > > Userspace send the signal and has no way to find out whether the async
> > > reaping has been scheduled or not.
> > 
> > Could you clarify more what you're asking to guarantee? I cannot picture it.
> > If you mean guaranteeing that "a task is dying anyway and will free its
> > memory on its own", we are calling task_will_free_mem() to check that before
> > invoking the oom reaper.
> 
> No, I am talking about the API aspect. Say you kall kill with the flag
> to make the async address space tear down. Now you cannot really
> guarantee that this is safe to do because the target task might
> clone(CLONE_VM) at any time. So this will be known only once the signal
> is sent, but the calling process has no way to find out. So the caller
> has no way to know what is the actual result of the requested operation.
> That is a poor API in my book.
> 
> > Could you clarify what is the draback if OOM reaper is invoked in parallel to
> > an exiting task which will free its memory soon? It looks like the OOM reaper
> > is taking all the locks necessary (mmap_sem) in particular and is unmapping
> > pages. It seemed to me to be safe, but I am missing what are the main draw
> > backs of this - other than the intereference with core dump. One could be
> > presumably scalability since the since OOM reaper could be bottlenecked by
> > freeing memory on behalf of potentially several dying tasks.
> 
> oom_reaper or any other kernel thread doing the same is a mere
> implementation detail I think. The oom killer doesn't really need the
> oom_reaper to act swiftly because it is there to act as a last resort if
> the oom victim cannot terminate on its own. If you want to offer an
> user space API then you can assume users will like to use it and expect
> a certain behavior but what that is? E.g. what if there are thousands of
> tasks killed this way? Do we care that some of them will not get the
> async treatment? If yes why do we need an API to control that at all?
> 
> Am I more clear now?

Yes, your concerns are more clear now. We will think more about this and your
other responses, thanks a lot.

 - Joel


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-11 17:09     ` Suren Baghdasaryan
  2019-04-11 17:33       ` Daniel Colascione
@ 2019-04-11 21:45       ` Roman Gushchin
  2019-04-11 21:59         ` Suren Baghdasaryan
  1 sibling, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2019-04-11 21:45 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Matthew Wilcox, Andrew Morton, mhocko, David Rientjes,
	yuzhoujian, Souptick Joarder, Johannes Weiner, Tetsuo Handa,
	ebiederm, Shakeel Butt, Christian Brauner, Minchan Kim,
	Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn,
	linux-mm, lsf-pc, LKML, kernel-team

On Thu, Apr 11, 2019 at 10:09:06AM -0700, Suren Baghdasaryan wrote:
> On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > victim process. The usage of this flag is currently limited to SIGKILL
> > > signal and only to privileged users.
> >
> > What is the downside of doing expedited memory reclaim?  ie why not do it
> > every time a process is going to die?

Hello, Suren!

I also like the idea to reap always.

> I think with an implementation that does not use/abuse oom-reaper
> thread this could be done for any kill. As I mentioned oom-reaper is a
> limited resource which has access to memory reserves and should not be
> abused in the way I do in this reference implementation.

In most OOM cases it doesn't matter that much which task to reap,
so I don't think that reusing the oom-reaper thread is bad.
It should be relatively easy to tweak in a way, that it won't
wait for mmap_sem if there are other tasks waiting to be reaped.
Also, the oom code add to the head of the list, and the expedited
killing to the end, or something like this.

The only think, if we're going to reap all tasks, we probably
want to have a per-node oom_reaper thread.

Thanks!

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-11 21:45       ` Roman Gushchin
@ 2019-04-11 21:59         ` Suren Baghdasaryan
  0 siblings, 0 replies; 43+ messages in thread
From: Suren Baghdasaryan @ 2019-04-11 21:59 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Matthew Wilcox, Andrew Morton, mhocko, David Rientjes,
	yuzhoujian, Souptick Joarder, Johannes Weiner, Tetsuo Handa,
	ebiederm, Shakeel Butt, Christian Brauner, Minchan Kim,
	Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn,
	linux-mm, lsf-pc, LKML, kernel-team

On Thu, Apr 11, 2019 at 2:45 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Apr 11, 2019 at 10:09:06AM -0700, Suren Baghdasaryan wrote:
> > On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > > victim process. The usage of this flag is currently limited to SIGKILL
> > > > signal and only to privileged users.
> > >
> > > What is the downside of doing expedited memory reclaim?  ie why not do it
> > > every time a process is going to die?
>
> Hello, Suren!
>
> I also like the idea to reap always.
>
> > I think with an implementation that does not use/abuse oom-reaper
> > thread this could be done for any kill. As I mentioned oom-reaper is a
> > limited resource which has access to memory reserves and should not be
> > abused in the way I do in this reference implementation.
>
> In most OOM cases it doesn't matter that much which task to reap,
> so I don't think that reusing the oom-reaper thread is bad.
> It should be relatively easy to tweak in a way, that it won't
> wait for mmap_sem if there are other tasks waiting to be reaped.
> Also, the oom code add to the head of the list, and the expedited
> killing to the end, or something like this.
>
> The only think, if we're going to reap all tasks, we probably
> want to have a per-node oom_reaper thread.

Thanks for the ideas Roman. I'll take some time to digest the input
from everybody. What I heard from everyone is that we want this to be
a part of generic kill functionality which does not require a change
in userspace API.

> Thanks!
>
> --
> You received this message because you are subscribed to the Google Groups "kernel-team" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-11 17:47           ` Daniel Colascione
@ 2019-04-12  6:49             ` Michal Hocko
  2019-04-12 14:15               ` Suren Baghdasaryan
  2019-04-12 21:03             ` Matthew Wilcox
  1 sibling, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2019-04-12  6:49 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Matthew Wilcox, Suren Baghdasaryan, Andrew Morton,
	David Rientjes, yuzhoujian, Souptick Joarder, Roman Gushchin,
	Johannes Weiner, Tetsuo Handa, Eric W. Biederman, Shakeel Butt,
	Christian Brauner, Minchan Kim, Tim Murray, Joel Fernandes,
	Jann Horn, linux-mm, lsf-pc, LKML, kernel-team

On Thu 11-04-19 10:47:50, Daniel Colascione wrote:
> On Thu, Apr 11, 2019 at 10:36 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Apr 11, 2019 at 10:33:32AM -0700, Daniel Colascione wrote:
> > > On Thu, Apr 11, 2019 at 10:09 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > >
> > > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > > > > victim process. The usage of this flag is currently limited to SIGKILL
> > > > > > signal and only to privileged users.
> > > > >
> > > > > What is the downside of doing expedited memory reclaim?  ie why not do it
> > > > > every time a process is going to die?
> > > >
> > > > I think with an implementation that does not use/abuse oom-reaper
> > > > thread this could be done for any kill. As I mentioned oom-reaper is a
> > > > limited resource which has access to memory reserves and should not be
> > > > abused in the way I do in this reference implementation.
> > > > While there might be downsides that I don't know of, I'm not sure it's
> > > > required to hurry every kill's memory reclaim. I think there are cases
> > > > when resource deallocation is critical, for example when we kill to
> > > > relieve resource shortage and there are kills when reclaim speed is
> > > > not essential. It would be great if we can identify urgent cases
> > > > without userspace hints, so I'm open to suggestions that do not
> > > > involve additional flags.
> > >
> > > I was imagining a PI-ish approach where we'd reap in case an RT
> > > process was waiting on the death of some other process. I'd still
> > > prefer the API I proposed in the other message because it gets the
> > > kernel out of the business of deciding what the right signal is. I'm a
> > > huge believer in "mechanism, not policy".
> >
> > It's not a question of the kernel deciding what the right signal is.
> > The kernel knows whether a signal is fatal to a particular process or not.
> > The question is whether the killing process should do the work of reaping
> > the dying process's resources sometimes, always or never.  Currently,
> > that is never (the process reaps its own resources); Suren is suggesting
> > sometimes, and I'm asking "Why not always?"
> 
> FWIW, Suren's initial proposal is that the oom_reaper kthread do the
> reaping, not the process sending the kill. Are you suggesting that
> sending SIGKILL should spend a while in signal delivery reaping pages
> before returning? I thought about just doing it this way, but I didn't
> like the idea: it'd slow down mass-killing programs like killall(1).
> Programs expect sending SIGKILL to be a fast operation that returns
> immediately.

I was thinking about this as well. And SYNC_SIGKILL would workaround the
current expectations of how quick the current implementation is. The
harder part would what is the actual semantic. Does the kill wait until
the target task is TASK_DEAD or is there an intermediate step that would
we could call it end of the day and still have a reasonable semantic
(e.g. the original pid is really not alive anymore).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-11 15:33   ` Matthew Wilcox
  2019-04-11 17:05     ` Johannes Weiner
  2019-04-11 17:09     ` Suren Baghdasaryan
@ 2019-04-12  6:53     ` Michal Hocko
  2019-04-12 14:10       ` Suren Baghdasaryan
  2019-04-12 14:14       ` Daniel Colascione
  2 siblings, 2 replies; 43+ messages in thread
From: Michal Hocko @ 2019-04-12  6:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Suren Baghdasaryan, akpm, rientjes, yuzhoujian, jrdr.linux, guro,
	hannes, penguin-kernel, ebiederm, shakeelb, christian, minchan,
	timmurray, dancol, joel, jannh, linux-mm, lsf-pc, linux-kernel,
	kernel-team

On Thu 11-04-19 08:33:13, Matthew Wilcox wrote:
> On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > victim process. The usage of this flag is currently limited to SIGKILL
> > signal and only to privileged users.
> 
> What is the downside of doing expedited memory reclaim?  ie why not do it
> every time a process is going to die?
 
Well, you are tearing down an address space which might be still in use
because the task not fully dead yeat. So there are two downsides AFAICS.
Core dumping which will not see the reaped memory so the resulting
coredump might be incomplete. And unexpected #PF/gup on the reaped
memory will result in SIGBUS. These are things that we have closed our
eyes in the oom context because they likely do not matter. If we want to
use the same technique for other usecases then we have to think how much
that matter again.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-12  6:53     ` Michal Hocko
@ 2019-04-12 14:10       ` Suren Baghdasaryan
  2019-04-12 14:14       ` Daniel Colascione
  1 sibling, 0 replies; 43+ messages in thread
From: Suren Baghdasaryan @ 2019-04-12 14:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, Andrew Morton, David Rientjes, yuzhoujian,
	Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa,
	Eric W. Biederman, Shakeel Butt, Christian Brauner, Minchan Kim,
	Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn,
	linux-mm, lsf-pc, LKML, kernel-team

On Thu, Apr 11, 2019 at 11:53 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 11-04-19 08:33:13, Matthew Wilcox wrote:
> > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > victim process. The usage of this flag is currently limited to SIGKILL
> > > signal and only to privileged users.
> >
> > What is the downside of doing expedited memory reclaim?  ie why not do it
> > every time a process is going to die?
>
> Well, you are tearing down an address space which might be still in use
> because the task not fully dead yeat. So there are two downsides AFAICS.
> Core dumping which will not see the reaped memory so the resulting
> coredump might be incomplete. And unexpected #PF/gup on the reaped
> memory will result in SIGBUS. These are things that we have closed our
> eyes in the oom context because they likely do not matter. If we want to
> use the same technique for other usecases then we have to think how much
> that matter again.
>

Sorry, resending with corrected settings...
After some internal discussions we realized that there is one
additional downside of doing reaping unconditionally. If this async
reaping is done on a more efficient and energy hungrier CPU
irrespective of urgency of the kill it should end up costing more
power overall (I’m referring here to assimetric architectures like ARM
big.LITTLE). Obviously quantifying that cost is not easy as it depends
on the usecase and a particular system but it won’t be zero. So I
think we will need some gating condition after all.

> --
> Michal Hocko
> SUSE Labs
>
> --
> You received this message because you are subscribed to the Google Groups "kernel-team" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-12  6:53     ` Michal Hocko
  2019-04-12 14:10       ` Suren Baghdasaryan
@ 2019-04-12 14:14       ` Daniel Colascione
  2019-04-12 15:30         ` Daniel Colascione
  2019-04-25 16:09         ` Suren Baghdasaryan
  1 sibling, 2 replies; 43+ messages in thread
From: Daniel Colascione @ 2019-04-12 14:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, Suren Baghdasaryan, Andrew Morton,
	David Rientjes, yuzhoujian, Souptick Joarder, Roman Gushchin,
	Johannes Weiner, Tetsuo Handa, Eric W. Biederman, Shakeel Butt,
	Christian Brauner, Minchan Kim, Tim Murray, Joel Fernandes,
	Jann Horn, linux-mm, lsf-pc, linux-kernel, Android Kernel Team

On Thu, Apr 11, 2019 at 11:53 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 11-04-19 08:33:13, Matthew Wilcox wrote:
> > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > victim process. The usage of this flag is currently limited to SIGKILL
> > > signal and only to privileged users.
> >
> > What is the downside of doing expedited memory reclaim?  ie why not do it
> > every time a process is going to die?
>
> Well, you are tearing down an address space which might be still in use
> because the task not fully dead yeat. So there are two downsides AFAICS.
> Core dumping which will not see the reaped memory so the resulting

Test for SIGNAL_GROUP_COREDUMP before doing any of this then. If you
try to start a core dump after reaping begins, too bad: you could have
raced with process death anyway.

> coredump might be incomplete. And unexpected #PF/gup on the reaped
> memory will result in SIGBUS.

It's a dying process. Why even bother returning from the fault
handler? Just treat that situation as a thread exit. There's no need
to make this observable to userspace at all.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-12  6:49             ` Michal Hocko
@ 2019-04-12 14:15               ` Suren Baghdasaryan
  2019-04-12 14:20                 ` Daniel Colascione
  0 siblings, 1 reply; 43+ messages in thread
From: Suren Baghdasaryan @ 2019-04-12 14:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Daniel Colascione, Matthew Wilcox, Andrew Morton, David Rientjes,
	yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner,
	Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Christian Brauner,
	Minchan Kim, Tim Murray, Joel Fernandes, Jann Horn, linux-mm,
	lsf-pc, LKML, kernel-team

On Thu, Apr 11, 2019 at 11:49 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 11-04-19 10:47:50, Daniel Colascione wrote:
> > On Thu, Apr 11, 2019 at 10:36 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Thu, Apr 11, 2019 at 10:33:32AM -0700, Daniel Colascione wrote:
> > > > On Thu, Apr 11, 2019 at 10:09 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > >
> > > > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > > > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > > > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > > > > > victim process. The usage of this flag is currently limited to SIGKILL
> > > > > > > signal and only to privileged users.
> > > > > >
> > > > > > What is the downside of doing expedited memory reclaim?  ie why not do it
> > > > > > every time a process is going to die?
> > > > >
> > > > > I think with an implementation that does not use/abuse oom-reaper
> > > > > thread this could be done for any kill. As I mentioned oom-reaper is a
> > > > > limited resource which has access to memory reserves and should not be
> > > > > abused in the way I do in this reference implementation.
> > > > > While there might be downsides that I don't know of, I'm not sure it's
> > > > > required to hurry every kill's memory reclaim. I think there are cases
> > > > > when resource deallocation is critical, for example when we kill to
> > > > > relieve resource shortage and there are kills when reclaim speed is
> > > > > not essential. It would be great if we can identify urgent cases
> > > > > without userspace hints, so I'm open to suggestions that do not
> > > > > involve additional flags.
> > > >
> > > > I was imagining a PI-ish approach where we'd reap in case an RT
> > > > process was waiting on the death of some other process. I'd still
> > > > prefer the API I proposed in the other message because it gets the
> > > > kernel out of the business of deciding what the right signal is. I'm a
> > > > huge believer in "mechanism, not policy".
> > >
> > > It's not a question of the kernel deciding what the right signal is.
> > > The kernel knows whether a signal is fatal to a particular process or not.
> > > The question is whether the killing process should do the work of reaping
> > > the dying process's resources sometimes, always or never.  Currently,
> > > that is never (the process reaps its own resources); Suren is suggesting
> > > sometimes, and I'm asking "Why not always?"
> >
> > FWIW, Suren's initial proposal is that the oom_reaper kthread do the
> > reaping, not the process sending the kill. Are you suggesting that
> > sending SIGKILL should spend a while in signal delivery reaping pages
> > before returning? I thought about just doing it this way, but I didn't
> > like the idea: it'd slow down mass-killing programs like killall(1).
> > Programs expect sending SIGKILL to be a fast operation that returns
> > immediately.
>
> I was thinking about this as well. And SYNC_SIGKILL would workaround the
> current expectations of how quick the current implementation is. The
> harder part would what is the actual semantic. Does the kill wait until
> the target task is TASK_DEAD or is there an intermediate step that would
> we could call it end of the day and still have a reasonable semantic
> (e.g. the original pid is really not alive anymore).

I think Daniel's proposal was trying to address that. With an input of
how many pages user wants to reclaim asynchronously and return value
of how much was actually reclaimed it contains the condition when to
stop and the reply how successful we could accomplish that. Since it
returns the number of pages reclaimed I assume the call does not
return until it reaped enough pages.

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-12 14:15               ` Suren Baghdasaryan
@ 2019-04-12 14:20                 ` Daniel Colascione
  0 siblings, 0 replies; 43+ messages in thread
From: Daniel Colascione @ 2019-04-12 14:20 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Michal Hocko, Matthew Wilcox, Andrew Morton, David Rientjes,
	yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner,
	Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Christian Brauner,
	Minchan Kim, Tim Murray, Joel Fernandes, Jann Horn, linux-mm,
	lsf-pc, LKML, kernel-team

On Fri, Apr 12, 2019 at 7:15 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Apr 11, 2019 at 11:49 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 11-04-19 10:47:50, Daniel Colascione wrote:
> > > On Thu, Apr 11, 2019 at 10:36 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > >
> > > > On Thu, Apr 11, 2019 at 10:33:32AM -0700, Daniel Colascione wrote:
> > > > > On Thu, Apr 11, 2019 at 10:09 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox <willy@infradead.org> wrote:
> > > > > > >
> > > > > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > > > > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > > > > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > > > > > > victim process. The usage of this flag is currently limited to SIGKILL
> > > > > > > > signal and only to privileged users.
> > > > > > >
> > > > > > > What is the downside of doing expedited memory reclaim?  ie why not do it
> > > > > > > every time a process is going to die?
> > > > > >
> > > > > > I think with an implementation that does not use/abuse oom-reaper
> > > > > > thread this could be done for any kill. As I mentioned oom-reaper is a
> > > > > > limited resource which has access to memory reserves and should not be
> > > > > > abused in the way I do in this reference implementation.
> > > > > > While there might be downsides that I don't know of, I'm not sure it's
> > > > > > required to hurry every kill's memory reclaim. I think there are cases
> > > > > > when resource deallocation is critical, for example when we kill to
> > > > > > relieve resource shortage and there are kills when reclaim speed is
> > > > > > not essential. It would be great if we can identify urgent cases
> > > > > > without userspace hints, so I'm open to suggestions that do not
> > > > > > involve additional flags.
> > > > >
> > > > > I was imagining a PI-ish approach where we'd reap in case an RT
> > > > > process was waiting on the death of some other process. I'd still
> > > > > prefer the API I proposed in the other message because it gets the
> > > > > kernel out of the business of deciding what the right signal is. I'm a
> > > > > huge believer in "mechanism, not policy".
> > > >
> > > > It's not a question of the kernel deciding what the right signal is.
> > > > The kernel knows whether a signal is fatal to a particular process or not.
> > > > The question is whether the killing process should do the work of reaping
> > > > the dying process's resources sometimes, always or never.  Currently,
> > > > that is never (the process reaps its own resources); Suren is suggesting
> > > > sometimes, and I'm asking "Why not always?"
> > >
> > > FWIW, Suren's initial proposal is that the oom_reaper kthread do the
> > > reaping, not the process sending the kill. Are you suggesting that
> > > sending SIGKILL should spend a while in signal delivery reaping pages
> > > before returning? I thought about just doing it this way, but I didn't
> > > like the idea: it'd slow down mass-killing programs like killall(1).
> > > Programs expect sending SIGKILL to be a fast operation that returns
> > > immediately.
> >
> > I was thinking about this as well. And SYNC_SIGKILL would workaround the

SYNC_SIGKILL (which, I presume, blocks in kill(2)) was proposed in
many occasions while we discussed pidfd waits over the past six months
or so. We've decided to just make pidfds pollable instead. The kernel
already has several ways to express the idea that a task should wait
for another task to die, and I don't think we need another. If you
want a process that's waiting for a task to exit to help reap that
task, great --- that's an option we've talked about --- but we don't
need new interface to do it, since the kernel already has all the
information it needs.

> > current expectations of how quick the current implementation is. The
> > harder part would what is the actual semantic. Does the kill wait until
> > the target task is TASK_DEAD or is there an intermediate step that would
> > we could call it end of the day and still have a reasonable semantic
> > (e.g. the original pid is really not alive anymore).
>
> I think Daniel's proposal was trying to address that. With an input of
> how many pages user wants to reclaim asynchronously and return value
> of how much was actually reclaimed it contains the condition when to
> stop and the reply how successful we could accomplish that. Since it
> returns the number of pages reclaimed I assume the call does not
> return until it reaped enough pages.

Right. I want to punt as much "policy" as possible to userspace. Just
using a user thread to do the reaping not only solves the policy
problem (since it's userspace that controls priority, affinity,
retries, and so on), but also simplifies the implementation
kernel-side. I can imagine situations where, depending on device
energy state or even charger or screen state we might want to reap
more or less aggressively, or not at all. I wouldn't want to burden
the kernel with having to get that right when userspace could make the
decision.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-12 14:14       ` Daniel Colascione
@ 2019-04-12 15:30         ` Daniel Colascione
  2019-04-25 16:09         ` Suren Baghdasaryan
  1 sibling, 0 replies; 43+ messages in thread
From: Daniel Colascione @ 2019-04-12 15:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, Suren Baghdasaryan, Andrew Morton,
	David Rientjes, yuzhoujian, Souptick Joarder, Roman Gushchin,
	Johannes Weiner, Tetsuo Handa, Eric W. Biederman, Shakeel Butt,
	Christian Brauner, Minchan Kim, Tim Murray, Joel Fernandes,
	Jann Horn, linux-mm, lsf-pc, linux-kernel, Android Kernel Team

On Fri, Apr 12, 2019 at 7:14 AM Daniel Colascione <dancol@google.com> wrote:
>
> On Thu, Apr 11, 2019 at 11:53 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 11-04-19 08:33:13, Matthew Wilcox wrote:
> > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > > victim process. The usage of this flag is currently limited to SIGKILL
> > > > signal and only to privileged users.
> > >
> > > What is the downside of doing expedited memory reclaim?  ie why not do it
> > > every time a process is going to die?
> >
> > Well, you are tearing down an address space which might be still in use
> > because the task not fully dead yeat. So there are two downsides AFAICS.
> > Core dumping which will not see the reaped memory so the resulting
>
> Test for SIGNAL_GROUP_COREDUMP before doing any of this then. If you
> try to start a core dump after reaping begins, too bad: you could have
> raced with process death anyway.
>
> > coredump might be incomplete. And unexpected #PF/gup on the reaped
> > memory will result in SIGBUS.
>
> It's a dying process. Why even bother returning from the fault
> handler? Just treat that situation as a thread exit. There's no need
> to make this observable to userspace at all.

Just for clarity, checking the code, I think we already do this.
zap_other_threads sets SIGKILL pending on every thread in the group,
and we'll handle SIGKILL in the process of taking any page fault or
doing any system call, so I don't think it's actually possible for a
thread in a dying process to observe the SIGBUS that reaping in theory
can generate.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-11 17:47           ` Daniel Colascione
  2019-04-12  6:49             ` Michal Hocko
@ 2019-04-12 21:03             ` Matthew Wilcox
  1 sibling, 0 replies; 43+ messages in thread
From: Matthew Wilcox @ 2019-04-12 21:03 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Suren Baghdasaryan, Andrew Morton, Michal Hocko, David Rientjes,
	yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner,
	Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Christian Brauner,
	Minchan Kim, Tim Murray, Joel Fernandes, Jann Horn, linux-mm,
	lsf-pc, LKML, kernel-team

On Thu, Apr 11, 2019 at 10:47:50AM -0700, Daniel Colascione wrote:
> On Thu, Apr 11, 2019 at 10:36 AM Matthew Wilcox <willy@infradead.org> wrote:
> > It's not a question of the kernel deciding what the right signal is.
> > The kernel knows whether a signal is fatal to a particular process or not.
> > The question is whether the killing process should do the work of reaping
> > the dying process's resources sometimes, always or never.  Currently,
> > that is never (the process reaps its own resources); Suren is suggesting
> > sometimes, and I'm asking "Why not always?"
> 
> FWIW, Suren's initial proposal is that the oom_reaper kthread do the
> reaping, not the process sending the kill. Are you suggesting that
> sending SIGKILL should spend a while in signal delivery reaping pages
> before returning? I thought about just doing it this way, but I didn't
> like the idea: it'd slow down mass-killing programs like killall(1).
> Programs expect sending SIGKILL to be a fast operation that returns
> immediately.

Suren said that the implementation in this proposal wasn't to be taken
literally.

You've raised some good points here though.  Do mass-killing programs
like kill with a pgid or killall expect the signal-sending process to
be fast, or do they not really care?

From the kernel's point of view, the same work has to be done, no matter
whether the killer or the victim does it.  Should the work be accounted
to the killer or the victim?  It probably doesn't matter.  The victim
doing the work allows for parallelisation, but I'm not convinced that's
important either.

I see another advantage for the killer doing the work -- if the task
is currently blocking on I/O (eg to an NFS mount that has gone away),
the killer can get rid of the task's page tables.  If we have to wait
for the I/O to complete before the victim reaps its own page tables,
we may be waiting forever.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
  2019-04-12 14:14       ` Daniel Colascione
  2019-04-12 15:30         ` Daniel Colascione
@ 2019-04-25 16:09         ` Suren Baghdasaryan
  1 sibling, 0 replies; 43+ messages in thread
From: Suren Baghdasaryan @ 2019-04-25 16:09 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Michal Hocko, Matthew Wilcox, Suren Baghdasaryan, Andrew Morton,
	David Rientjes, yuzhoujian, Souptick Joarder, Roman Gushchin,
	Johannes Weiner, Tetsuo Handa, Eric W. Biederman, Shakeel Butt,
	Christian Brauner, Minchan Kim, Tim Murray, Joel Fernandes,
	Jann Horn, linux-mm, lsf-pc, linux-kernel, Android Kernel Team,
	Oleg Nesterov

On Fri, Apr 12, 2019 at 7:14 AM Daniel Colascione <dancol@google.com> wrote:
>
> On Thu, Apr 11, 2019 at 11:53 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 11-04-19 08:33:13, Matthew Wilcox wrote:
> > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > > victim process. The usage of this flag is currently limited to SIGKILL
> > > > signal and only to privileged users.
> > >
> > > What is the downside of doing expedited memory reclaim?  ie why not do it
> > > every time a process is going to die?
> >
> > Well, you are tearing down an address space which might be still in use
> > because the task not fully dead yeat. So there are two downsides AFAICS.
> > Core dumping which will not see the reaped memory so the resulting
>
> Test for SIGNAL_GROUP_COREDUMP before doing any of this then. If you
> try to start a core dump after reaping begins, too bad: you could have
> raced with process death anyway.
>
> > coredump might be incomplete. And unexpected #PF/gup on the reaped
> > memory will result in SIGBUS.
>
> It's a dying process. Why even bother returning from the fault
> handler? Just treat that situation as a thread exit. There's no need
> to make this observable to userspace at all.

I've spent some more time to investigate possible effects of reaping
on coredumps and asked Oleg Nesterov who worked on patchsets that
prioritize SIGKILLs over coredump activity
(https://lkml.org/lkml/2013/2/17/118). Current do_coredump
implementation seems to handle the case of SIGKILL interruption by
bailing out whenever dump_interrupted() returns true and that would be
the case with pending SIGKILL. So in the case of race when coredump
happens first and SIGKILL comes next interrupting the coredump seems
to result in no change in behavior and reaping memory proactively
seems to have no side effects.
An opposite race when SIGKILL gets posted and then coredump happens
seems impossible because do_coredump won't be called from get_signal
due to signal_group_exit check (get_signal checks signal_group_exit
while holding sighand->siglock and complete_signal sets
SIGNAL_GROUP_EXIT while holding the same lock). There is a path from
__seccomp_filter calling do_coredump while processing
SECCOMP_RET_KILL_xxx but even then it should bail out when
coredump_wait()->zap_threads(current) checks signal_group_exit().
(Thanks Oleg for clarifying this for me).

If we are really concerned about possible increase in failed coredumps
because of the proactive reaping I could make it conditional on
whether coredumping mm is possible by placing this feature behind
!get_dumpable(mm) condition. Another possibility is to check
RLIMIT_CORE to decide if coredumps are possible (although if pipe is
used for coredump that limit seems to be ignored, so that check would
have to take this into consideration).

On the issue of SIGBUS happening when accessed memory was already
reaped, my understanding that SIGBUS being a synchronous signal will
still have to be fetched using dequeue_synchronous_signal from
get_signal but not before signal_group_exit is checked. So again if
SIGKILL is pending I think SIGBUS will be ignored (please correct me
if that's not correct).

One additional question I would like to clarify is whether per-node
reapers like Roman suggested would make a big difference (All CPUs
that I've seen used for Android are single-node ones, so looking for
more feedback here). If it's important then reaping victim's memory by
the killer is probably not an option.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 1/2] mm: oom: expose expedite_reclaim to use oom_reaper outside of oom_kill.c
  2019-04-11  1:43 ` [RFC 1/2] mm: oom: expose expedite_reclaim to use oom_reaper outside of oom_kill.c Suren Baghdasaryan
@ 2019-04-25 21:12   ` Tetsuo Handa
  2019-04-25 21:56     ` Suren Baghdasaryan
  0 siblings, 1 reply; 43+ messages in thread
From: Tetsuo Handa @ 2019-04-25 21:12 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, mhocko, rientjes, willy, yuzhoujian, jrdr.linux, guro,
	hannes, ebiederm, shakeelb, christian, minchan, timmurray,
	dancol, joel, jannh, linux-mm, lsf-pc, linux-kernel, kernel-team

On 2019/04/11 10:43, Suren Baghdasaryan wrote:
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 3a2484884cfd..6449710c8a06 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1102,6 +1102,21 @@ bool out_of_memory(struct oom_control *oc)
>  	return !!oc->chosen;
>  }
>  
> +bool expedite_reclaim(struct task_struct *task)
> +{
> +	bool res = false;
> +
> +	task_lock(task);
> +	if (task_will_free_mem(task)) {

mark_oom_victim() needs to be called under oom_lock mutex after
checking that oom_killer_disabled == false. Since you are trying
to trigger this function from signal handler, you might want to
defer until e.g. WQ context.

> +		mark_oom_victim(task);
> +		wake_oom_reaper(task);
> +		res = true;
> +	}
> +	task_unlock(task);
> +
> +	return res;
> +}
> +
>  /*
>   * The pagefault handler calls here because it is out of memory, so kill a
>   * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC 1/2] mm: oom: expose expedite_reclaim to use oom_reaper outside of oom_kill.c
  2019-04-25 21:12   ` Tetsuo Handa
@ 2019-04-25 21:56     ` Suren Baghdasaryan
  0 siblings, 0 replies; 43+ messages in thread
From: Suren Baghdasaryan @ 2019-04-25 21:56 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Andrew Morton, Michal Hocko, David Rientjes, Matthew Wilcox,
	yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner,
	Eric W. Biederman, Shakeel Butt, Christian Brauner, Minchan Kim,
	Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn,
	linux-mm, lsf-pc, LKML, kernel-team

On Thu, Apr 25, 2019 at 2:13 PM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> On 2019/04/11 10:43, Suren Baghdasaryan wrote:
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 3a2484884cfd..6449710c8a06 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -1102,6 +1102,21 @@ bool out_of_memory(struct oom_control *oc)
> >       return !!oc->chosen;
> >  }
> >
> > +bool expedite_reclaim(struct task_struct *task)
> > +{
> > +     bool res = false;
> > +
> > +     task_lock(task);
> > +     if (task_will_free_mem(task)) {
>
> mark_oom_victim() needs to be called under oom_lock mutex after
> checking that oom_killer_disabled == false. Since you are trying
> to trigger this function from signal handler, you might want to
> defer until e.g. WQ context.

Thanks for the tip! I'll take this into account in the new design.
Just thinking out loud... AFAIU oom_lock is there to protect against
multiple concurrent out_of_memory calls from different contexts and
prevent overly-aggressive process killing. For my purposes when
reaping memory of a killed process we don't have this concern (we did
not initiate the killing, SIGKILL was explicitly requested). I'll
probably need some synchronization there but not for purposes of
preventing multiple concurrent reapers. In any case, thank you for the
feedback!

>
> > +             mark_oom_victim(task);
> > +             wake_oom_reaper(task);
> > +             res = true;
> > +     }
> > +     task_unlock(task);
> > +
> > +     return res;
> > +}
> > +
> >  /*
> >   * The pagefault handler calls here because it is out of memory, so kill a
> >   * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
> >

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2019-04-25 21:56 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-11  1:43 [RFC 0/2] opportunistic memory reclaim of a killed process Suren Baghdasaryan
2019-04-11  1:43 ` [RFC 1/2] mm: oom: expose expedite_reclaim to use oom_reaper outside of oom_kill.c Suren Baghdasaryan
2019-04-25 21:12   ` Tetsuo Handa
2019-04-25 21:56     ` Suren Baghdasaryan
2019-04-11  1:43 ` [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing Suren Baghdasaryan
2019-04-11 10:30   ` Christian Brauner
2019-04-11 10:34     ` Christian Brauner
2019-04-11 15:18     ` Suren Baghdasaryan
2019-04-11 15:23       ` Suren Baghdasaryan
2019-04-11 16:25         ` Daniel Colascione
2019-04-11 15:33   ` Matthew Wilcox
2019-04-11 17:05     ` Johannes Weiner
2019-04-11 17:09     ` Suren Baghdasaryan
2019-04-11 17:33       ` Daniel Colascione
2019-04-11 17:36         ` Matthew Wilcox
2019-04-11 17:47           ` Daniel Colascione
2019-04-12  6:49             ` Michal Hocko
2019-04-12 14:15               ` Suren Baghdasaryan
2019-04-12 14:20                 ` Daniel Colascione
2019-04-12 21:03             ` Matthew Wilcox
2019-04-11 17:52           ` Suren Baghdasaryan
2019-04-11 21:45       ` Roman Gushchin
2019-04-11 21:59         ` Suren Baghdasaryan
2019-04-12  6:53     ` Michal Hocko
2019-04-12 14:10       ` Suren Baghdasaryan
2019-04-12 14:14       ` Daniel Colascione
2019-04-12 15:30         ` Daniel Colascione
2019-04-25 16:09         ` Suren Baghdasaryan
2019-04-11 10:51 ` [RFC 0/2] opportunistic memory reclaim of a killed process Michal Hocko
2019-04-11 16:18   ` Joel Fernandes
2019-04-11 18:12     ` Michal Hocko
2019-04-11 19:14       ` Joel Fernandes
2019-04-11 20:11         ` Michal Hocko
2019-04-11 21:11           ` Joel Fernandes
2019-04-11 16:20   ` Sandeep Patil
2019-04-11 16:47   ` Suren Baghdasaryan
2019-04-11 18:19     ` Michal Hocko
2019-04-11 19:56       ` Suren Baghdasaryan
2019-04-11 20:17         ` Michal Hocko
2019-04-11 17:19   ` Johannes Weiner
2019-04-11 11:51 ` [Lsf-pc] " Rik van Riel
2019-04-11 12:16   ` Michal Hocko
2019-04-11 16:54     ` Suren Baghdasaryan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).