* [RFC 0/2] opportunistic memory reclaim of a killed process
@ 2019-04-11  1:43 Suren Baghdasaryan
  2019-04-11  1:43 ` [RFC 1/2] mm: oom: expose expedite_reclaim to use oom_reaper outside of oom_kill.c Suren Baghdasaryan
  ` (3 more replies)
  0 siblings, 4 replies; 43+ messages in thread

From: Suren Baghdasaryan @ 2019-04-11  1:43 UTC (permalink / raw)
To: akpm
Cc: mhocko, rientjes, willy, yuzhoujian, jrdr.linux, guro, hannes,
    penguin-kernel, ebiederm, shakeelb, christian, minchan, timmurray,
    dancol, joel, jannh, surenb, linux-mm, lsf-pc, linux-kernel,
    kernel-team

The time needed to kill a process and free its memory can be critical when the kill is performed to prevent memory shortages from affecting system responsiveness. In the case of Android, where processes can be restarted easily, killing a less important background process is preferred to delaying or throttling an interactive foreground process. At the same time, unnecessary kills should be avoided because they cause delays when the killed process is needed again. This requires a balanced decision from the system software about how long a kill can be postponed in the hope that memory usage will decrease without such drastic measures.

Because killing a process and reclaiming its memory is not an instant operation, a margin of free memory has to be maintained to prevent system performance from deteriorating while the memory of the killed process is being reclaimed. The size of this margin depends on the minimum reclaim rate needed to cover the worst-case scenario, and this minimum rate should be deterministic. Note that on asymmetric architectures like ARM big.LITTLE the reclaim rate can vary dramatically depending on which core it is performed on (see the test results below). It is a common scenario for a non-essential victim process to be restricted to a less performant or throttled CPU for power-saving purposes, which makes the worst-case reclaim rate scenario very probable.
Cases in which the victim's memory reclaim is delayed further, for example because the process is blocked in an uninterruptible sleep or is performing a time-consuming operation, make the reclaim time even more unpredictable. Increasing the memory reclaim rate and making it more deterministic would allow for a smaller free memory margin and would lead to more opportunities to avoid killing a process. Note that while other strategies, such as throttling memory allocations, are viable and can be employed for other non-essential processes, they would affect the user experience if applied to an interactive process.

The proposed solution uses the existing oom_reaper thread to increase the memory reclaim rate of a killed process and to make this rate more deterministic. The proposed solution is by no means considered the best; it was chosen because it was simple to implement and allowed test data to be collected. Its downside is that it requires an additional "expedite" hint for something that has to be fast in all cases. It would be great to find a way that does not require additional hints. Other possible approaches include:

- Implementing a dedicated syscall that performs opportunistic reclaim in the context of the process waiting for the victim's death. A natural boost bonus occurs if the waiting process has high or RT priority and is not limited by a cpuset cgroup in its CPU choices.
- Implementing a mechanism that performs opportunistic reclaim unconditionally whenever it is possible (similar to the checks in task_will_free_mem()).
- Implementing opportunistic reclaim that uses the shrinker interface, PSI or other memory pressure indications as a hint to engage.

Test details:
Tests were performed on a Qualcomm® Snapdragon™ 845 8-core ARM big.LITTLE system with 4 little cores (0.3-1.6GHz) and 4 big cores (0.8-2.5GHz) running Android. Memory reclaim speed was measured using the signal/signal_generate, kmem/rss_stat and sched/sched_process_exit traces.
Test results:

                                  normal kills     expedited kills
  powersave governor, min freq
    little                         856 MB/sec       3236 MB/sec
    big                           5084 MB/sec       6144 MB/sec
  performance governor, max freq
    little                        5602 MB/sec       8144 MB/sec
    big                          14656 MB/sec      12398 MB/sec
  schedutil governor (default)
    little                        2386 MB/sec       3908 MB/sec
    big                           7282 MB/sec       6820-16386 MB/sec
  ================================================================
  min reclaim speed:               856 MB/sec       3236 MB/sec

The patches are based on 5.1-rc1.

Suren Baghdasaryan (2):
  mm: oom: expose expedite_reclaim to use oom_reaper outside of
    oom_kill.c
  signal: extend pidfd_send_signal() to allow expedited process killing

 include/linux/oom.h          |  1 +
 include/linux/sched/signal.h |  3 ++-
 include/linux/signal.h       | 11 ++++++++++-
 ipc/mqueue.c                 |  2 +-
 kernel/signal.c              | 37 ++++++++++++++++++++++++++++--------
 kernel/time/itimer.c         |  2 +-
 mm/oom_kill.c                | 15 +++++++++++++++
 7 files changed, 59 insertions(+), 12 deletions(-)

-- 
2.21.0.392.gf8f6787159e-goog

^ permalink raw reply	[flat|nested] 43+ messages in thread
* [RFC 1/2] mm: oom: expose expedite_reclaim to use oom_reaper outside of oom_kill.c
  2019-04-11  1:43 [RFC 0/2] opportunistic memory reclaim of a killed process Suren Baghdasaryan
@ 2019-04-11  1:43 ` Suren Baghdasaryan
  2019-04-25 21:12   ` Tetsuo Handa
  2019-04-11  1:43 ` [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing Suren Baghdasaryan
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 43+ messages in thread

From: Suren Baghdasaryan @ 2019-04-11  1:43 UTC (permalink / raw)
To: akpm
Cc: mhocko, rientjes, willy, yuzhoujian, jrdr.linux, guro, hannes,
    penguin-kernel, ebiederm, shakeelb, christian, minchan, timmurray,
    dancol, joel, jannh, surenb, linux-mm, lsf-pc, linux-kernel,
    kernel-team

Create an API to allow users outside of oom_kill.c to mark a victim and
wake up the oom_reaper thread for expedited memory reclaim of the process
being killed.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/oom.h |  1 +
 mm/oom_kill.c       | 15 +++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index d07992009265..6c043c7518c1 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -112,6 +112,7 @@ extern unsigned long oom_badness(struct task_struct *p,
 		unsigned long totalpages);
 
 extern bool out_of_memory(struct oom_control *oc);
+extern bool expedite_reclaim(struct task_struct *task);
 
 extern void exit_oom_victim(void);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3a2484884cfd..6449710c8a06 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1102,6 +1102,21 @@ bool out_of_memory(struct oom_control *oc)
 	return !!oc->chosen;
 }
 
+bool expedite_reclaim(struct task_struct *task)
+{
+	bool res = false;
+
+	task_lock(task);
+	if (task_will_free_mem(task)) {
+		mark_oom_victim(task);
+		wake_oom_reaper(task);
+		res = true;
+	}
+	task_unlock(task);
+
+	return res;
+}
+
 /*
  * The pagefault handler calls here because it is out of memory, so kill a
  * memory-hogging task.
If oom_lock is held by somebody else, a parallel oom
-- 
2.21.0.392.gf8f6787159e-goog

^ permalink raw reply related	[flat|nested] 43+ messages in thread
* Re: [RFC 1/2] mm: oom: expose expedite_reclaim to use oom_reaper outside of oom_kill.c 2019-04-11 1:43 ` [RFC 1/2] mm: oom: expose expedite_reclaim to use oom_reaper outside of oom_kill.c Suren Baghdasaryan @ 2019-04-25 21:12 ` Tetsuo Handa 2019-04-25 21:56 ` Suren Baghdasaryan 0 siblings, 1 reply; 43+ messages in thread From: Tetsuo Handa @ 2019-04-25 21:12 UTC (permalink / raw) To: Suren Baghdasaryan Cc: akpm, mhocko, rientjes, willy, yuzhoujian, jrdr.linux, guro, hannes, ebiederm, shakeelb, christian, minchan, timmurray, dancol, joel, jannh, linux-mm, lsf-pc, linux-kernel, kernel-team On 2019/04/11 10:43, Suren Baghdasaryan wrote: > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 3a2484884cfd..6449710c8a06 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -1102,6 +1102,21 @@ bool out_of_memory(struct oom_control *oc) > return !!oc->chosen; > } > > +bool expedite_reclaim(struct task_struct *task) > +{ > + bool res = false; > + > + task_lock(task); > + if (task_will_free_mem(task)) { mark_oom_victim() needs to be called under oom_lock mutex after checking that oom_killer_disabled == false. Since you are trying to trigger this function from signal handler, you might want to defer until e.g. WQ context. > + mark_oom_victim(task); > + wake_oom_reaper(task); > + res = true; > + } > + task_unlock(task); > + > + return res; > +} > + > /* > * The pagefault handler calls here because it is out of memory, so kill a > * memory-hogging task. If oom_lock is held by somebody else, a parallel oom > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 1/2] mm: oom: expose expedite_reclaim to use oom_reaper outside of oom_kill.c 2019-04-25 21:12 ` Tetsuo Handa @ 2019-04-25 21:56 ` Suren Baghdasaryan 0 siblings, 0 replies; 43+ messages in thread From: Suren Baghdasaryan @ 2019-04-25 21:56 UTC (permalink / raw) To: Tetsuo Handa Cc: Andrew Morton, Michal Hocko, David Rientjes, Matthew Wilcox, yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner, Eric W. Biederman, Shakeel Butt, Christian Brauner, Minchan Kim, Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, LKML, kernel-team On Thu, Apr 25, 2019 at 2:13 PM Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote: > > On 2019/04/11 10:43, Suren Baghdasaryan wrote: > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > index 3a2484884cfd..6449710c8a06 100644 > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -1102,6 +1102,21 @@ bool out_of_memory(struct oom_control *oc) > > return !!oc->chosen; > > } > > > > +bool expedite_reclaim(struct task_struct *task) > > +{ > > + bool res = false; > > + > > + task_lock(task); > > + if (task_will_free_mem(task)) { > > mark_oom_victim() needs to be called under oom_lock mutex after > checking that oom_killer_disabled == false. Since you are trying > to trigger this function from signal handler, you might want to > defer until e.g. WQ context. Thanks for the tip! I'll take this into account in the new design. Just thinking out loud... AFAIU oom_lock is there to protect against multiple concurrent out_of_memory calls from different contexts and prevent overly-aggressive process killing. For my purposes when reaping memory of a killed process we don't have this concern (we did not initiate the killing, SIGKILL was explicitly requested). I'll probably need some synchronization there but not for purposes of preventing multiple concurrent reapers. In any case, thank you for the feedback! 
> > > + mark_oom_victim(task); > > + wake_oom_reaper(task); > > + res = true; > > + } > > + task_unlock(task); > > + > > + return res; > > +} > > + > > /* > > * The pagefault handler calls here because it is out of memory, so kill a > > * memory-hogging task. If oom_lock is held by somebody else, a parallel oom > > ^ permalink raw reply [flat|nested] 43+ messages in thread
* [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-11 1:43 [RFC 0/2] opportunistic memory reclaim of a killed process Suren Baghdasaryan 2019-04-11 1:43 ` [RFC 1/2] mm: oom: expose expedite_reclaim to use oom_reaper outside of oom_kill.c Suren Baghdasaryan @ 2019-04-11 1:43 ` Suren Baghdasaryan 2019-04-11 10:30 ` Christian Brauner 2019-04-11 15:33 ` Matthew Wilcox 2019-04-11 10:51 ` [RFC 0/2] opportunistic memory reclaim of a killed process Michal Hocko 2019-04-11 11:51 ` [Lsf-pc] " Rik van Riel 3 siblings, 2 replies; 43+ messages in thread From: Suren Baghdasaryan @ 2019-04-11 1:43 UTC (permalink / raw) To: akpm Cc: mhocko, rientjes, willy, yuzhoujian, jrdr.linux, guro, hannes, penguin-kernel, ebiederm, shakeelb, christian, minchan, timmurray, dancol, joel, jannh, surenb, linux-mm, lsf-pc, linux-kernel, kernel-team Add new SS_EXPEDITE flag to be used when sending SIGKILL via pidfd_send_signal() syscall to allow expedited memory reclaim of the victim process. The usage of this flag is currently limited to SIGKILL signal and only to privileged users. 
Signed-off-by: Suren Baghdasaryan <surenb@google.com> --- include/linux/sched/signal.h | 3 ++- include/linux/signal.h | 11 ++++++++++- ipc/mqueue.c | 2 +- kernel/signal.c | 37 ++++++++++++++++++++++++++++-------- kernel/time/itimer.c | 2 +- 5 files changed, 43 insertions(+), 12 deletions(-) diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index e412c092c1e8..8a227633a058 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -327,7 +327,8 @@ extern int send_sig_info(int, struct kernel_siginfo *, struct task_struct *); extern void force_sigsegv(int sig, struct task_struct *p); extern int force_sig_info(int, struct kernel_siginfo *, struct task_struct *); extern int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp); -extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid); +extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid, + bool expedite); extern int kill_pid_info_as_cred(int, struct kernel_siginfo *, struct pid *, const struct cred *); extern int kill_pgrp(struct pid *pid, int sig, int priv); diff --git a/include/linux/signal.h b/include/linux/signal.h index 9702016734b1..34b7852aa4a0 100644 --- a/include/linux/signal.h +++ b/include/linux/signal.h @@ -446,8 +446,17 @@ int __save_altstack(stack_t __user *, unsigned long); } while (0); #ifdef CONFIG_PROC_FS + +/* + * SS_FLAGS values used in pidfd_send_signal: + * + * SS_EXPEDITE indicates desire to expedite the operation. 
+ */ +#define SS_EXPEDITE 0x00000001 + struct seq_file; extern void render_sigset_t(struct seq_file *, const char *, sigset_t *); -#endif + +#endif /* CONFIG_PROC_FS */ #endif /* _LINUX_SIGNAL_H */ diff --git a/ipc/mqueue.c b/ipc/mqueue.c index aea30530c472..27c66296e08e 100644 --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -720,7 +720,7 @@ static void __do_notify(struct mqueue_inode_info *info) rcu_read_unlock(); kill_pid_info(info->notify.sigev_signo, - &sig_i, info->notify_owner); + &sig_i, info->notify_owner, false); break; case SIGEV_THREAD: set_cookie(info->notify_cookie, NOTIFY_WOKENUP); diff --git a/kernel/signal.c b/kernel/signal.c index f98448cf2def..02ed4332d17c 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -43,6 +43,7 @@ #include <linux/compiler.h> #include <linux/posix-timers.h> #include <linux/livepatch.h> +#include <linux/oom.h> #define CREATE_TRACE_POINTS #include <trace/events/signal.h> @@ -1394,7 +1395,8 @@ int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp) return success ? 0 : retval; } -int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid) +int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid, + bool expedite) { int error = -ESRCH; struct task_struct *p; @@ -1402,8 +1404,17 @@ int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid) for (;;) { rcu_read_lock(); p = pid_task(pid, PIDTYPE_PID); - if (p) + if (p) { error = group_send_sig_info(sig, info, p, PIDTYPE_TGID); + + /* + * Ignore expedite_reclaim return value, it is best + * effort only. 
+ */ + if (!error && expedite) + expedite_reclaim(p); + } + rcu_read_unlock(); if (likely(!p || error != -ESRCH)) return error; @@ -1420,7 +1431,7 @@ static int kill_proc_info(int sig, struct kernel_siginfo *info, pid_t pid) { int error; rcu_read_lock(); - error = kill_pid_info(sig, info, find_vpid(pid)); + error = kill_pid_info(sig, info, find_vpid(pid), false); rcu_read_unlock(); return error; } @@ -1487,7 +1498,7 @@ static int kill_something_info(int sig, struct kernel_siginfo *info, pid_t pid) if (pid > 0) { rcu_read_lock(); - ret = kill_pid_info(sig, info, find_vpid(pid)); + ret = kill_pid_info(sig, info, find_vpid(pid), false); rcu_read_unlock(); return ret; } @@ -1704,7 +1715,7 @@ EXPORT_SYMBOL(kill_pgrp); int kill_pid(struct pid *pid, int sig, int priv) { - return kill_pid_info(sig, __si_special(priv), pid); + return kill_pid_info(sig, __si_special(priv), pid, false); } EXPORT_SYMBOL(kill_pid); @@ -3577,10 +3588,20 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig, struct pid *pid; kernel_siginfo_t kinfo; - /* Enforce flags be set to 0 until we add an extension. */ - if (flags) + /* Enforce no unknown flags. */ + if (flags & ~SS_EXPEDITE) return -EINVAL; + if (flags & SS_EXPEDITE) { + /* Enforce SS_EXPEDITE to be used with SIGKILL only. */ + if (sig != SIGKILL) + return -EINVAL; + + /* Limit expedited killing to privileged users only. 
*/ + if (!capable(CAP_SYS_NICE)) + return -EPERM; + } + f = fdget_raw(pidfd); if (!f.file) return -EBADF; @@ -3614,7 +3635,7 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig, prepare_kill_siginfo(sig, &kinfo); } - ret = kill_pid_info(sig, &kinfo, pid); + ret = kill_pid_info(sig, &kinfo, pid, (flags & SS_EXPEDITE) != 0); err: fdput(f); diff --git a/kernel/time/itimer.c b/kernel/time/itimer.c index 02068b2d5862..c926483cdb53 100644 --- a/kernel/time/itimer.c +++ b/kernel/time/itimer.c @@ -140,7 +140,7 @@ enum hrtimer_restart it_real_fn(struct hrtimer *timer) struct pid *leader_pid = sig->pids[PIDTYPE_TGID]; trace_itimer_expire(ITIMER_REAL, leader_pid, 0); - kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid); + kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid, false); return HRTIMER_NORESTART; } -- 2.21.0.392.gf8f6787159e-goog ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-11 1:43 ` [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing Suren Baghdasaryan @ 2019-04-11 10:30 ` Christian Brauner 2019-04-11 10:34 ` Christian Brauner 2019-04-11 15:18 ` Suren Baghdasaryan 2019-04-11 15:33 ` Matthew Wilcox 1 sibling, 2 replies; 43+ messages in thread From: Christian Brauner @ 2019-04-11 10:30 UTC (permalink / raw) To: Suren Baghdasaryan Cc: akpm, mhocko, rientjes, willy, yuzhoujian, jrdr.linux, guro, hannes, penguin-kernel, ebiederm, shakeelb, minchan, timmurray, dancol, joel, jannh, linux-mm, lsf-pc, linux-kernel, kernel-team On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > pidfd_send_signal() syscall to allow expedited memory reclaim of the > victim process. The usage of this flag is currently limited to SIGKILL > signal and only to privileged users. > > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > --- > include/linux/sched/signal.h | 3 ++- > include/linux/signal.h | 11 ++++++++++- > ipc/mqueue.c | 2 +- > kernel/signal.c | 37 ++++++++++++++++++++++++++++-------- > kernel/time/itimer.c | 2 +- > 5 files changed, 43 insertions(+), 12 deletions(-) > > diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h > index e412c092c1e8..8a227633a058 100644 > --- a/include/linux/sched/signal.h > +++ b/include/linux/sched/signal.h > @@ -327,7 +327,8 @@ extern int send_sig_info(int, struct kernel_siginfo *, struct task_struct *); > extern void force_sigsegv(int sig, struct task_struct *p); > extern int force_sig_info(int, struct kernel_siginfo *, struct task_struct *); > extern int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp); > -extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid); > +extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid, > + bool 
expedite); > extern int kill_pid_info_as_cred(int, struct kernel_siginfo *, struct pid *, > const struct cred *); > extern int kill_pgrp(struct pid *pid, int sig, int priv); > diff --git a/include/linux/signal.h b/include/linux/signal.h > index 9702016734b1..34b7852aa4a0 100644 > --- a/include/linux/signal.h > +++ b/include/linux/signal.h > @@ -446,8 +446,17 @@ int __save_altstack(stack_t __user *, unsigned long); > } while (0); > > #ifdef CONFIG_PROC_FS > + > +/* > + * SS_FLAGS values used in pidfd_send_signal: > + * > + * SS_EXPEDITE indicates desire to expedite the operation. > + */ > +#define SS_EXPEDITE 0x00000001 Does this make sense as an SS_* flag? How does this relate to the signal stack? Is there any intention to ever use this flag with stack_t? New flags should be PIDFD_SIGNAL_*. (E.g. the thread flag will be PIDFD_SIGNAL_THREAD.) And since this is exposed to userspace in contrast to the mm internal naming it should be something more easily understandable like PIDFD_SIGNAL_MM_RECLAIM{_FASTER} or something. 
> + > struct seq_file; > extern void render_sigset_t(struct seq_file *, const char *, sigset_t *); > -#endif > + > +#endif /* CONFIG_PROC_FS */ > > #endif /* _LINUX_SIGNAL_H */ > diff --git a/ipc/mqueue.c b/ipc/mqueue.c > index aea30530c472..27c66296e08e 100644 > --- a/ipc/mqueue.c > +++ b/ipc/mqueue.c > @@ -720,7 +720,7 @@ static void __do_notify(struct mqueue_inode_info *info) > rcu_read_unlock(); > > kill_pid_info(info->notify.sigev_signo, > - &sig_i, info->notify_owner); > + &sig_i, info->notify_owner, false); > break; > case SIGEV_THREAD: > set_cookie(info->notify_cookie, NOTIFY_WOKENUP); > diff --git a/kernel/signal.c b/kernel/signal.c > index f98448cf2def..02ed4332d17c 100644 > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -43,6 +43,7 @@ > #include <linux/compiler.h> > #include <linux/posix-timers.h> > #include <linux/livepatch.h> > +#include <linux/oom.h> > > #define CREATE_TRACE_POINTS > #include <trace/events/signal.h> > @@ -1394,7 +1395,8 @@ int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp) > return success ? 0 : retval; > } > > -int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid) > +int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid, > + bool expedite) > { > int error = -ESRCH; > struct task_struct *p; > @@ -1402,8 +1404,17 @@ int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid) > for (;;) { > rcu_read_lock(); > p = pid_task(pid, PIDTYPE_PID); > - if (p) > + if (p) { > error = group_send_sig_info(sig, info, p, PIDTYPE_TGID); > + > + /* > + * Ignore expedite_reclaim return value, it is best > + * effort only. > + */ > + if (!error && expedite) > + expedite_reclaim(p); SIGKILL will take the whole thread group down so the reclaim should make sense here. 
> + } > + > rcu_read_unlock(); > if (likely(!p || error != -ESRCH)) > return error; > @@ -1420,7 +1431,7 @@ static int kill_proc_info(int sig, struct kernel_siginfo *info, pid_t pid) > { > int error; > rcu_read_lock(); > - error = kill_pid_info(sig, info, find_vpid(pid)); > + error = kill_pid_info(sig, info, find_vpid(pid), false); > rcu_read_unlock(); > return error; > } > @@ -1487,7 +1498,7 @@ static int kill_something_info(int sig, struct kernel_siginfo *info, pid_t pid) > > if (pid > 0) { > rcu_read_lock(); > - ret = kill_pid_info(sig, info, find_vpid(pid)); > + ret = kill_pid_info(sig, info, find_vpid(pid), false); > rcu_read_unlock(); > return ret; > } > @@ -1704,7 +1715,7 @@ EXPORT_SYMBOL(kill_pgrp); > > int kill_pid(struct pid *pid, int sig, int priv) > { > - return kill_pid_info(sig, __si_special(priv), pid); > + return kill_pid_info(sig, __si_special(priv), pid, false); > } > EXPORT_SYMBOL(kill_pid); > > @@ -3577,10 +3588,20 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig, > struct pid *pid; > kernel_siginfo_t kinfo; > > - /* Enforce flags be set to 0 until we add an extension. */ > - if (flags) > + /* Enforce no unknown flags. */ > + if (flags & ~SS_EXPEDITE) > return -EINVAL; > > + if (flags & SS_EXPEDITE) { > + /* Enforce SS_EXPEDITE to be used with SIGKILL only. */ > + if (sig != SIGKILL) > + return -EINVAL; Not super fond of this being a SIGKILL specific flag but I get why. > + > + /* Limit expedited killing to privileged users only. */ > + if (!capable(CAP_SYS_NICE)) > + return -EPERM; Do you have a specific (DOS or other) attack vector in mind that renders ns_capable unsuitable? 
> + } > + > f = fdget_raw(pidfd); > if (!f.file) > return -EBADF; > @@ -3614,7 +3635,7 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig, > prepare_kill_siginfo(sig, &kinfo); > } > > - ret = kill_pid_info(sig, &kinfo, pid); > + ret = kill_pid_info(sig, &kinfo, pid, (flags & SS_EXPEDITE) != 0); > > err: > fdput(f); > diff --git a/kernel/time/itimer.c b/kernel/time/itimer.c > index 02068b2d5862..c926483cdb53 100644 > --- a/kernel/time/itimer.c > +++ b/kernel/time/itimer.c > @@ -140,7 +140,7 @@ enum hrtimer_restart it_real_fn(struct hrtimer *timer) > struct pid *leader_pid = sig->pids[PIDTYPE_TGID]; > > trace_itimer_expire(ITIMER_REAL, leader_pid, 0); > - kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid); > + kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid, false); > > return HRTIMER_NORESTART; > } > -- > 2.21.0.392.gf8f6787159e-goog > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-11 10:30 ` Christian Brauner @ 2019-04-11 10:34 ` Christian Brauner 2019-04-11 15:18 ` Suren Baghdasaryan 1 sibling, 0 replies; 43+ messages in thread From: Christian Brauner @ 2019-04-11 10:34 UTC (permalink / raw) To: Suren Baghdasaryan Cc: akpm, mhocko, rientjes, willy, yuzhoujian, jrdr.linux, guro, hannes, penguin-kernel, ebiederm, shakeelb, minchan, timmurray, dancol, joel, jannh, linux-mm, lsf-pc, linux-kernel, kernel-team, oleg Cc: Oleg too On Thu, Apr 11, 2019 at 12:30:18PM +0200, Christian Brauner wrote: > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > victim process. The usage of this flag is currently limited to SIGKILL > > signal and only to privileged users. > > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > > --- > > include/linux/sched/signal.h | 3 ++- > > include/linux/signal.h | 11 ++++++++++- > > ipc/mqueue.c | 2 +- > > kernel/signal.c | 37 ++++++++++++++++++++++++++++-------- > > kernel/time/itimer.c | 2 +- > > 5 files changed, 43 insertions(+), 12 deletions(-) > > > > diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h > > index e412c092c1e8..8a227633a058 100644 > > --- a/include/linux/sched/signal.h > > +++ b/include/linux/sched/signal.h > > @@ -327,7 +327,8 @@ extern int send_sig_info(int, struct kernel_siginfo *, struct task_struct *); > > extern void force_sigsegv(int sig, struct task_struct *p); > > extern int force_sig_info(int, struct kernel_siginfo *, struct task_struct *); > > extern int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp); > > -extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid); > > +extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid, > > + bool expedite); > 
> extern int kill_pid_info_as_cred(int, struct kernel_siginfo *, struct pid *, > > const struct cred *); > > extern int kill_pgrp(struct pid *pid, int sig, int priv); > > diff --git a/include/linux/signal.h b/include/linux/signal.h > > index 9702016734b1..34b7852aa4a0 100644 > > --- a/include/linux/signal.h > > +++ b/include/linux/signal.h > > @@ -446,8 +446,17 @@ int __save_altstack(stack_t __user *, unsigned long); > > } while (0); > > > > #ifdef CONFIG_PROC_FS > > + > > +/* > > + * SS_FLAGS values used in pidfd_send_signal: > > + * > > + * SS_EXPEDITE indicates desire to expedite the operation. > > + */ > > +#define SS_EXPEDITE 0x00000001 > > Does this make sense as an SS_* flag? > How does this relate to the signal stack? > Is there any intention to ever use this flag with stack_t? > > New flags should be PIDFD_SIGNAL_*. (E.g. the thread flag will be > PIDFD_SIGNAL_THREAD.) > And since this is exposed to userspace in contrast to the mm internal > naming it should be something more easily understandable like > PIDFD_SIGNAL_MM_RECLAIM{_FASTER} or something. 
> > > + > > struct seq_file; > > extern void render_sigset_t(struct seq_file *, const char *, sigset_t *); > > -#endif > > + > > +#endif /* CONFIG_PROC_FS */ > > > > #endif /* _LINUX_SIGNAL_H */ > > diff --git a/ipc/mqueue.c b/ipc/mqueue.c > > index aea30530c472..27c66296e08e 100644 > > --- a/ipc/mqueue.c > > +++ b/ipc/mqueue.c > > @@ -720,7 +720,7 @@ static void __do_notify(struct mqueue_inode_info *info) > > rcu_read_unlock(); > > > > kill_pid_info(info->notify.sigev_signo, > > - &sig_i, info->notify_owner); > > + &sig_i, info->notify_owner, false); > > break; > > case SIGEV_THREAD: > > set_cookie(info->notify_cookie, NOTIFY_WOKENUP); > > diff --git a/kernel/signal.c b/kernel/signal.c > > index f98448cf2def..02ed4332d17c 100644 > > --- a/kernel/signal.c > > +++ b/kernel/signal.c > > @@ -43,6 +43,7 @@ > > #include <linux/compiler.h> > > #include <linux/posix-timers.h> > > #include <linux/livepatch.h> > > +#include <linux/oom.h> > > > > #define CREATE_TRACE_POINTS > > #include <trace/events/signal.h> > > @@ -1394,7 +1395,8 @@ int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp) > > return success ? 0 : retval; > > } > > > > -int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid) > > +int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid, > > + bool expedite) > > { > > int error = -ESRCH; > > struct task_struct *p; > > @@ -1402,8 +1404,17 @@ int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid) > > for (;;) { > > rcu_read_lock(); > > p = pid_task(pid, PIDTYPE_PID); > > - if (p) > > + if (p) { > > error = group_send_sig_info(sig, info, p, PIDTYPE_TGID); > > + > > + /* > > + * Ignore expedite_reclaim return value, it is best > > + * effort only. > > + */ > > + if (!error && expedite) > > + expedite_reclaim(p); > > SIGKILL will take the whole thread group down so the reclaim should make > sense here. 
> > > + } > > + > > rcu_read_unlock(); > > if (likely(!p || error != -ESRCH)) > > return error; > > @@ -1420,7 +1431,7 @@ static int kill_proc_info(int sig, struct kernel_siginfo *info, pid_t pid) > > { > > int error; > > rcu_read_lock(); > > - error = kill_pid_info(sig, info, find_vpid(pid)); > > + error = kill_pid_info(sig, info, find_vpid(pid), false); > > rcu_read_unlock(); > > return error; > > } > > @@ -1487,7 +1498,7 @@ static int kill_something_info(int sig, struct kernel_siginfo *info, pid_t pid) > > > > if (pid > 0) { > > rcu_read_lock(); > > - ret = kill_pid_info(sig, info, find_vpid(pid)); > > + ret = kill_pid_info(sig, info, find_vpid(pid), false); > > rcu_read_unlock(); > > return ret; > > } > > @@ -1704,7 +1715,7 @@ EXPORT_SYMBOL(kill_pgrp); > > > > int kill_pid(struct pid *pid, int sig, int priv) > > { > > - return kill_pid_info(sig, __si_special(priv), pid); > > + return kill_pid_info(sig, __si_special(priv), pid, false); > > } > > EXPORT_SYMBOL(kill_pid); > > > > @@ -3577,10 +3588,20 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig, > > struct pid *pid; > > kernel_siginfo_t kinfo; > > > > - /* Enforce flags be set to 0 until we add an extension. */ > > - if (flags) > > + /* Enforce no unknown flags. */ > > + if (flags & ~SS_EXPEDITE) > > return -EINVAL; > > > > + if (flags & SS_EXPEDITE) { > > + /* Enforce SS_EXPEDITE to be used with SIGKILL only. */ > > + if (sig != SIGKILL) > > + return -EINVAL; > > Not super fond of this being a SIGKILL specific flag but I get why. > > > + > > + /* Limit expedited killing to privileged users only. */ > > + if (!capable(CAP_SYS_NICE)) > > + return -EPERM; > > Do you have a specific (DOS or other) attack vector in mind that renders > ns_capable unsuitable? 
> > > + } > > + > > f = fdget_raw(pidfd); > > if (!f.file) > > return -EBADF; > > @@ -3614,7 +3635,7 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig, > > prepare_kill_siginfo(sig, &kinfo); > > } > > > > - ret = kill_pid_info(sig, &kinfo, pid); > > + ret = kill_pid_info(sig, &kinfo, pid, (flags & SS_EXPEDITE) != 0); > > > > err: > > fdput(f); > > diff --git a/kernel/time/itimer.c b/kernel/time/itimer.c > > index 02068b2d5862..c926483cdb53 100644 > > --- a/kernel/time/itimer.c > > +++ b/kernel/time/itimer.c > > @@ -140,7 +140,7 @@ enum hrtimer_restart it_real_fn(struct hrtimer *timer) > > struct pid *leader_pid = sig->pids[PIDTYPE_TGID]; > > > > trace_itimer_expire(ITIMER_REAL, leader_pid, 0); > > - kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid); > > + kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid, false); > > > > return HRTIMER_NORESTART; > > } > > -- > > 2.21.0.392.gf8f6787159e-goog > > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-11 10:30 ` Christian Brauner 2019-04-11 10:34 ` Christian Brauner @ 2019-04-11 15:18 ` Suren Baghdasaryan 2019-04-11 15:23 ` Suren Baghdasaryan 1 sibling, 1 reply; 43+ messages in thread From: Suren Baghdasaryan @ 2019-04-11 15:18 UTC (permalink / raw) To: Christian Brauner Cc: Andrew Morton, mhocko, David Rientjes, Matthew Wilcox, yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa, ebiederm, Shakeel Butt, Minchan Kim, Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, LKML, kernel-team, Oleg Nesterov Thanks for the feedback! Just to be clear, this implementation is used in this RFC as a reference to explain the intent. To be honest I don't think it will be adopted as is even if the idea survives scrutiny. On Thu, Apr 11, 2019 at 3:30 AM Christian Brauner <christian@brauner.io> wrote: > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > victim process. The usage of this flag is currently limited to SIGKILL > > signal and only to privileged users. 
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > > --- > > include/linux/sched/signal.h | 3 ++- > > include/linux/signal.h | 11 ++++++++++- > > ipc/mqueue.c | 2 +- > > kernel/signal.c | 37 ++++++++++++++++++++++++++++-------- > > kernel/time/itimer.c | 2 +- > > 5 files changed, 43 insertions(+), 12 deletions(-) > > > > diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h > > index e412c092c1e8..8a227633a058 100644 > > --- a/include/linux/sched/signal.h > > +++ b/include/linux/sched/signal.h > > @@ -327,7 +327,8 @@ extern int send_sig_info(int, struct kernel_siginfo *, struct task_struct *); > > extern void force_sigsegv(int sig, struct task_struct *p); > > extern int force_sig_info(int, struct kernel_siginfo *, struct task_struct *); > > extern int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp); > > -extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid); > > +extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid, > > + bool expedite); > > extern int kill_pid_info_as_cred(int, struct kernel_siginfo *, struct pid *, > > const struct cred *); > > extern int kill_pgrp(struct pid *pid, int sig, int priv); > > diff --git a/include/linux/signal.h b/include/linux/signal.h > > index 9702016734b1..34b7852aa4a0 100644 > > --- a/include/linux/signal.h > > +++ b/include/linux/signal.h > > @@ -446,8 +446,17 @@ int __save_altstack(stack_t __user *, unsigned long); > > } while (0); > > > > #ifdef CONFIG_PROC_FS > > + > > +/* > > + * SS_FLAGS values used in pidfd_send_signal: > > + * > > + * SS_EXPEDITE indicates desire to expedite the operation. > > + */ > > +#define SS_EXPEDITE 0x00000001 > > Does this make sense as an SS_* flag? > How does this relate to the signal stack? It doesn't, so I agree that the name should be changed. PIDFD_SIGNAL_EXPEDITE_MM_RECLAIM would seem appropriate. > Is there any intention to ever use this flag with stack_t? 
> > New flags should be PIDFD_SIGNAL_*. (E.g. the thread flag will be > PIDFD_SIGNAL_THREAD.) > And since this is exposed to userspace in contrast to the mm internal > naming it should be something more easily understandable like > PIDFD_SIGNAL_MM_RECLAIM{_FASTER} or something. > > > + > > struct seq_file; > > extern void render_sigset_t(struct seq_file *, const char *, sigset_t *); > > -#endif > > + > > +#endif /* CONFIG_PROC_FS */ > > > > #endif /* _LINUX_SIGNAL_H */ > > diff --git a/ipc/mqueue.c b/ipc/mqueue.c > > index aea30530c472..27c66296e08e 100644 > > --- a/ipc/mqueue.c > > +++ b/ipc/mqueue.c > > @@ -720,7 +720,7 @@ static void __do_notify(struct mqueue_inode_info *info) > > rcu_read_unlock(); > > > > kill_pid_info(info->notify.sigev_signo, > > - &sig_i, info->notify_owner); > > + &sig_i, info->notify_owner, false); > > break; > > case SIGEV_THREAD: > > set_cookie(info->notify_cookie, NOTIFY_WOKENUP); > > diff --git a/kernel/signal.c b/kernel/signal.c > > index f98448cf2def..02ed4332d17c 100644 > > --- a/kernel/signal.c > > +++ b/kernel/signal.c > > @@ -43,6 +43,7 @@ > > #include <linux/compiler.h> > > #include <linux/posix-timers.h> > > #include <linux/livepatch.h> > > +#include <linux/oom.h> > > > > #define CREATE_TRACE_POINTS > > #include <trace/events/signal.h> > > @@ -1394,7 +1395,8 @@ int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp) > > return success ? 
0 : retval; > > } > > > > -int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid) > > +int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid, > > + bool expedite) > > { > > int error = -ESRCH; > > struct task_struct *p; > > @@ -1402,8 +1404,17 @@ int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid) > > for (;;) { > > rcu_read_lock(); > > p = pid_task(pid, PIDTYPE_PID); > > - if (p) > > + if (p) { > > error = group_send_sig_info(sig, info, p, PIDTYPE_TGID); > > + > > + /* > > + * Ignore expedite_reclaim return value, it is best > > + * effort only. > > + */ > > + if (!error && expedite) > > + expedite_reclaim(p); > > SIGKILL will take the whole thread group down so the reclaim should make > sense here. > This sounds like confirmation. I hope I'm not missing some flaw that you are trying to point out. > > + } > > + > > rcu_read_unlock(); > > if (likely(!p || error != -ESRCH)) > > return error; > > @@ -1420,7 +1431,7 @@ static int kill_proc_info(int sig, struct kernel_siginfo *info, pid_t pid) > > { > > int error; > > rcu_read_lock(); > > - error = kill_pid_info(sig, info, find_vpid(pid)); > > + error = kill_pid_info(sig, info, find_vpid(pid), false); > > rcu_read_unlock(); > > return error; > > } > > @@ -1487,7 +1498,7 @@ static int kill_something_info(int sig, struct kernel_siginfo *info, pid_t pid) > > > > if (pid > 0) { > > rcu_read_lock(); > > - ret = kill_pid_info(sig, info, find_vpid(pid)); > > + ret = kill_pid_info(sig, info, find_vpid(pid), false); > > rcu_read_unlock(); > > return ret; > > } > > @@ -1704,7 +1715,7 @@ EXPORT_SYMBOL(kill_pgrp); > > > > int kill_pid(struct pid *pid, int sig, int priv) > > { > > - return kill_pid_info(sig, __si_special(priv), pid); > > + return kill_pid_info(sig, __si_special(priv), pid, false); > > } > > EXPORT_SYMBOL(kill_pid); > > > > @@ -3577,10 +3588,20 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig, > > struct pid *pid; > > kernel_siginfo_t kinfo; > > > > 
- /* Enforce flags be set to 0 until we add an extension. */ > > - if (flags) > > + /* Enforce no unknown flags. */ > > + if (flags & ~SS_EXPEDITE) > > return -EINVAL; > > > > + if (flags & SS_EXPEDITE) { > > + /* Enforce SS_EXPEDITE to be used with SIGKILL only. */ > > + if (sig != SIGKILL) > > + return -EINVAL; > > Not super fond of this being a SIGKILL specific flag but I get why. Understood. I was thinking that EXPEDITE flag might make sense for other signals in the future but from internal feedback sounds like if we go this way the flag name should be more specific. > > + > > + /* Limit expedited killing to privileged users only. */ > > + if (!capable(CAP_SYS_NICE)) > > + return -EPERM; > > Do you have a specific (DOS or other) attack vector in mind that renders > ns_capable unsuitable? > > > + } > > + > > f = fdget_raw(pidfd); > > if (!f.file) > > return -EBADF; > > @@ -3614,7 +3635,7 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig, > > prepare_kill_siginfo(sig, &kinfo); > > } > > > > - ret = kill_pid_info(sig, &kinfo, pid); > > + ret = kill_pid_info(sig, &kinfo, pid, (flags & SS_EXPEDITE) != 0); > > > > err: > > fdput(f); > > diff --git a/kernel/time/itimer.c b/kernel/time/itimer.c > > index 02068b2d5862..c926483cdb53 100644 > > --- a/kernel/time/itimer.c > > +++ b/kernel/time/itimer.c > > @@ -140,7 +140,7 @@ enum hrtimer_restart it_real_fn(struct hrtimer *timer) > > struct pid *leader_pid = sig->pids[PIDTYPE_TGID]; > > > > trace_itimer_expire(ITIMER_REAL, leader_pid, 0); > > - kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid); > > + kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid, false); > > > > return HRTIMER_NORESTART; > > } > > -- > > 2.21.0.392.gf8f6787159e-goog > > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-11 15:18 ` Suren Baghdasaryan @ 2019-04-11 15:23 ` Suren Baghdasaryan 2019-04-11 16:25 ` Daniel Colascione 0 siblings, 1 reply; 43+ messages in thread From: Suren Baghdasaryan @ 2019-04-11 15:23 UTC (permalink / raw) To: Christian Brauner Cc: Andrew Morton, mhocko, David Rientjes, Matthew Wilcox, yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa, ebiederm, Shakeel Butt, Minchan Kim, Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, LKML, kernel-team, Oleg Nesterov On Thu, Apr 11, 2019 at 8:18 AM Suren Baghdasaryan <surenb@google.com> wrote: > > Thanks for the feedback! > Just to be clear, this implementation is used in this RFC as a > reference to explain the intent. To be honest I don't think it will be > adopted as is even if the idea survives scrutiny. > > On Thu, Apr 11, 2019 at 3:30 AM Christian Brauner <christian@brauner.io> wrote: > > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > > victim process. The usage of this flag is currently limited to SIGKILL > > > signal and only to privileged users. 
> > > > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > > > --- > > > include/linux/sched/signal.h | 3 ++- > > > include/linux/signal.h | 11 ++++++++++- > > > ipc/mqueue.c | 2 +- > > > kernel/signal.c | 37 ++++++++++++++++++++++++++++-------- > > > kernel/time/itimer.c | 2 +- > > > 5 files changed, 43 insertions(+), 12 deletions(-) > > > > > > diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h > > > index e412c092c1e8..8a227633a058 100644 > > > --- a/include/linux/sched/signal.h > > > +++ b/include/linux/sched/signal.h > > > @@ -327,7 +327,8 @@ extern int send_sig_info(int, struct kernel_siginfo *, struct task_struct *); > > > extern void force_sigsegv(int sig, struct task_struct *p); > > > extern int force_sig_info(int, struct kernel_siginfo *, struct task_struct *); > > > extern int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp); > > > -extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid); > > > +extern int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid, > > > + bool expedite); > > > extern int kill_pid_info_as_cred(int, struct kernel_siginfo *, struct pid *, > > > const struct cred *); > > > extern int kill_pgrp(struct pid *pid, int sig, int priv); > > > diff --git a/include/linux/signal.h b/include/linux/signal.h > > > index 9702016734b1..34b7852aa4a0 100644 > > > --- a/include/linux/signal.h > > > +++ b/include/linux/signal.h > > > @@ -446,8 +446,17 @@ int __save_altstack(stack_t __user *, unsigned long); > > > } while (0); > > > > > > #ifdef CONFIG_PROC_FS > > > + > > > +/* > > > + * SS_FLAGS values used in pidfd_send_signal: > > > + * > > > + * SS_EXPEDITE indicates desire to expedite the operation. > > > + */ > > > +#define SS_EXPEDITE 0x00000001 > > > > Does this make sense as an SS_* flag? > > How does this relate to the signal stack? > > It doesn't, so I agree that the name should be changed. 
> PIDFD_SIGNAL_EXPEDITE_MM_RECLAIM would seem appropriate. > > > Is there any intention to ever use this flag with stack_t? > > > > New flags should be PIDFD_SIGNAL_*. (E.g. the thread flag will be > > PIDFD_SIGNAL_THREAD.) > > And since this is exposed to userspace in contrast to the mm internal > > naming it should be something more easily understandable like > > PIDFD_SIGNAL_MM_RECLAIM{_FASTER} or something. > > > > > + > > > struct seq_file; > > > extern void render_sigset_t(struct seq_file *, const char *, sigset_t *); > > > -#endif > > > + > > > +#endif /* CONFIG_PROC_FS */ > > > > > > #endif /* _LINUX_SIGNAL_H */ > > > diff --git a/ipc/mqueue.c b/ipc/mqueue.c > > > index aea30530c472..27c66296e08e 100644 > > > --- a/ipc/mqueue.c > > > +++ b/ipc/mqueue.c > > > @@ -720,7 +720,7 @@ static void __do_notify(struct mqueue_inode_info *info) > > > rcu_read_unlock(); > > > > > > kill_pid_info(info->notify.sigev_signo, > > > - &sig_i, info->notify_owner); > > > + &sig_i, info->notify_owner, false); > > > break; > > > case SIGEV_THREAD: > > > set_cookie(info->notify_cookie, NOTIFY_WOKENUP); > > > diff --git a/kernel/signal.c b/kernel/signal.c > > > index f98448cf2def..02ed4332d17c 100644 > > > --- a/kernel/signal.c > > > +++ b/kernel/signal.c > > > @@ -43,6 +43,7 @@ > > > #include <linux/compiler.h> > > > #include <linux/posix-timers.h> > > > #include <linux/livepatch.h> > > > +#include <linux/oom.h> > > > > > > #define CREATE_TRACE_POINTS > > > #include <trace/events/signal.h> > > > @@ -1394,7 +1395,8 @@ int __kill_pgrp_info(int sig, struct kernel_siginfo *info, struct pid *pgrp) > > > return success ? 
0 : retval; > > > } > > > > > > -int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid) > > > +int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid, > > > + bool expedite) > > > { > > > int error = -ESRCH; > > > struct task_struct *p; > > > @@ -1402,8 +1404,17 @@ int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid) > > > for (;;) { > > > rcu_read_lock(); > > > p = pid_task(pid, PIDTYPE_PID); > > > - if (p) > > > + if (p) { > > > error = group_send_sig_info(sig, info, p, PIDTYPE_TGID); > > > + > > > + /* > > > + * Ignore expedite_reclaim return value, it is best > > > + * effort only. > > > + */ > > > + if (!error && expedite) > > > + expedite_reclaim(p); > > > > SIGKILL will take the whole thread group down so the reclaim should make > > sense here. > > > > This sounds like confirmation. I hope I'm not missing some flaw that > you are trying to point out. > > > > + } > > > + > > > rcu_read_unlock(); > > > if (likely(!p || error != -ESRCH)) > > > return error; > > > @@ -1420,7 +1431,7 @@ static int kill_proc_info(int sig, struct kernel_siginfo *info, pid_t pid) > > > { > > > int error; > > > rcu_read_lock(); > > > - error = kill_pid_info(sig, info, find_vpid(pid)); > > > + error = kill_pid_info(sig, info, find_vpid(pid), false); > > > rcu_read_unlock(); > > > return error; > > > } > > > @@ -1487,7 +1498,7 @@ static int kill_something_info(int sig, struct kernel_siginfo *info, pid_t pid) > > > > > > if (pid > 0) { > > > rcu_read_lock(); > > > - ret = kill_pid_info(sig, info, find_vpid(pid)); > > > + ret = kill_pid_info(sig, info, find_vpid(pid), false); > > > rcu_read_unlock(); > > > return ret; > > > } > > > @@ -1704,7 +1715,7 @@ EXPORT_SYMBOL(kill_pgrp); > > > > > > int kill_pid(struct pid *pid, int sig, int priv) > > > { > > > - return kill_pid_info(sig, __si_special(priv), pid); > > > + return kill_pid_info(sig, __si_special(priv), pid, false); > > > } > > > EXPORT_SYMBOL(kill_pid); > > > > > > @@ -3577,10 
+3588,20 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig, > > > struct pid *pid; > > > kernel_siginfo_t kinfo; > > > > > > - /* Enforce flags be set to 0 until we add an extension. */ > > > - if (flags) > > > + /* Enforce no unknown flags. */ > > > + if (flags & ~SS_EXPEDITE) > > > return -EINVAL; > > > > > > + if (flags & SS_EXPEDITE) { > > > + /* Enforce SS_EXPEDITE to be used with SIGKILL only. */ > > > + if (sig != SIGKILL) > > > + return -EINVAL; > > > > Not super fond of this being a SIGKILL specific flag but I get why. > > Understood. I was thinking that EXPEDITE flag might make sense for > other signals in the future but from internal feedback sounds like if > we go this way the flag name should be more specific. > > > > + > > > + /* Limit expedited killing to privileged users only. */ > > > + if (!capable(CAP_SYS_NICE)) > > > + return -EPERM; > > > > Do you have a specific (DOS or other) attack vector in mind that renders > > ns_capable unsuitable? > > Missed this one. I was thinking of oom-reaper thread as a limited system resource (one thread which maintains a kill list and reaps process mms one at a time) and therefore should be protected from abuse. 
> > > + } > > > + > > > f = fdget_raw(pidfd); > > > if (!f.file) > > > return -EBADF; > > > @@ -3614,7 +3635,7 @@ SYSCALL_DEFINE4(pidfd_send_signal, int, pidfd, int, sig, > > > prepare_kill_siginfo(sig, &kinfo); > > > } > > > > > > - ret = kill_pid_info(sig, &kinfo, pid); > > > + ret = kill_pid_info(sig, &kinfo, pid, (flags & SS_EXPEDITE) != 0); > > > > > > err: > > > fdput(f); > > > diff --git a/kernel/time/itimer.c b/kernel/time/itimer.c > > > index 02068b2d5862..c926483cdb53 100644 > > > --- a/kernel/time/itimer.c > > > +++ b/kernel/time/itimer.c > > > @@ -140,7 +140,7 @@ enum hrtimer_restart it_real_fn(struct hrtimer *timer) > > > struct pid *leader_pid = sig->pids[PIDTYPE_TGID]; > > > > > > trace_itimer_expire(ITIMER_REAL, leader_pid, 0); > > > - kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid); > > > + kill_pid_info(SIGALRM, SEND_SIG_PRIV, leader_pid, false); > > > > > > return HRTIMER_NORESTART; > > > } > > > -- > > > 2.21.0.392.gf8f6787159e-goog > > > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-11 15:23 ` Suren Baghdasaryan @ 2019-04-11 16:25 ` Daniel Colascione 0 siblings, 0 replies; 43+ messages in thread From: Daniel Colascione @ 2019-04-11 16:25 UTC (permalink / raw) To: Suren Baghdasaryan Cc: Christian Brauner, Andrew Morton, Michal Hocko, David Rientjes, Matthew Wilcox, yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Minchan Kim, Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, LKML, kernel-team, Oleg Nesterov On Thu, Apr 11, 2019 at 8:23 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > > > victim process. The usage of this flag is currently limited to SIGKILL > > > > signal and only to privileged users.

FWIW, I like Suren's general idea, but I was thinking of a different way of exposing the same general functionality to userspace. The way I look at it, it's very useful for an auto-balancing memory system like Android (or, I presume, something that uses oomd) to recover memory *immediately* after a SIGKILL instead of waiting for the process to exit on its own: a process's death can be delayed for a long time due to factors like scheduling and being blocked in various uninterruptible kernel-side operations. Suren's proposal is basically about pulling forward in time page reclamation that would happen anyway. What if we let userspace control exactly when this reclamation happens?

I'm imagining a new* kernel facility that basically looks like this. It lets lmkd determine for itself how much work the system should expend on reclaiming memory from dying processes.
size_t try_reap_dying_process(
        int pidfd,
        int flags /* must be zero */,
        size_t maximum_bytes_to_reap);

Precondition: process is pending group-exit (someone already sent it SIGKILL)
Postcondition: some memory reclaimed from dying process
Invariant: doesn't sleep; stops reaping after MAXIMUM_BYTES_TO_REAP

-> success: return number of bytes reaped
-> failure: (size_t)-1
   EBUSY: couldn't get mmap_sem
   EINVAL: PIDFD isn't a pidfd or otherwise invalid arguments
   EPERM: process hasn't been sent SIGKILL: try_reap_dying_process on a process that isn't dying is illegal

Kernel-side, try_reap_dying_process would try-acquire mmap_sem and just fail if it couldn't get it. Once acquired, it would release "easy" pages (using the same check the oom reaper uses) until it either ran out of pages or hit the MAXIMUM_BYTES_TO_REAP cap. The purpose of MAXIMUM_BYTES_TO_REAP is to let userspace bound-above the amount of time we spend reclaiming pages. It'd be up to userspace to set policy on retries, the actual value of the reap cap, the priority at which we run TRY_REAP_DYING_PROCESS, and so on. We return the number of bytes we managed to free this way so that lmkd can make an informed decision about what to do next, e.g., kill something else or wait a little while.

Personally, I like this approach a bit more than recruiting the oom reaper thread because it doesn't affect any kind of emergency memory reserve permission and because it frees us from having to think about whether the oom reaper's thread priority is right for this particular job.

It also occurred to me that try_reap_dying_process might make a decent shrinker callback. Shrinkers are there, AIUI, to reclaim memory that's easy to free and that's not essential for correct kernel operation. Usually, it's some kind of cache that meets these criteria. But the private pages of a dying process also meet the criteria, don't they?
I'm imagining the shrinker just picking an arbitrary doomed (dying but not yet dead) process and freeing some of its pages. I know there are concerns about slow shrinkers causing havoc throughout the system, but since this shrinker would be bounded above on CPU time and would never block, I feel like it'd be pretty safe.

* insert standard missive about system calls being cheap, but we can talk about the way in which we expose this functionality after we agree that it's a good idea generally
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-11 1:43 ` [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing Suren Baghdasaryan 2019-04-11 10:30 ` Christian Brauner @ 2019-04-11 15:33 ` Matthew Wilcox 2019-04-11 17:05 ` Johannes Weiner ` (2 more replies) 1 sibling, 3 replies; 43+ messages in thread From: Matthew Wilcox @ 2019-04-11 15:33 UTC (permalink / raw) To: Suren Baghdasaryan Cc: akpm, mhocko, rientjes, yuzhoujian, jrdr.linux, guro, hannes, penguin-kernel, ebiederm, shakeelb, christian, minchan, timmurray, dancol, joel, jannh, linux-mm, lsf-pc, linux-kernel, kernel-team On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > pidfd_send_signal() syscall to allow expedited memory reclaim of the > victim process. The usage of this flag is currently limited to SIGKILL > signal and only to privileged users. What is the downside of doing expedited memory reclaim? ie why not do it every time a process is going to die?
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-11 15:33 ` Matthew Wilcox @ 2019-04-11 17:05 ` Johannes Weiner 2019-04-11 17:09 ` Suren Baghdasaryan 2019-04-12 6:53 ` Michal Hocko 2 siblings, 0 replies; 43+ messages in thread From: Johannes Weiner @ 2019-04-11 17:05 UTC (permalink / raw) To: Matthew Wilcox Cc: Suren Baghdasaryan, akpm, mhocko, rientjes, yuzhoujian, jrdr.linux, guro, penguin-kernel, ebiederm, shakeelb, christian, minchan, timmurray, dancol, joel, jannh, linux-mm, lsf-pc, linux-kernel, kernel-team On Thu, Apr 11, 2019 at 08:33:13AM -0700, Matthew Wilcox wrote: > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > victim process. The usage of this flag is currently limited to SIGKILL > > signal and only to privileged users. > > What is the downside of doing expedited memory reclaim? ie why not do it > every time a process is going to die? I agree with this. The oom reaper mostly does work the exiting task would have to do anyway, so this shouldn't be measurably more expensive to do by default. I wouldn't want to add a new interface unless we really do need that type of control.
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-11 15:33 ` Matthew Wilcox 2019-04-11 17:05 ` Johannes Weiner @ 2019-04-11 17:09 ` Suren Baghdasaryan 2019-04-11 17:33 ` Daniel Colascione 2019-04-11 21:45 ` Roman Gushchin 2019-04-12 6:53 ` Michal Hocko 2 siblings, 2 replies; 43+ messages in thread From: Suren Baghdasaryan @ 2019-04-11 17:09 UTC (permalink / raw) To: Matthew Wilcox Cc: Andrew Morton, mhocko, David Rientjes, yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa, ebiederm, Shakeel Butt, Christian Brauner, Minchan Kim, Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, LKML, kernel-team On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > victim process. The usage of this flag is currently limited to SIGKILL > > signal and only to privileged users. > > What is the downside of doing expedited memory reclaim? ie why not do it > every time a process is going to die? I think with an implementation that does not use/abuse oom-reaper thread this could be done for any kill. As I mentioned oom-reaper is a limited resource which has access to memory reserves and should not be abused in the way I do in this reference implementation. While there might be downsides that I don't know of, I'm not sure it's required to hurry every kill's memory reclaim. I think there are cases when resource deallocation is critical, for example when we kill to relieve resource shortage and there are kills when reclaim speed is not essential. It would be great if we can identify urgent cases without userspace hints, so I'm open to suggestions that do not involve additional flags. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-11 17:09 ` Suren Baghdasaryan @ 2019-04-11 17:33 ` Daniel Colascione 2019-04-11 17:36 ` Matthew Wilcox 2019-04-11 21:45 ` Roman Gushchin 1 sibling, 1 reply; 43+ messages in thread From: Daniel Colascione @ 2019-04-11 17:33 UTC (permalink / raw) To: Suren Baghdasaryan Cc: Matthew Wilcox, Andrew Morton, Michal Hocko, David Rientjes, yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Christian Brauner, Minchan Kim, Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, LKML, kernel-team On Thu, Apr 11, 2019 at 10:09 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > > victim process. The usage of this flag is currently limited to SIGKILL > > > signal and only to privileged users. > > > > What is the downside of doing expedited memory reclaim? ie why not do it > > every time a process is going to die? > > I think with an implementation that does not use/abuse oom-reaper > thread this could be done for any kill. As I mentioned oom-reaper is a > limited resource which has access to memory reserves and should not be > abused in the way I do in this reference implementation. > While there might be downsides that I don't know of, I'm not sure it's > required to hurry every kill's memory reclaim. I think there are cases > when resource deallocation is critical, for example when we kill to > relieve resource shortage and there are kills when reclaim speed is > not essential. 
It would be great if we can identify urgent cases > without userspace hints, so I'm open to suggestions that do not > involve additional flags. I was imagining a PI-ish approach where we'd reap in case an RT process was waiting on the death of some other process. I'd still prefer the API I proposed in the other message because it gets the kernel out of the business of deciding what the right signal is. I'm a huge believer in "mechanism, not policy".
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-11 17:33 ` Daniel Colascione @ 2019-04-11 17:36 ` Matthew Wilcox 2019-04-11 17:47 ` Daniel Colascione 2019-04-11 17:52 ` Suren Baghdasaryan 0 siblings, 2 replies; 43+ messages in thread From: Matthew Wilcox @ 2019-04-11 17:36 UTC (permalink / raw) To: Daniel Colascione Cc: Suren Baghdasaryan, Andrew Morton, Michal Hocko, David Rientjes, yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Christian Brauner, Minchan Kim, Tim Murray, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, LKML, kernel-team On Thu, Apr 11, 2019 at 10:33:32AM -0700, Daniel Colascione wrote: > On Thu, Apr 11, 2019 at 10:09 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > > > victim process. The usage of this flag is currently limited to SIGKILL > > > > signal and only to privileged users. > > > > > > What is the downside of doing expedited memory reclaim? ie why not do it > > > every time a process is going to die? > > > > I think with an implementation that does not use/abuse oom-reaper > > thread this could be done for any kill. As I mentioned oom-reaper is a > > limited resource which has access to memory reserves and should not be > > abused in the way I do in this reference implementation. > > While there might be downsides that I don't know of, I'm not sure it's > > required to hurry every kill's memory reclaim. I think there are cases > > when resource deallocation is critical, for example when we kill to > > relieve resource shortage and there are kills when reclaim speed is > > not essential. 
It would be great if we can identify urgent cases > > without userspace hints, so I'm open to suggestions that do not > > involve additional flags. > > I was imagining a PI-ish approach where we'd reap in case an RT > process was waiting on the death of some other process. I'd still > prefer the API I proposed in the other message because it gets the > kernel out of the business of deciding what the right signal is. I'm a > huge believer in "mechanism, not policy". It's not a question of the kernel deciding what the right signal is. The kernel knows whether a signal is fatal to a particular process or not. The question is whether the killing process should do the work of reaping the dying process's resources sometimes, always or never. Currently, that is never (the process reaps its own resources); Suren is suggesting sometimes, and I'm asking "Why not always?"
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-11 17:36 ` Matthew Wilcox @ 2019-04-11 17:47 ` Daniel Colascione 2019-04-12 6:49 ` Michal Hocko 2019-04-12 21:03 ` Matthew Wilcox 2019-04-11 17:52 ` Suren Baghdasaryan 1 sibling, 2 replies; 43+ messages in thread From: Daniel Colascione @ 2019-04-11 17:47 UTC (permalink / raw) To: Matthew Wilcox Cc: Suren Baghdasaryan, Andrew Morton, Michal Hocko, David Rientjes, yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Christian Brauner, Minchan Kim, Tim Murray, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, LKML, kernel-team On Thu, Apr 11, 2019 at 10:36 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, Apr 11, 2019 at 10:33:32AM -0700, Daniel Colascione wrote: > > On Thu, Apr 11, 2019 at 10:09 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > > > > victim process. The usage of this flag is currently limited to SIGKILL > > > > > signal and only to privileged users. > > > > > > > > What is the downside of doing expedited memory reclaim? ie why not do it > > > > every time a process is going to die? > > > > > > I think with an implementation that does not use/abuse oom-reaper > > > thread this could be done for any kill. As I mentioned oom-reaper is a > > > limited resource which has access to memory reserves and should not be > > > abused in the way I do in this reference implementation. > > > While there might be downsides that I don't know of, I'm not sure it's > > > required to hurry every kill's memory reclaim. 
I think there are cases > > > when resource deallocation is critical, for example when we kill to > > > relieve resource shortage and there are kills when reclaim speed is > > > not essential. It would be great if we can identify urgent cases > > > without userspace hints, so I'm open to suggestions that do not > > > involve additional flags. > > > > I was imagining a PI-ish approach where we'd reap in case an RT > > process was waiting on the death of some other process. I'd still > > prefer the API I proposed in the other message because it gets the > > kernel out of the business of deciding what the right signal is. I'm a > > huge believer in "mechanism, not policy". > > It's not a question of the kernel deciding what the right signal is. > The kernel knows whether a signal is fatal to a particular process or not. > The question is whether the killing process should do the work of reaping > the dying process's resources sometimes, always or never. Currently, > that is never (the process reaps its own resources); Suren is suggesting > sometimes, and I'm asking "Why not always?" FWIW, Suren's initial proposal is that the oom_reaper kthread do the reaping, not the process sending the kill. Are you suggesting that sending SIGKILL should spend a while in signal delivery reaping pages before returning? I thought about just doing it this way, but I didn't like the idea: it'd slow down mass-killing programs like killall(1). Programs expect sending SIGKILL to be a fast operation that returns immediately. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-11 17:47 ` Daniel Colascione @ 2019-04-12 6:49 ` Michal Hocko 2019-04-12 14:15 ` Suren Baghdasaryan 2019-04-12 21:03 ` Matthew Wilcox 1 sibling, 1 reply; 43+ messages in thread From: Michal Hocko @ 2019-04-12 6:49 UTC (permalink / raw) To: Daniel Colascione Cc: Matthew Wilcox, Suren Baghdasaryan, Andrew Morton, David Rientjes, yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Christian Brauner, Minchan Kim, Tim Murray, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, LKML, kernel-team On Thu 11-04-19 10:47:50, Daniel Colascione wrote: > On Thu, Apr 11, 2019 at 10:36 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Thu, Apr 11, 2019 at 10:33:32AM -0700, Daniel Colascione wrote: > > > On Thu, Apr 11, 2019 at 10:09 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > > > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > > > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > > > > > victim process. The usage of this flag is currently limited to SIGKILL > > > > > > signal and only to privileged users. > > > > > > > > > > What is the downside of doing expedited memory reclaim? ie why not do it > > > > > every time a process is going to die? > > > > > > > > I think with an implementation that does not use/abuse oom-reaper > > > > thread this could be done for any kill. As I mentioned oom-reaper is a > > > > limited resource which has access to memory reserves and should not be > > > > abused in the way I do in this reference implementation. > > > > While there might be downsides that I don't know of, I'm not sure it's > > > > required to hurry every kill's memory reclaim. 
I think there are cases > > > > when resource deallocation is critical, for example when we kill to > > > > relieve resource shortage and there are kills when reclaim speed is > > > > not essential. It would be great if we can identify urgent cases > > > > without userspace hints, so I'm open to suggestions that do not > > > > involve additional flags. > > > > > > I was imagining a PI-ish approach where we'd reap in case an RT > > > process was waiting on the death of some other process. I'd still > > > prefer the API I proposed in the other message because it gets the > > > kernel out of the business of deciding what the right signal is. I'm a > > > huge believer in "mechanism, not policy". > > > > It's not a question of the kernel deciding what the right signal is. > > The kernel knows whether a signal is fatal to a particular process or not. > > The question is whether the killing process should do the work of reaping > > the dying process's resources sometimes, always or never. Currently, > > that is never (the process reaps its own resources); Suren is suggesting > > sometimes, and I'm asking "Why not always?" > > FWIW, Suren's initial proposal is that the oom_reaper kthread do the > reaping, not the process sending the kill. Are you suggesting that > sending SIGKILL should spend a while in signal delivery reaping pages > before returning? I thought about just doing it this way, but I didn't > like the idea: it'd slow down mass-killing programs like killall(1). > Programs expect sending SIGKILL to be a fast operation that returns > immediately. I was thinking about this as well. And a SYNC_SIGKILL would work around the current expectations of how quick the current implementation is. The harder part would be the actual semantics: does the kill wait until the target task is TASK_DEAD, or is there an intermediate step at which we could call it a day and still have reasonable semantics (e.g. the original pid is really not alive anymore)?
-- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-12 6:49 ` Michal Hocko @ 2019-04-12 14:15 ` Suren Baghdasaryan 2019-04-12 14:20 ` Daniel Colascione 0 siblings, 1 reply; 43+ messages in thread From: Suren Baghdasaryan @ 2019-04-12 14:15 UTC (permalink / raw) To: Michal Hocko Cc: Daniel Colascione, Matthew Wilcox, Andrew Morton, David Rientjes, yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Christian Brauner, Minchan Kim, Tim Murray, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, LKML, kernel-team On Thu, Apr 11, 2019 at 11:49 PM Michal Hocko <mhocko@kernel.org> wrote: > > On Thu 11-04-19 10:47:50, Daniel Colascione wrote: > > On Thu, Apr 11, 2019 at 10:36 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Thu, Apr 11, 2019 at 10:33:32AM -0700, Daniel Colascione wrote: > > > > On Thu, Apr 11, 2019 at 10:09 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > > > > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > > > > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > > > > > > victim process. The usage of this flag is currently limited to SIGKILL > > > > > > > signal and only to privileged users. > > > > > > > > > > > > What is the downside of doing expedited memory reclaim? ie why not do it > > > > > > every time a process is going to die? > > > > > > > > > > I think with an implementation that does not use/abuse oom-reaper > > > > > thread this could be done for any kill. As I mentioned oom-reaper is a > > > > > limited resource which has access to memory reserves and should not be > > > > > abused in the way I do in this reference implementation. 
> > > > > While there might be downsides that I don't know of, I'm not sure it's > > > > > required to hurry every kill's memory reclaim. I think there are cases > > > > > when resource deallocation is critical, for example when we kill to > > > > > relieve resource shortage and there are kills when reclaim speed is > > > > > not essential. It would be great if we can identify urgent cases > > > > > without userspace hints, so I'm open to suggestions that do not > > > > > involve additional flags. > > > > > > > > I was imagining a PI-ish approach where we'd reap in case an RT > > > > process was waiting on the death of some other process. I'd still > > > > prefer the API I proposed in the other message because it gets the > > > > kernel out of the business of deciding what the right signal is. I'm a > > > > huge believer in "mechanism, not policy". > > > > > > It's not a question of the kernel deciding what the right signal is. > > > The kernel knows whether a signal is fatal to a particular process or not. > > > The question is whether the killing process should do the work of reaping > > > the dying process's resources sometimes, always or never. Currently, > > > that is never (the process reaps its own resources); Suren is suggesting > > > sometimes, and I'm asking "Why not always?" > > > > FWIW, Suren's initial proposal is that the oom_reaper kthread do the > > reaping, not the process sending the kill. Are you suggesting that > > sending SIGKILL should spend a while in signal delivery reaping pages > > before returning? I thought about just doing it this way, but I didn't > > like the idea: it'd slow down mass-killing programs like killall(1). > > Programs expect sending SIGKILL to be a fast operation that returns > > immediately. > > I was thinking about this as well. And SYNC_SIGKILL would workaround the > current expectations of how quick the current implementation is. The > harder part would what is the actual semantic. 
Does the kill wait until > the target task is TASK_DEAD or is there an intermediate step that would > we could call it end of the day and still have a reasonable semantic > (e.g. the original pid is really not alive anymore). I think Daniel's proposal was trying to address that. With an input of how many pages the user wants to reclaim asynchronously, and a return value of how many were actually reclaimed, it contains both the condition for when to stop and a report of how successfully we accomplished that. Since it returns the number of pages reclaimed, I assume the call does not return until it has reaped enough pages. > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-12 14:15 ` Suren Baghdasaryan @ 2019-04-12 14:20 ` Daniel Colascione 0 siblings, 0 replies; 43+ messages in thread From: Daniel Colascione @ 2019-04-12 14:20 UTC (permalink / raw) To: Suren Baghdasaryan Cc: Michal Hocko, Matthew Wilcox, Andrew Morton, David Rientjes, yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Christian Brauner, Minchan Kim, Tim Murray, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, LKML, kernel-team On Fri, Apr 12, 2019 at 7:15 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Thu, Apr 11, 2019 at 11:49 PM Michal Hocko <mhocko@kernel.org> wrote: > > > > On Thu 11-04-19 10:47:50, Daniel Colascione wrote: > > > On Thu, Apr 11, 2019 at 10:36 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Thu, Apr 11, 2019 at 10:33:32AM -0700, Daniel Colascione wrote: > > > > > On Thu, Apr 11, 2019 at 10:09 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > > > > > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > > > > > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > > > > > > > victim process. The usage of this flag is currently limited to SIGKILL > > > > > > > > signal and only to privileged users. > > > > > > > > > > > > > > What is the downside of doing expedited memory reclaim? ie why not do it > > > > > > > every time a process is going to die? > > > > > > > > > > > > I think with an implementation that does not use/abuse oom-reaper > > > > > > thread this could be done for any kill. 
As I mentioned oom-reaper is a > > > > > > limited resource which has access to memory reserves and should not be > > > > > > abused in the way I do in this reference implementation. > > > > > > While there might be downsides that I don't know of, I'm not sure it's > > > > > > required to hurry every kill's memory reclaim. I think there are cases > > > > > > when resource deallocation is critical, for example when we kill to > > > > > > relieve resource shortage and there are kills when reclaim speed is > > > > > > not essential. It would be great if we can identify urgent cases > > > > > > without userspace hints, so I'm open to suggestions that do not > > > > > > involve additional flags. > > > > > > > > > > I was imagining a PI-ish approach where we'd reap in case an RT > > > > > process was waiting on the death of some other process. I'd still > > > > > prefer the API I proposed in the other message because it gets the > > > > > kernel out of the business of deciding what the right signal is. I'm a > > > > > huge believer in "mechanism, not policy". > > > > > > > > It's not a question of the kernel deciding what the right signal is. > > > > The kernel knows whether a signal is fatal to a particular process or not. > > > > The question is whether the killing process should do the work of reaping > > > > the dying process's resources sometimes, always or never. Currently, > > > > that is never (the process reaps its own resources); Suren is suggesting > > > > sometimes, and I'm asking "Why not always?" > > > > > > FWIW, Suren's initial proposal is that the oom_reaper kthread do the > > > reaping, not the process sending the kill. Are you suggesting that > > > sending SIGKILL should spend a while in signal delivery reaping pages > > > before returning? I thought about just doing it this way, but I didn't > > > like the idea: it'd slow down mass-killing programs like killall(1). 
> > > Programs expect sending SIGKILL to be a fast operation that returns > > > immediately. > > > > I was thinking about this as well. And SYNC_SIGKILL would workaround the SYNC_SIGKILL (which, I presume, blocks in kill(2)) was proposed on many occasions while we discussed pidfd waits over the past six months or so. We've decided to just make pidfds pollable instead. The kernel already has several ways to express the idea that a task should wait for another task to die, and I don't think we need another. If you want a process that's waiting for a task to exit to help reap that task, great --- that's an option we've talked about --- but we don't need a new interface to do it, since the kernel already has all the information it needs. > > current expectations of how quick the current implementation is. The > > harder part would what is the actual semantic. Does the kill wait until > > the target task is TASK_DEAD or is there an intermediate step that would > > we could call it end of the day and still have a reasonable semantic > > (e.g. the original pid is really not alive anymore). > > I think Daniel's proposal was trying to address that. With an input of > how many pages user wants to reclaim asynchronously and return value > of how much was actually reclaimed it contains the condition when to > stop and the reply how successful we could accomplish that. Since it > returns the number of pages reclaimed I assume the call does not > return until it reaped enough pages. Right. I want to punt as much "policy" as possible to userspace. Just using a user thread to do the reaping not only solves the policy problem (since it's userspace that controls priority, affinity, retries, and so on), but also simplifies the implementation kernel-side. I can imagine situations where, depending on device energy state, or even charger or screen state, we might want to reap more or less aggressively, or not at all.
I wouldn't want to burden the kernel with having to get that right when userspace could make the decision. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-11 17:47 ` Daniel Colascione 2019-04-12 6:49 ` Michal Hocko @ 2019-04-12 21:03 ` Matthew Wilcox 1 sibling, 0 replies; 43+ messages in thread From: Matthew Wilcox @ 2019-04-12 21:03 UTC (permalink / raw) To: Daniel Colascione Cc: Suren Baghdasaryan, Andrew Morton, Michal Hocko, David Rientjes, yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Christian Brauner, Minchan Kim, Tim Murray, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, LKML, kernel-team On Thu, Apr 11, 2019 at 10:47:50AM -0700, Daniel Colascione wrote: > On Thu, Apr 11, 2019 at 10:36 AM Matthew Wilcox <willy@infradead.org> wrote: > > It's not a question of the kernel deciding what the right signal is. > > The kernel knows whether a signal is fatal to a particular process or not. > > The question is whether the killing process should do the work of reaping > > the dying process's resources sometimes, always or never. Currently, > > that is never (the process reaps its own resources); Suren is suggesting > > sometimes, and I'm asking "Why not always?" > > FWIW, Suren's initial proposal is that the oom_reaper kthread do the > reaping, not the process sending the kill. Are you suggesting that > sending SIGKILL should spend a while in signal delivery reaping pages > before returning? I thought about just doing it this way, but I didn't > like the idea: it'd slow down mass-killing programs like killall(1). > Programs expect sending SIGKILL to be a fast operation that returns > immediately. Suren said that the implementation in this proposal wasn't to be taken literally. You've raised some good points here though. Do mass-killing programs like kill with a pgid or killall expect the signal-sending process to be fast, or do they not really care? From the kernel's point of view, the same work has to be done, no matter whether the killer or the victim does it. 
Should the work be accounted to the killer or the victim? It probably doesn't matter. The victim doing the work allows for parallelisation, but I'm not convinced that's important either. I see another advantage for the killer doing the work -- if the task is currently blocking on I/O (eg to an NFS mount that has gone away), the killer can get rid of the task's page tables. If we have to wait for the I/O to complete before the victim reaps its own page tables, we may be waiting forever. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-11 17:36 ` Matthew Wilcox 2019-04-11 17:47 ` Daniel Colascione @ 2019-04-11 17:52 ` Suren Baghdasaryan 1 sibling, 0 replies; 43+ messages in thread From: Suren Baghdasaryan @ 2019-04-11 17:52 UTC (permalink / raw) To: Matthew Wilcox Cc: Daniel Colascione, Andrew Morton, Michal Hocko, David Rientjes, yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Christian Brauner, Minchan Kim, Tim Murray, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, LKML, kernel-team On Thu, Apr 11, 2019 at 10:36 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, Apr 11, 2019 at 10:33:32AM -0700, Daniel Colascione wrote: > > On Thu, Apr 11, 2019 at 10:09 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > > > > victim process. The usage of this flag is currently limited to SIGKILL > > > > > signal and only to privileged users. > > > > > > > > What is the downside of doing expedited memory reclaim? ie why not do it > > > > every time a process is going to die? > > > > > > I think with an implementation that does not use/abuse oom-reaper > > > thread this could be done for any kill. As I mentioned oom-reaper is a > > > limited resource which has access to memory reserves and should not be > > > abused in the way I do in this reference implementation. > > > While there might be downsides that I don't know of, I'm not sure it's > > > required to hurry every kill's memory reclaim. 
I think there are cases > > > when resource deallocation is critical, for example when we kill to > > > relieve resource shortage and there are kills when reclaim speed is > > > not essential. It would be great if we can identify urgent cases > > > without userspace hints, so I'm open to suggestions that do not > > > involve additional flags. > > > > I was imagining a PI-ish approach where we'd reap in case an RT > > process was waiting on the death of some other process. I'd still > > prefer the API I proposed in the other message because it gets the > > kernel out of the business of deciding what the right signal is. I'm a > > huge believer in "mechanism, not policy". > > It's not a question of the kernel deciding what the right signal is. > The kernel knows whether a signal is fatal to a particular process or not. > The question is whether the killing process should do the work of reaping > the dying process's resources sometimes, always or never. Currently, > that is never (the process reaps its own resources); Suren is suggesting > sometimes, and I'm asking "Why not always?" If there are no downsides to always doing this (like using resources that could be utilized in a better way), then by all means, let's do it unconditionally. My current implementation is not one of those cases :) I think with an implementation in which the killing process does the reaping of the victim's mm, this could be done unconditionally, because we would not be using resources that might otherwise be put to better use. Overall I like Daniel's idea of having the process that requested the killing, and is waiting for the victim to die, help in reaping its memory. It kind of naturally elevates the priority of the reaping if the priority of the waiting process is higher than the victim's priority (a kind of priority inheritance). ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-11 17:09 ` Suren Baghdasaryan 2019-04-11 17:33 ` Daniel Colascione @ 2019-04-11 21:45 ` Roman Gushchin 2019-04-11 21:59 ` Suren Baghdasaryan 1 sibling, 1 reply; 43+ messages in thread From: Roman Gushchin @ 2019-04-11 21:45 UTC (permalink / raw) To: Suren Baghdasaryan Cc: Matthew Wilcox, Andrew Morton, mhocko, David Rientjes, yuzhoujian, Souptick Joarder, Johannes Weiner, Tetsuo Handa, ebiederm, Shakeel Butt, Christian Brauner, Minchan Kim, Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, LKML, kernel-team On Thu, Apr 11, 2019 at 10:09:06AM -0700, Suren Baghdasaryan wrote: > On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > > victim process. The usage of this flag is currently limited to SIGKILL > > > signal and only to privileged users. > > > > What is the downside of doing expedited memory reclaim? ie why not do it > > every time a process is going to die? Hello, Suren! I also like the idea to reap always. > I think with an implementation that does not use/abuse oom-reaper > thread this could be done for any kill. As I mentioned oom-reaper is a > limited resource which has access to memory reserves and should not be > abused in the way I do in this reference implementation. In most OOM cases it doesn't matter that much which task to reap, so I don't think that reusing the oom-reaper thread is bad. It should be relatively easy to tweak it so that it won't wait for mmap_sem if there are other tasks waiting to be reaped. Also, the oom code could add to the head of the list and expedited kills to the tail, or something like this.
The only thing is, if we're going to reap all tasks, we probably want to have a per-node oom_reaper thread. Thanks! ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-11 21:45 ` Roman Gushchin @ 2019-04-11 21:59 ` Suren Baghdasaryan 0 siblings, 0 replies; 43+ messages in thread From: Suren Baghdasaryan @ 2019-04-11 21:59 UTC (permalink / raw) To: Roman Gushchin Cc: Matthew Wilcox, Andrew Morton, mhocko, David Rientjes, yuzhoujian, Souptick Joarder, Johannes Weiner, Tetsuo Handa, ebiederm, Shakeel Butt, Christian Brauner, Minchan Kim, Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, LKML, kernel-team On Thu, Apr 11, 2019 at 2:45 PM Roman Gushchin <guro@fb.com> wrote: > > On Thu, Apr 11, 2019 at 10:09:06AM -0700, Suren Baghdasaryan wrote: > > On Thu, Apr 11, 2019 at 8:33 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > > > victim process. The usage of this flag is currently limited to SIGKILL > > > > signal and only to privileged users. > > > > > > What is the downside of doing expedited memory reclaim? ie why not do it > > > every time a process is going to die? > > Hello, Suren! > > I also like the idea to reap always. > > > I think with an implementation that does not use/abuse oom-reaper > > thread this could be done for any kill. As I mentioned oom-reaper is a > > limited resource which has access to memory reserves and should not be > > abused in the way I do in this reference implementation. > > In most OOM cases it doesn't matter that much which task to reap, > so I don't think that reusing the oom-reaper thread is bad. > It should be relatively easy to tweak in a way, that it won't > wait for mmap_sem if there are other tasks waiting to be reaped. > Also, the oom code add to the head of the list, and the expedited > killing to the end, or something like this. 
> > The only think, if we're going to reap all tasks, we probably > want to have a per-node oom_reaper thread. Thanks for the ideas, Roman. I'll take some time to digest the input from everybody. What I heard from everyone is that we want this to be a part of the generic kill functionality which does not require a change in the userspace API. > Thanks! ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-11 15:33 ` Matthew Wilcox 2019-04-11 17:05 ` Johannes Weiner 2019-04-11 17:09 ` Suren Baghdasaryan @ 2019-04-12 6:53 ` Michal Hocko 2019-04-12 14:10 ` Suren Baghdasaryan 2019-04-12 14:14 ` Daniel Colascione 2 siblings, 2 replies; 43+ messages in thread From: Michal Hocko @ 2019-04-12 6:53 UTC (permalink / raw) To: Matthew Wilcox Cc: Suren Baghdasaryan, akpm, rientjes, yuzhoujian, jrdr.linux, guro, hannes, penguin-kernel, ebiederm, shakeelb, christian, minchan, timmurray, dancol, joel, jannh, linux-mm, lsf-pc, linux-kernel, kernel-team On Thu 11-04-19 08:33:13, Matthew Wilcox wrote: > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > victim process. The usage of this flag is currently limited to SIGKILL > > signal and only to privileged users. > > What is the downside of doing expedited memory reclaim? ie why not do it > every time a process is going to die? Well, you are tearing down an address space which might still be in use because the task is not fully dead yet. So there are two downsides AFAICS. Core dumping will not see the reaped memory, so the resulting coredump might be incomplete. And an unexpected #PF/gup on the reaped memory will result in SIGBUS. These are things that we have closed our eyes to in the oom context because they likely do not matter there. If we want to use the same technique for other usecases then we have to think about how much they matter again. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-12 6:53 ` Michal Hocko @ 2019-04-12 14:10 ` Suren Baghdasaryan 2019-04-12 14:14 ` Daniel Colascione 1 sibling, 0 replies; 43+ messages in thread From: Suren Baghdasaryan @ 2019-04-12 14:10 UTC (permalink / raw) To: Michal Hocko Cc: Matthew Wilcox, Andrew Morton, David Rientjes, yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Christian Brauner, Minchan Kim, Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, LKML, kernel-team On Thu, Apr 11, 2019 at 11:53 PM Michal Hocko <mhocko@kernel.org> wrote: > > On Thu 11-04-19 08:33:13, Matthew Wilcox wrote: > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > > victim process. The usage of this flag is currently limited to SIGKILL > > > signal and only to privileged users. > > > > What is the downside of doing expedited memory reclaim? ie why not do it > > every time a process is going to die? > > Well, you are tearing down an address space which might be still in use > because the task not fully dead yeat. So there are two downsides AFAICS. > Core dumping which will not see the reaped memory so the resulting > coredump might be incomplete. And unexpected #PF/gup on the reaped > memory will result in SIGBUS. These are things that we have closed our > eyes in the oom context because they likely do not matter. If we want to > use the same technique for other usecases then we have to think how much > that matter again. > Sorry, resending with corrected settings... After some internal discussions we realized that there is one additional downside of doing reaping unconditionally. 
If this async reaping is done on a more performant but energy-hungrier CPU irrespective of the urgency of the kill, it will end up costing more power overall (I’m referring here to asymmetric architectures like ARM big.LITTLE). Obviously, quantifying that cost is not easy, as it depends on the usecase and the particular system, but it won’t be zero. So I think we will need some gating condition after all. > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-12 6:53 ` Michal Hocko 2019-04-12 14:10 ` Suren Baghdasaryan @ 2019-04-12 14:14 ` Daniel Colascione 2019-04-12 15:30 ` Daniel Colascione 2019-04-25 16:09 ` Suren Baghdasaryan 1 sibling, 2 replies; 43+ messages in thread From: Daniel Colascione @ 2019-04-12 14:14 UTC (permalink / raw) To: Michal Hocko Cc: Matthew Wilcox, Suren Baghdasaryan, Andrew Morton, David Rientjes, yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Christian Brauner, Minchan Kim, Tim Murray, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, linux-kernel, Android Kernel Team On Thu, Apr 11, 2019 at 11:53 PM Michal Hocko <mhocko@kernel.org> wrote: > > On Thu 11-04-19 08:33:13, Matthew Wilcox wrote: > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > > victim process. The usage of this flag is currently limited to SIGKILL > > > signal and only to privileged users. > > > > What is the downside of doing expedited memory reclaim? ie why not do it > > every time a process is going to die? > > Well, you are tearing down an address space which might be still in use > because the task not fully dead yeat. So there are two downsides AFAICS. > Core dumping which will not see the reaped memory so the resulting Test for SIGNAL_GROUP_COREDUMP before doing any of this then. If you try to start a core dump after reaping begins, too bad: you could have raced with process death anyway. > coredump might be incomplete. And unexpected #PF/gup on the reaped > memory will result in SIGBUS. It's a dying process. Why even bother returning from the fault handler? Just treat that situation as a thread exit. There's no need to make this observable to userspace at all. 
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-12 14:14 ` Daniel Colascione @ 2019-04-12 15:30 ` Daniel Colascione 2019-04-25 16:09 ` Suren Baghdasaryan 1 sibling, 0 replies; 43+ messages in thread From: Daniel Colascione @ 2019-04-12 15:30 UTC (permalink / raw) To: Michal Hocko Cc: Matthew Wilcox, Suren Baghdasaryan, Andrew Morton, David Rientjes, yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Christian Brauner, Minchan Kim, Tim Murray, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, linux-kernel, Android Kernel Team On Fri, Apr 12, 2019 at 7:14 AM Daniel Colascione <dancol@google.com> wrote: > > On Thu, Apr 11, 2019 at 11:53 PM Michal Hocko <mhocko@kernel.org> wrote: > > > > On Thu 11-04-19 08:33:13, Matthew Wilcox wrote: > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > > > victim process. The usage of this flag is currently limited to SIGKILL > > > > signal and only to privileged users. > > > > > > What is the downside of doing expedited memory reclaim? ie why not do it > > > every time a process is going to die? > > > > Well, you are tearing down an address space which might be still in use > > because the task not fully dead yeat. So there are two downsides AFAICS. > > Core dumping which will not see the reaped memory so the resulting > > Test for SIGNAL_GROUP_COREDUMP before doing any of this then. If you > try to start a core dump after reaping begins, too bad: you could have > raced with process death anyway. > > > coredump might be incomplete. And unexpected #PF/gup on the reaped > > memory will result in SIGBUS. > > It's a dying process. Why even bother returning from the fault > handler? Just treat that situation as a thread exit. 
> There's no need > to make this observable to userspace at all. Just for clarity, checking the code, I think we already do this. zap_other_threads sets SIGKILL pending on every thread in the group, and we'll handle SIGKILL in the process of taking any page fault or doing any system call, so I don't think it's actually possible for a thread in a dying process to observe the SIGBUS that reaping in theory can generate. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing 2019-04-12 14:14 ` Daniel Colascione 2019-04-12 15:30 ` Daniel Colascione @ 2019-04-25 16:09 ` Suren Baghdasaryan 1 sibling, 0 replies; 43+ messages in thread From: Suren Baghdasaryan @ 2019-04-25 16:09 UTC (permalink / raw) To: Daniel Colascione Cc: Michal Hocko, Matthew Wilcox, Suren Baghdasaryan, Andrew Morton, David Rientjes, yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Christian Brauner, Minchan Kim, Tim Murray, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, linux-kernel, Android Kernel Team, Oleg Nesterov On Fri, Apr 12, 2019 at 7:14 AM Daniel Colascione <dancol@google.com> wrote: > > On Thu, Apr 11, 2019 at 11:53 PM Michal Hocko <mhocko@kernel.org> wrote: > > > > On Thu 11-04-19 08:33:13, Matthew Wilcox wrote: > > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote: > > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via > > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the > > > > victim process. The usage of this flag is currently limited to SIGKILL > > > > signal and only to privileged users. > > > > > > What is the downside of doing expedited memory reclaim? ie why not do it > > > every time a process is going to die? > > > > Well, you are tearing down an address space which might be still in use > > because the task not fully dead yeat. So there are two downsides AFAICS. > > Core dumping which will not see the reaped memory so the resulting > > Test for SIGNAL_GROUP_COREDUMP before doing any of this then. If you > try to start a core dump after reaping begins, too bad: you could have > raced with process death anyway. > > > coredump might be incomplete. And unexpected #PF/gup on the reaped > > memory will result in SIGBUS. > > It's a dying process. Why even bother returning from the fault > handler? Just treat that situation as a thread exit. 
> There's no need > to make this observable to userspace at all. I've spent some more time to investigate possible effects of reaping on coredumps and asked Oleg Nesterov, who worked on patchsets that prioritize SIGKILLs over coredump activity (https://lkml.org/lkml/2013/2/17/118). The current do_coredump implementation seems to handle the case of SIGKILL interruption by bailing out whenever dump_interrupted() returns true, and that would be the case with a pending SIGKILL. So in the race where the coredump happens first and SIGKILL comes next, interrupting the coredump seems to result in no change in behavior, and reaping memory proactively seems to have no side effects. The opposite race, where SIGKILL gets posted and then a coredump happens, seems impossible because do_coredump won't be called from get_signal due to the signal_group_exit check (get_signal checks signal_group_exit while holding sighand->siglock and complete_signal sets SIGNAL_GROUP_EXIT while holding the same lock). There is a path from __seccomp_filter calling do_coredump while processing SECCOMP_RET_KILL_xxx, but even then it should bail out when coredump_wait()->zap_threads(current) checks signal_group_exit(). (Thanks Oleg for clarifying this for me). If we are really concerned about a possible increase in failed coredumps because of the proactive reaping, I could make it conditional on whether coredumping the mm is possible by placing this feature behind a !get_dumpable(mm) condition. Another possibility is to check RLIMIT_CORE to decide if coredumps are possible (although if a pipe is used for the coredump that limit seems to be ignored, so that check would have to take this into consideration). On the issue of SIGBUS happening when accessed memory was already reaped, my understanding is that SIGBUS, being a synchronous signal, will still have to be fetched using dequeue_synchronous_signal from get_signal, but not before signal_group_exit is checked.
So again, if SIGKILL is pending, I think SIGBUS will be ignored (please correct me if that's not correct). One additional question I would like to clarify is whether per-node reapers, as Roman suggested, would make a big difference (all CPUs that I've seen used for Android are single-node ones, so I'm looking for more feedback here). If it's important, then reaping the victim's memory by the killer is probably not an option. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 0/2] opportunistic memory reclaim of a killed process 2019-04-11 1:43 [RFC 0/2] opportunistic memory reclaim of a killed process Suren Baghdasaryan 2019-04-11 1:43 ` [RFC 1/2] mm: oom: expose expedite_reclaim to use oom_reaper outside of oom_kill.c Suren Baghdasaryan 2019-04-11 1:43 ` [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing Suren Baghdasaryan @ 2019-04-11 10:51 ` Michal Hocko 2019-04-11 16:18 ` Joel Fernandes ` (3 more replies) 2019-04-11 11:51 ` [Lsf-pc] " Rik van Riel 3 siblings, 4 replies; 43+ messages in thread From: Michal Hocko @ 2019-04-11 10:51 UTC (permalink / raw) To: Suren Baghdasaryan Cc: akpm, rientjes, willy, yuzhoujian, jrdr.linux, guro, hannes, penguin-kernel, ebiederm, shakeelb, christian, minchan, timmurray, dancol, joel, jannh, linux-mm, lsf-pc, linux-kernel, kernel-team On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote: [...] > Proposed solution uses existing oom-reaper thread to increase memory > reclaim rate of a killed process and to make this rate more deterministic. > By no means the proposed solution is considered the best and was chosen > because it was simple to implement and allowed for test data collection. > The downside of this solution is that it requires additional “expedite” > hint for something which has to be fast in all cases. Would be great to > find a way that does not require additional hints. I have to say I do not like this much. It is abusing an implementation detail of the OOM implementation and makes it an official API. Also there are some non-trivial assumptions to be fulfilled to use the current oom_reaper. First of all, all the process groups that share the address space have to be killed. How do you want to guarantee/implement that with a simple kill to a thread/process group? > Other possible approaches include: > - Implementing a dedicated syscall to perform opportunistic reclaim in the > context of the process waiting for the victim’s death.
> A natural boost > bonus occurs if the waiting process has high or RT priority and is not > limited by cpuset cgroup in its CPU choices. > - Implement a mechanism that would perform opportunistic reclaim if it’s > possible unconditionally (similar to checks in task_will_free_mem()). > - Implement opportunistic reclaim that uses shrinker interface, PSI or > other memory pressure indications as a hint to engage. I would question whether we really need this at all? Relying on the exit speed sounds like a fundamental design problem of anything that relies on it. Sure, task exit might be slow, but async mm tear down is just a mere optimization that is not guaranteed to really help in speeding things up. The OOM killer uses it as a guarantee of forward progress in a finite time rather than as soon as possible. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 0/2] opportunistic memory reclaim of a killed process 2019-04-11 10:51 ` [RFC 0/2] opportunistic memory reclaim of a killed process Michal Hocko @ 2019-04-11 16:18 ` Joel Fernandes 2019-04-11 18:12 ` Michal Hocko 2019-04-11 16:20 ` Sandeep Patil ` (2 subsequent siblings) 3 siblings, 1 reply; 43+ messages in thread From: Joel Fernandes @ 2019-04-11 16:18 UTC (permalink / raw) To: Michal Hocko Cc: Suren Baghdasaryan, Andrew Morton, David Rientjes, Matthew Wilcox, yuzhoujian, jrdr.linux, guro, Johannes Weiner, penguin-kernel, ebiederm, shakeelb, Christian Brauner, Minchan Kim, Tim Murray, Daniel Colascione, Joel Fernandes (Google), Jann Horn, open list:MEMORY MANAGEMENT, lsf-pc, LKML, Cc: Android Kernel On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko <mhocko@kernel.org> wrote: > > On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote: > [...] > > Proposed solution uses existing oom-reaper thread to increase memory > > reclaim rate of a killed process and to make this rate more deterministic. > > By no means the proposed solution is considered the best and was chosen > > because it was simple to implement and allowed for test data collection. > > The downside of this solution is that it requires additional “expedite” > > hint for something which has to be fast in all cases. Would be great to > > find a way that does not require additional hints. > > I have to say I do not like this much. It is abusing an implementation > detail of the OOM implementation and makes it an official API. Also > there are some non trivial assumptions to be fullfilled to use the > current oom_reaper. First of all all the process groups that share the > address space have to be killed. How do you want to guarantee/implement > that with a simply kill to a thread/process group? Will task_will_free_mem() not bail out in such cases because of process_shares_mm() returning true? AFAIU, Suren's patch calls that. 
Also, if I understand correctly, this patch is opportunistic and knows that it may not be possible to reap in advance this way in all cases. /* * Make sure that all tasks which share the mm with the given tasks * are dying as well to make sure that a) nobody pins its mm and * b) the task is also reapable by the oom reaper. */ rcu_read_lock(); for_each_process(p) { if (!process_shares_mm(p, mm)) > > Other possible approaches include: > > - Implementing a dedicated syscall to perform opportunistic reclaim in the > > context of the process waiting for the victim’s death. A natural boost > > bonus occurs if the waiting process has high or RT priority and is not > > limited by cpuset cgroup in its CPU choices. > > - Implement a mechanism that would perform opportunistic reclaim if it’s > > possible unconditionally (similar to checks in task_will_free_mem()). > > - Implement opportunistic reclaim that uses shrinker interface, PSI or > > other memory pressure indications as a hint to engage. > > I would question whether we really need this at all? Relying on the exit > speed sounds like a fundamental design problem of anything that relies > on it. Sure task exit might be slow, but async mm tear down is just a > mere optimization this is not guaranteed to really help in speading > things up. OOM killer uses it as a guarantee for a forward progress in a > finite time rather than as soon as possible. Per the data collected by Suren, it does speed things up. It would be nice if we could reuse this mechanism or come up with a similar one. thanks, - Joel ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 0/2] opportunistic memory reclaim of a killed process 2019-04-11 16:18 ` Joel Fernandes @ 2019-04-11 18:12 ` Michal Hocko 2019-04-11 19:14 ` Joel Fernandes 0 siblings, 1 reply; 43+ messages in thread From: Michal Hocko @ 2019-04-11 18:12 UTC (permalink / raw) To: Joel Fernandes Cc: Suren Baghdasaryan, Andrew Morton, David Rientjes, Matthew Wilcox, yuzhoujian, jrdr.linux, guro, Johannes Weiner, penguin-kernel, ebiederm, shakeelb, Christian Brauner, Minchan Kim, Tim Murray, Daniel Colascione, Joel Fernandes (Google), Jann Horn, open list:MEMORY MANAGEMENT, lsf-pc, LKML, Cc: Android Kernel On Thu 11-04-19 12:18:33, Joel Fernandes wrote: > On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote: > > [...] > > > Proposed solution uses existing oom-reaper thread to increase memory > > > reclaim rate of a killed process and to make this rate more deterministic. > > > By no means the proposed solution is considered the best and was chosen > > > because it was simple to implement and allowed for test data collection. > > > The downside of this solution is that it requires additional “expedite” > > > hint for something which has to be fast in all cases. Would be great to > > > find a way that does not require additional hints. > > > > I have to say I do not like this much. It is abusing an implementation > > detail of the OOM implementation and makes it an official API. Also > > there are some non trivial assumptions to be fullfilled to use the > > current oom_reaper. First of all all the process groups that share the > > address space have to be killed. How do you want to guarantee/implement > > that with a simply kill to a thread/process group? > > Will task_will_free_mem() not bail out in such cases because of > process_shares_mm() returning true? I am not really sure I understand your question. 
task_will_free_mem is just a shortcut to not kill anything if the current process or a victim is already dying and likely to free memory without killing or spamming the log. My concern is that this patch allows invoking the reaper without guaranteeing the same. So it can only be an optimistic attempt, and then I am wondering how reasonable of an interface this really is. Userspace sends the signal and has no way to find out whether the async reaping has been scheduled or not. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 0/2] opportunistic memory reclaim of a killed process 2019-04-11 18:12 ` Michal Hocko @ 2019-04-11 19:14 ` Joel Fernandes 2019-04-11 20:11 ` Michal Hocko 0 siblings, 1 reply; 43+ messages in thread From: Joel Fernandes @ 2019-04-11 19:14 UTC (permalink / raw) To: Michal Hocko Cc: Suren Baghdasaryan, Andrew Morton, David Rientjes, Matthew Wilcox, yuzhoujian, jrdr.linux, guro, Johannes Weiner, penguin-kernel, ebiederm, shakeelb, Christian Brauner, Minchan Kim, Tim Murray, Daniel Colascione, Jann Horn, open list:MEMORY MANAGEMENT, lsf-pc, LKML, Cc: Android Kernel On Thu, Apr 11, 2019 at 08:12:43PM +0200, Michal Hocko wrote: > On Thu 11-04-19 12:18:33, Joel Fernandes wrote: > > On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > > > On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote: > > > [...] > > > > Proposed solution uses existing oom-reaper thread to increase memory > > > > reclaim rate of a killed process and to make this rate more deterministic. > > > > By no means the proposed solution is considered the best and was chosen > > > > because it was simple to implement and allowed for test data collection. > > > > The downside of this solution is that it requires additional “expedite” > > > > hint for something which has to be fast in all cases. Would be great to > > > > find a way that does not require additional hints. > > > > > > I have to say I do not like this much. It is abusing an implementation > > > detail of the OOM implementation and makes it an official API. Also > > > there are some non trivial assumptions to be fullfilled to use the > > > current oom_reaper. First of all all the process groups that share the > > > address space have to be killed. How do you want to guarantee/implement > > > that with a simply kill to a thread/process group? > > > > Will task_will_free_mem() not bail out in such cases because of > > process_shares_mm() returning true? > > I am not really sure I understand your question. 
> task_will_free_mem is > just a shortcut to not kill anything if the current process or a victim > is already dying and likely to free memory without killing or spamming > the log. My concern is that this patch allows to invoke the reaper Got it. > without guaranteeing the same. So it can only be an optimistic attempt > and then I am wondering how reasonable of an interface this really is. > Userspace send the signal and has no way to find out whether the async > reaping has been scheduled or not. Could you clarify more what you're asking to guarantee? I cannot picture it. If you mean guaranteeing that "a task is dying anyway and will free its memory on its own", we are calling task_will_free_mem() to check that before invoking the oom reaper. Could you clarify what the drawback is if the OOM reaper is invoked in parallel to an exiting task which will free its memory soon? It looks like the OOM reaper is taking all the necessary locks (mmap_sem in particular) and is unmapping pages. It seemed to me to be safe, but I am missing what the main drawbacks of this are - other than the interference with core dump. One could presumably be scalability, since the OOM reaper could be bottlenecked by freeing memory on behalf of potentially several dying tasks. IIRC this patch is just OK with being opportunistic and it need not be hidden behind an API necessarily or need any guarantees. It is just providing a hint that the OOM reaper could be woken up to expedite things. If a task is going to be taking a long time to be scheduled and free its memory, the oom reaper gives it a head start. Many times, background tasks can be killed but they may not necessarily have sufficient scheduler priority / cpuset (being in the background) and may be holding onto a lot of memory that needs to be reclaimed. I am not saying this is the right way to do it, but I also wanted us to understand the drawbacks so that we can go back to the drawing board and come up with something better. Thanks!
- Joel ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 0/2] opportunistic memory reclaim of a killed process 2019-04-11 19:14 ` Joel Fernandes @ 2019-04-11 20:11 ` Michal Hocko 2019-04-11 21:11 ` Joel Fernandes 0 siblings, 1 reply; 43+ messages in thread From: Michal Hocko @ 2019-04-11 20:11 UTC (permalink / raw) To: Joel Fernandes Cc: Suren Baghdasaryan, Andrew Morton, David Rientjes, Matthew Wilcox, yuzhoujian, jrdr.linux, guro, Johannes Weiner, penguin-kernel, ebiederm, shakeelb, Christian Brauner, Minchan Kim, Tim Murray, Daniel Colascione, Jann Horn, open list:MEMORY MANAGEMENT, lsf-pc, LKML, Cc: Android Kernel On Thu 11-04-19 15:14:30, Joel Fernandes wrote: > On Thu, Apr 11, 2019 at 08:12:43PM +0200, Michal Hocko wrote: > > On Thu 11-04-19 12:18:33, Joel Fernandes wrote: > > > On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > > > > > On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote: > > > > [...] > > > > > Proposed solution uses existing oom-reaper thread to increase memory > > > > > reclaim rate of a killed process and to make this rate more deterministic. > > > > > By no means the proposed solution is considered the best and was chosen > > > > > because it was simple to implement and allowed for test data collection. > > > > > The downside of this solution is that it requires additional “expedite” > > > > > hint for something which has to be fast in all cases. Would be great to > > > > > find a way that does not require additional hints. > > > > > > > > I have to say I do not like this much. It is abusing an implementation > > > > detail of the OOM implementation and makes it an official API. Also > > > > there are some non trivial assumptions to be fullfilled to use the > > > > current oom_reaper. First of all all the process groups that share the > > > > address space have to be killed. How do you want to guarantee/implement > > > > that with a simply kill to a thread/process group? 
> > > > > > Will task_will_free_mem() not bail out in such cases because of > > > process_shares_mm() returning true? > > > > I am not really sure I understand your question. task_will_free_mem is > > just a shortcut to not kill anything if the current process or a victim > > is already dying and likely to free memory without killing or spamming > > the log. My concern is that this patch allows to invoke the reaper > > Got it. > > > without guaranteeing the same. So it can only be an optimistic attempt > > and then I am wondering how reasonable of an interface this really is. > > Userspace send the signal and has no way to find out whether the async > > reaping has been scheduled or not. > > Could you clarify more what you're asking to guarantee? I cannot picture it. > If you mean guaranteeing that "a task is dying anyway and will free its > memory on its own", we are calling task_will_free_mem() to check that before > invoking the oom reaper. No, I am talking about the API aspect. Say you call kill with the flag to make the async address space tear down. Now you cannot really guarantee that this is safe to do because the target task might clone(CLONE_VM) at any time. So this will be known only once the signal is sent, but the calling process has no way to find out. So the caller has no way to know what is the actual result of the requested operation. That is a poor API in my book. > Could you clarify what is the draback if OOM reaper is invoked in parallel to > an exiting task which will free its memory soon? It looks like the OOM reaper > is taking all the locks necessary (mmap_sem) in particular and is unmapping > pages. It seemed to me to be safe, but I am missing what are the main draw > backs of this - other than the intereference with core dump. One could be > presumably scalability since the since OOM reaper could be bottlenecked by > freeing memory on behalf of potentially several dying tasks.
oom_reaper or any other kernel thread doing the same is a mere implementation detail I think. The oom killer doesn't really need the oom_reaper to act swiftly because it is there to act as a last resort if the oom victim cannot terminate on its own. If you want to offer a user space API then you can assume users will like to use it and expect a certain behavior, but what is that? E.g. what if there are thousands of tasks killed this way? Do we care that some of them will not get the async treatment? If yes, why do we need an API to control that at all? Am I more clear now? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 0/2] opportunistic memory reclaim of a killed process 2019-04-11 20:11 ` Michal Hocko @ 2019-04-11 21:11 ` Joel Fernandes 0 siblings, 0 replies; 43+ messages in thread From: Joel Fernandes @ 2019-04-11 21:11 UTC (permalink / raw) To: Michal Hocko Cc: Suren Baghdasaryan, Andrew Morton, David Rientjes, Matthew Wilcox, yuzhoujian, jrdr.linux, guro, Johannes Weiner, penguin-kernel, ebiederm, shakeelb, Christian Brauner, Minchan Kim, Tim Murray, Daniel Colascione, Jann Horn, open list:MEMORY MANAGEMENT, lsf-pc, LKML, Cc: Android Kernel On Thu, Apr 11, 2019 at 10:11:51PM +0200, Michal Hocko wrote: > On Thu 11-04-19 15:14:30, Joel Fernandes wrote: > > On Thu, Apr 11, 2019 at 08:12:43PM +0200, Michal Hocko wrote: > > > On Thu 11-04-19 12:18:33, Joel Fernandes wrote: > > > > On Thu, Apr 11, 2019 at 6:51 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > > > > > > > On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote: > > > > > [...] > > > > > > Proposed solution uses existing oom-reaper thread to increase memory > > > > > > reclaim rate of a killed process and to make this rate more deterministic. > > > > > > By no means the proposed solution is considered the best and was chosen > > > > > > because it was simple to implement and allowed for test data collection. > > > > > > The downside of this solution is that it requires additional “expedite” > > > > > > hint for something which has to be fast in all cases. Would be great to > > > > > > find a way that does not require additional hints. > > > > > > > > > > I have to say I do not like this much. It is abusing an implementation > > > > > detail of the OOM implementation and makes it an official API. Also > > > > > there are some non trivial assumptions to be fullfilled to use the > > > > > current oom_reaper. First of all all the process groups that share the > > > > > address space have to be killed. How do you want to guarantee/implement > > > > > that with a simply kill to a thread/process group? 
> > > > > > > > Will task_will_free_mem() not bail out in such cases because of > > > > process_shares_mm() returning true? > > > > > > I am not really sure I understand your question. task_will_free_mem is > > > just a shortcut to not kill anything if the current process or a victim > > > is already dying and likely to free memory without killing or spamming > > > the log. My concern is that this patch allows to invoke the reaper > > > > Got it. > > > > > without guaranteeing the same. So it can only be an optimistic attempt > > > and then I am wondering how reasonable of an interface this really is. > > > Userspace send the signal and has no way to find out whether the async > > > reaping has been scheduled or not. > > > > Could you clarify more what you're asking to guarantee? I cannot picture it. > > If you mean guaranteeing that "a task is dying anyway and will free its > > memory on its own", we are calling task_will_free_mem() to check that before > > invoking the oom reaper. > > No, I am talking about the API aspect. Say you kall kill with the flag > to make the async address space tear down. Now you cannot really > guarantee that this is safe to do because the target task might > clone(CLONE_VM) at any time. So this will be known only once the signal > is sent, but the calling process has no way to find out. So the caller > has no way to know what is the actual result of the requested operation. > That is a poor API in my book. > > > Could you clarify what is the draback if OOM reaper is invoked in parallel to > > an exiting task which will free its memory soon? It looks like the OOM reaper > > is taking all the locks necessary (mmap_sem) in particular and is unmapping > > pages. It seemed to me to be safe, but I am missing what are the main draw > > backs of this - other than the intereference with core dump. 
One could be > > presumably scalability since the since OOM reaper could be bottlenecked by > > freeing memory on behalf of potentially several dying tasks. > > oom_reaper or any other kernel thread doing the same is a mere > implementation detail I think. The oom killer doesn't really need the > oom_reaper to act swiftly because it is there to act as a last resort if > the oom victim cannot terminate on its own. If you want to offer an > user space API then you can assume users will like to use it and expect > a certain behavior but what that is? E.g. what if there are thousands of > tasks killed this way? Do we care that some of them will not get the > async treatment? If yes why do we need an API to control that at all? > > Am I more clear now? Yes, your concerns are more clear now. We will think more about this and your other responses, thanks a lot. - Joel ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 0/2] opportunistic memory reclaim of a killed process 2019-04-11 10:51 ` [RFC 0/2] opportunistic memory reclaim of a killed process Michal Hocko 2019-04-11 16:18 ` Joel Fernandes @ 2019-04-11 16:20 ` Sandeep Patil 2019-04-11 16:47 ` Suren Baghdasaryan 2019-04-11 17:19 ` Johannes Weiner 3 siblings, 0 replies; 43+ messages in thread From: Sandeep Patil @ 2019-04-11 16:20 UTC (permalink / raw) To: Michal Hocko Cc: Suren Baghdasaryan, akpm, rientjes, willy, yuzhoujian, jrdr.linux, guro, hannes, penguin-kernel, ebiederm, shakeelb, christian, minchan, timmurray, dancol, joel, jannh, linux-mm, lsf-pc, linux-kernel, kernel-team On Thu, Apr 11, 2019 at 12:51:11PM +0200, Michal Hocko wrote: > On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote: > [...] > > Proposed solution uses existing oom-reaper thread to increase memory > > reclaim rate of a killed process and to make this rate more deterministic. > > By no means the proposed solution is considered the best and was chosen > > because it was simple to implement and allowed for test data collection. > > The downside of this solution is that it requires additional “expedite” > > hint for something which has to be fast in all cases. Would be great to > > find a way that does not require additional hints. > > I have to say I do not like this much. It is abusing an implementation > detail of the OOM implementation and makes it an official API. Also > there are some non trivial assumptions to be fullfilled to use the > current oom_reaper. First of all all the process groups that share the > address space have to be killed. How do you want to guarantee/implement > that with a simply kill to a thread/process group? > > > Other possible approaches include: > > - Implementing a dedicated syscall to perform opportunistic reclaim in the > > context of the process waiting for the victim’s death. 
> > A natural boost > > bonus occurs if the waiting process has high or RT priority and is not > > limited by cpuset cgroup in its CPU choices. > > - Implement a mechanism that would perform opportunistic reclaim if it’s > > possible unconditionally (similar to checks in task_will_free_mem()). > > - Implement opportunistic reclaim that uses shrinker interface, PSI or > > other memory pressure indications as a hint to engage. > > I would question whether we really need this at all? Relying on the exit > speed sounds like a fundamental design problem of anything that relies > on it. OTOH, we want to keep as many processes around as possible for recency, in which case the exit path (particularly memory reclaim) becomes critical to maintaining interactivity on phones. Android keeps processes around because cold starting applications is much slower than simply bringing them up from the background. This obviously presents a problem: when a background application _is_ killed, it is almost always to address a sudden spike in memory needs by something else much more important and user-visible, e.g. a foreground application or a critical system process. > Sure task exit might be slow, but async mm tear down is just a > mere optimization this is not guaranteed to really help in speading > things up. OOM killer uses it as a guarantee for a forward progress in a > finite time rather than as soon as possible. With the OOM killer, things are already really bad. When lmkd[1] kills processes, it is doing so to serve the immediate needs of the system while trying to avoid the OOM killer. - ssp 1] https://android.googlesource.com/platform/system/core/+/refs/heads/master/lmkd/ ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [RFC 0/2] opportunistic memory reclaim of a killed process

From: Suren Baghdasaryan @ 2019-04-11 16:47 UTC
To: Michal Hocko
Cc: Andrew Morton, David Rientjes, Matthew Wilcox, yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa, ebiederm, Shakeel Butt, Christian Brauner, Minchan Kim, Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, LKML, kernel-team

Thanks for the feedback!

On Thu, Apr 11, 2019 at 3:51 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 10-04-19 18:43:51, Suren Baghdasaryan wrote:
> [...]
> > Proposed solution uses existing oom-reaper thread to increase memory
> > reclaim rate of a killed process and to make this rate more deterministic.
> > By no means the proposed solution is considered the best and was chosen
> > because it was simple to implement and allowed for test data collection.
> > The downside of this solution is that it requires an additional "expedite"
> > hint for something which has to be fast in all cases. It would be great to
> > find a way that does not require additional hints.
>
> I have to say I do not like this much. It is abusing an implementation
> detail of the OOM implementation and makes it an official API.

I agree with you that this particular implementation is abusing oom
internal machinery and I don't think it is acceptable as is (hence this
is sent as an RFC). I would like to discuss the viability of the idea of
reaping a kill victim's mm asynchronously. If we agree that this is
worth our time, only then would I love to get into more details on how
to implement it. The implementation in this RFC is a convenient way to
illustrate the idea and to collect test data.

> Also there are some non-trivial assumptions to be fulfilled to use the
> current oom_reaper. First of all, all the process groups that share the
> address space have to be killed. How do you want to guarantee/implement
> that with a simple kill to a thread/process group?
>

I'm not sure I understood this correctly, but if you are asking how we
know that the mm we are reaping is not shared with processes that are
not being killed, then I think your task_will_free_mem() checks for
that. Or have I misunderstood your question?

> > Other possible approaches include:
> > - Implementing a dedicated syscall to perform opportunistic reclaim in the
> >   context of the process waiting for the victim's death. A natural boost
> >   bonus occurs if the waiting process has high or RT priority and is not
> >   limited by cpuset cgroup in its CPU choices.
> > - Implement a mechanism that would perform opportunistic reclaim
> >   unconditionally whenever it is possible (similar to the checks in
> >   task_will_free_mem()).
> > - Implement opportunistic reclaim that uses the shrinker interface, PSI or
> >   other memory pressure indications as a hint to engage.
>
> I would question whether we really need this at all? Relying on the exit
> speed sounds like a fundamental design problem of anything that relies
> on it.

Relying on it is wrong, I agree. There are protections like allocation
throttling that we can fall back on to stop memory depletion. However,
having a way to quickly free up resources that are not needed by a dying
process would help us avoid throttling, which hurts user experience.

I agree that this is an optimization which is beneficial in a specific
case: when we kill to free up resources. However, this is an important
optimization for systems with low memory resources like embedded
systems, phones, etc. The only way to avoid being cornered into
throttling is to increase the free memory margin that the system has to
maintain (I describe this in my cover letter), and with limited overall
memory resources, memory space is at a premium, so we try to decrease
that margin.

I think the other, and arguably even more important, issue than the
speed of memory reclaim is that this speed depends on what the victim is
doing at the time of the kill. This introduces non-determinism in how
fast we can free up resources, and at this point we don't even know how
much safety margin we need.

> Sure, task exit might be slow, but async mm tear-down is just a mere
> optimization; it is not guaranteed to really help in speeding things
> up. The OOM killer uses it as a guarantee for forward progress in a
> finite time rather than as soon as possible.
>
> --
> Michal Hocko
> SUSE Labs
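[Editor's note] Michal's concern above, that the oom_reaper may only touch an mm once every process group sharing that address space is being killed, boils down to a per-mm check like the one task_will_free_mem() performs in mm/oom_kill.c. The sketch below is a userspace model of that decision only; the structure and function names are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model of the mm-sharing check; illustrative names only.
 * The real logic lives in task_will_free_mem() in mm/oom_kill.c. */
struct task_model {
    int mm_id;          /* identifies the address space the task uses */
    bool fatal_signal;  /* SIGKILL pending / task already exiting */
};

/* An mm may be reaped asynchronously only if every task sharing it is
 * already dying; otherwise the reaper would free pages still in use by
 * a live process. */
static bool mm_safe_to_reap(const struct task_model *tasks, size_t n,
                            int mm_id)
{
    for (size_t i = 0; i < n; i++) {
        if (tasks[i].mm_id == mm_id && !tasks[i].fatal_signal)
            return false;  /* a live task still uses this mm */
    }
    return true;
}
```

With two threads sharing an mm and both marked dying, the check passes; a single live sharer (e.g. a CLONE_VM sibling that was not killed) makes reaping unsafe, which is exactly why a plain kill of one thread group is not sufficient.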
* Re: [RFC 0/2] opportunistic memory reclaim of a killed process

From: Michal Hocko @ 2019-04-11 18:19 UTC
To: Suren Baghdasaryan
Cc: Andrew Morton, David Rientjes, Matthew Wilcox, yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa, ebiederm, Shakeel Butt, Christian Brauner, Minchan Kim, Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, LKML, kernel-team

On Thu 11-04-19 09:47:31, Suren Baghdasaryan wrote:
[...]
> > I would question whether we really need this at all? Relying on the exit
> > speed sounds like a fundamental design problem of anything that relies
> > on it.
>
> Relying on it is wrong, I agree. There are protections like allocation
> throttling that we can fall back on to stop memory depletion. However,
> having a way to quickly free up resources that are not needed by a dying
> process would help to avoid throttling, which hurts user experience.

I am not opposing speeding up the exit time in general. That is a good
thing, especially for very large processes (e.g. a DB). But I do not
really think we want to expose an API to control this specific aspect.
--
Michal Hocko
SUSE Labs
* Re: [RFC 0/2] opportunistic memory reclaim of a killed process

From: Suren Baghdasaryan @ 2019-04-11 19:56 UTC
To: Michal Hocko
Cc: Andrew Morton, David Rientjes, Matthew Wilcox, yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Christian Brauner, Minchan Kim, Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, LKML, kernel-team

On Thu, Apr 11, 2019 at 11:19 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 11-04-19 09:47:31, Suren Baghdasaryan wrote:
> [...]
> > > I would question whether we really need this at all? Relying on the exit
> > > speed sounds like a fundamental design problem of anything that relies
> > > on it.
> >
> > Relying on it is wrong, I agree. There are protections like allocation
> > throttling that we can fall back on to stop memory depletion. However,
> > having a way to quickly free up resources that are not needed by a dying
> > process would help to avoid throttling, which hurts user experience.
>
> I am not opposing speeding up the exit time in general. That is a good
> thing, especially for very large processes (e.g. a DB). But I do not
> really think we want to expose an API to control this specific aspect.

Great! Thanks for confirming that the intent is not worthless. There
were a number of ideas floating around, both internally and in patch 2/2
of this patchset, and I would like to get some input on which
implementation would be preferable. From your answer it sounds like you
think it should be a generic feature that does not require any new APIs
or hints from userspace, and that it should be performed for all kills
unconditionally (irrespective of memory pressure, who is waiting for the
victim's death, etc.). Do I understand correctly that this would be the
preferred solution?
* Re: [RFC 0/2] opportunistic memory reclaim of a killed process

From: Michal Hocko @ 2019-04-11 20:17 UTC
To: Suren Baghdasaryan
Cc: Andrew Morton, David Rientjes, Matthew Wilcox, yuzhoujian, Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa, Eric W. Biederman, Shakeel Butt, Christian Brauner, Minchan Kim, Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn, linux-mm, lsf-pc, LKML, kernel-team

On Thu 11-04-19 12:56:32, Suren Baghdasaryan wrote:
> On Thu, Apr 11, 2019 at 11:19 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > I am not opposing speeding up the exit time in general. That is a good
> > thing, especially for very large processes (e.g. a DB). But I do not
> > really think we want to expose an API to control this specific aspect.
>
> Great! Thanks for confirming that the intent is not worthless. There
> were a number of ideas floating around, both internally and in patch
> 2/2 of this patchset, and I would like to get some input on which
> implementation would be preferable. From your answer it sounds like you
> think it should be a generic feature that does not require any new APIs
> or hints from userspace, and that it should be performed for all kills
> unconditionally (irrespective of memory pressure, who is waiting for
> the victim's death, etc.). Do I understand correctly that this would be
> the preferred solution?

Yes, I think the general tear-down solution is much preferable to a
questionable API. What that solution should look like is an open
question; I am not sure myself, to be honest.
--
Michal Hocko
SUSE Labs
* Re: [RFC 0/2] opportunistic memory reclaim of a killed process

From: Johannes Weiner @ 2019-04-11 17:19 UTC
To: Michal Hocko
Cc: Suren Baghdasaryan, akpm, rientjes, willy, yuzhoujian, jrdr.linux, guro, penguin-kernel, ebiederm, shakeelb, christian, minchan, timmurray, dancol, joel, jannh, linux-mm, lsf-pc, linux-kernel, kernel-team

On Thu, Apr 11, 2019 at 12:51:11PM +0200, Michal Hocko wrote:
> I would question whether we really need this at all? Relying on the exit
> speed sounds like a fundamental design problem of anything that relies
> on it. Sure, task exit might be slow, but async mm tear-down is just a
> mere optimization; it is not guaranteed to really help in speeding
> things up. The OOM killer uses it as a guarantee for forward progress
> in a finite time rather than as soon as possible.

I don't think it's flawed; it's just optimizing the user experience as
best it can. You don't want to kill things prematurely, but once there
is pressure you want to rectify it quickly. That's valid.

We have a tool that does this, side effect or not, so I think it's fair
to try to make use of it when OOM killing from userspace (which we
explicitly support with oom_control in cgroup1 and memory.high in
cgroup2; it's not just an Android thing).

The question is how explicit a contract we want to make with userspace,
and I would much prefer not to overpromise on a best-effort thing like
this, or to make the oom reaper part of the ABI. If unconditionally
reaping killed tasks is too expensive, I'd much prefer a simple kill
hint over an explicit task reclaim interface.
* Re: [Lsf-pc] [RFC 0/2] opportunistic memory reclaim of a killed process

From: Rik van Riel @ 2019-04-11 11:51 UTC
To: Suren Baghdasaryan, akpm
Cc: dancol, mhocko, jannh, minchan, penguin-kernel, kernel-team, rientjes, linux-kernel, willy, linux-mm, hannes, shakeelb, jrdr.linux, yuzhoujian, joel, timmurray, lsf-pc, guro, christian, ebiederm

On Wed, 2019-04-10 at 18:43 -0700, Suren Baghdasaryan via Lsf-pc wrote:
> The time to kill a process and free its memory can be critical when the
> killing was done to prevent memory shortages affecting system
> responsiveness.

The OOM killer is fickle, and often takes a fairly long time to trigger.
Speeding up what happens after that seems like the wrong thing to
optimize.

Have you considered using something like oomd to proactively kill tasks
when memory gets low, so you do not have to wait for an OOM kill?
--
All Rights Reversed.
* Re: [Lsf-pc] [RFC 0/2] opportunistic memory reclaim of a killed process

From: Michal Hocko @ 2019-04-11 12:16 UTC
To: Rik van Riel
Cc: Suren Baghdasaryan, akpm, dancol, jannh, minchan, penguin-kernel, kernel-team, rientjes, linux-kernel, willy, linux-mm, hannes, shakeelb, jrdr.linux, yuzhoujian, joel, timmurray, lsf-pc, guro, christian, ebiederm

On Thu 11-04-19 07:51:21, Rik van Riel wrote:
> On Wed, 2019-04-10 at 18:43 -0700, Suren Baghdasaryan via Lsf-pc wrote:
> > The time to kill a process and free its memory can be critical when
> > the killing was done to prevent memory shortages affecting system
> > responsiveness.
>
> The OOM killer is fickle, and often takes a fairly long time to
> trigger. Speeding up what happens after that seems like the wrong
> thing to optimize.
>
> Have you considered using something like oomd to proactively kill
> tasks when memory gets low, so you do not have to wait for an OOM
> kill?

AFAIU, this is the point here. They probably have a userspace OOM killer
implementation and want killing to be as swift as possible.
--
Michal Hocko
SUSE Labs
* Re: [Lsf-pc] [RFC 0/2] opportunistic memory reclaim of a killed process

From: Suren Baghdasaryan @ 2019-04-11 16:54 UTC
To: Michal Hocko
Cc: Rik van Riel, Andrew Morton, Daniel Colascione, Jann Horn, Minchan Kim, Tetsuo Handa, kernel-team, David Rientjes, LKML, Matthew Wilcox, linux-mm, Johannes Weiner, Shakeel Butt, Souptick Joarder, yuzhoujian, Joel Fernandes, Tim Murray, lsf-pc, Roman Gushchin, Christian Brauner, ebiederm

On Thu, Apr 11, 2019 at 5:16 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 11-04-19 07:51:21, Rik van Riel wrote:
> > On Wed, 2019-04-10 at 18:43 -0700, Suren Baghdasaryan via Lsf-pc wrote:
> > > The time to kill a process and free its memory can be critical when
> > > the killing was done to prevent memory shortages affecting system
> > > responsiveness.
> >
> > The OOM killer is fickle, and often takes a fairly long time to
> > trigger. Speeding up what happens after that seems like the wrong
> > thing to optimize.
> >
> > Have you considered using something like oomd to proactively kill
> > tasks when memory gets low, so you do not have to wait for an OOM
> > kill?
>
> AFAIU, this is the point here. They probably have a userspace OOM
> killer implementation and want killing to be as swift as possible.

That is correct. Android has a userspace daemon called lmkd (low memory
killer daemon) to respond to memory pressure before things get bad
enough for the kernel oom-killer to get involved, so this asynchronous
reclaim optimization would allow lmkd to do its job more efficiently.
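[Editor's note] A userspace killer like lmkd needs a pressure signal to decide when to kill, and on recent kernels one such signal is PSI (/proc/pressure/memory). The helper below only illustrates the record format such a daemon has to parse; it is not lmkd's actual code, and the threshold policy mentioned afterwards is an assumption:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse one PSI record of the form found in /proc/pressure/memory:
 *   some avg10=0.12 avg60=0.34 avg300=0.00 total=123456
 * Returns 0 on success, filling in the line type ("some"/"full") and
 * the 10-second pressure average; -1 if the line does not match. */
static int parse_psi_line(const char *line, char type[8], double *avg10)
{
    double avg60, avg300;
    unsigned long long total;

    if (sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
               type, avg10, &avg60, &avg300, &total) != 5)
        return -1;
    return 0;
}
```

A monitor loop would read /proc/pressure/memory (or register a PSI trigger and poll() it) and initiate a kill when, say, the "full" avg10 value exceeds a chosen threshold.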
Thread overview: 43+ messages (newest: ~2019-04-25 21:56 UTC)

2019-04-11  1:43 [RFC 0/2] opportunistic memory reclaim of a killed process Suren Baghdasaryan
2019-04-11  1:43 ` [RFC 1/2] mm: oom: expose expedite_reclaim to use oom_reaper outside of oom_kill.c Suren Baghdasaryan
2019-04-25 21:12 ` Tetsuo Handa
2019-04-25 21:56 ` Suren Baghdasaryan
2019-04-11  1:43 ` [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing Suren Baghdasaryan
2019-04-11 10:30 ` Christian Brauner
2019-04-11 10:34 ` Christian Brauner
2019-04-11 15:18 ` Suren Baghdasaryan
2019-04-11 15:23 ` Suren Baghdasaryan
2019-04-11 16:25 ` Daniel Colascione
2019-04-11 15:33 ` Matthew Wilcox
2019-04-11 17:05 ` Johannes Weiner
2019-04-11 17:09 ` Suren Baghdasaryan
2019-04-11 17:33 ` Daniel Colascione
2019-04-11 17:36 ` Matthew Wilcox
2019-04-11 17:47 ` Daniel Colascione
2019-04-12  6:49 ` Michal Hocko
2019-04-12 14:15 ` Suren Baghdasaryan
2019-04-12 14:20 ` Daniel Colascione
2019-04-12 21:03 ` Matthew Wilcox
2019-04-11 17:52 ` Suren Baghdasaryan
2019-04-11 21:45 ` Roman Gushchin
2019-04-11 21:59 ` Suren Baghdasaryan
2019-04-12  6:53 ` Michal Hocko
2019-04-12 14:10 ` Suren Baghdasaryan
2019-04-12 14:14 ` Daniel Colascione
2019-04-12 15:30 ` Daniel Colascione
2019-04-25 16:09 ` Suren Baghdasaryan
2019-04-11 10:51 ` [RFC 0/2] opportunistic memory reclaim of a killed process Michal Hocko
2019-04-11 16:18 ` Joel Fernandes
2019-04-11 18:12 ` Michal Hocko
2019-04-11 19:14 ` Joel Fernandes
2019-04-11 20:11 ` Michal Hocko
2019-04-11 21:11 ` Joel Fernandes
2019-04-11 16:20 ` Sandeep Patil
2019-04-11 16:47 ` Suren Baghdasaryan
2019-04-11 18:19 ` Michal Hocko
2019-04-11 19:56 ` Suren Baghdasaryan
2019-04-11 20:17 ` Michal Hocko
2019-04-11 17:19 ` Johannes Weiner
2019-04-11 11:51 ` [Lsf-pc] [RFC 0/2] opportunistic memory reclaim of a killed process Rik van Riel
2019-04-11 12:16 ` Michal Hocko
2019-04-11 16:54 ` Suren Baghdasaryan