* [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall)
From: Shawn Landden @ 2017-11-01  5:32 UTC
Cc: linux-kernel, linux-fsdevel, linux-mm, Shawn Landden

It is common for services to be stateless around their main event loop.
If a process passes the EPOLL_KILLME flag to epoll_wait5() then it
signals to the kernel that epoll_wait5() may not complete, and the kernel
may send SIGKILL if resources get tight.

See my systemd patch: https://github.com/shawnl/systemd/tree/killme

Android uses this memory model for all programs, and having it in the
kernel will enable integration with the page cache (not in this
series).
---
 arch/x86/entry/syscalls/syscall_32.tbl |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 fs/eventpoll.c                         | 74 +++++++++++++++++++++++++++++++++-
 include/linux/eventpoll.h              |  2 +
 include/linux/sched.h                  |  3 ++
 include/uapi/asm-generic/unistd.h      |  5 ++-
 include/uapi/linux/eventpoll.h         |  3 ++
 kernel/exit.c                          |  2 +
 mm/oom_kill.c                          | 17 ++++++++
 9 files changed, 105 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 448ac2161112..040e5d02bdcc 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -391,3 +391,4 @@
 382	i386	pkey_free		sys_pkey_free
 383	i386	statx			sys_statx
 384	i386	arch_prctl		sys_arch_prctl		compat_sys_arch_prctl
+385	i386	epoll_wait5		sys_epoll_wait5
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5aef183e2f85..c72802e8cf65 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -339,6 +339,7 @@
 330	common	pkey_alloc		sys_pkey_alloc
 331	common	pkey_free		sys_pkey_free
 332	common	statx			sys_statx
+333	common	epoll_wait5		sys_epoll_wait5
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 2fabd19cdeea..76d1c91d940b 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -297,6 +297,14 @@ static LIST_HEAD(visited_list);
  */
 static LIST_HEAD(tfile_check_list);
 
+static LIST_HEAD(deathrow_q);
+static long deathrow_len __read_mostly;
+
+/* TODO: Can this lock be removed by using atomic instructions to update
+ * queue?
+ */
+static DEFINE_MUTEX(deathrow_mutex);
+
 #ifdef CONFIG_SYSCTL
 
 #include <linux/sysctl.h>
@@ -314,6 +322,15 @@ struct ctl_table epoll_table[] = {
 		.extra1		= &zero,
 		.extra2		= &long_max,
 	},
+	{
+		.procname	= "deathrow_size",
+		.data		= &deathrow_len,
+		.maxlen		= sizeof(deathrow_len),
+		.mode		= 0444,
+		.proc_handler	= proc_doulongvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &long_max,
+	},
 	{ }
 };
 #endif /* CONFIG_SYSCTL */
@@ -2164,9 +2181,12 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 /*
  * Implement the event wait interface for the eventpoll file. It is the kernel
  * part of the user space epoll_wait(2).
+ *
+ * A flags argument cannot be added to epoll_pwait because it already has
+ * the maximum number of arguments (6). Can this be fixed?
  */
-SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
-		int, maxevents, int, timeout)
+SYSCALL_DEFINE5(epoll_wait5, int, epfd, struct epoll_event __user *, events,
+		int, maxevents, int, timeout, int, flags)
 {
 	int error;
 	struct fd f;
@@ -2199,14 +2219,44 @@ SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
 	 */
 	ep = f.file->private_data;
 
+	/* Check the EPOLL_* constants for conflicts. */
+	BUILD_BUG_ON(EPOLL_KILLME == EPOLL_CLOEXEC);
+
+	if (flags & ~EPOLL_KILLME)
+		return -EINVAL;
+
+	if (flags & EPOLL_KILLME) {
+		/* Put process on death row. */
+		mutex_lock(&deathrow_mutex);
+		deathrow_len++;
+		list_add(&current->se.deathrow, &deathrow_q);
+		current->se.on_deathrow = 1;
+		mutex_unlock(&deathrow_mutex);
+	}
+
 	/* Time to fish for events ... */
 	error = ep_poll(ep, events, maxevents, timeout);
 
+	if (flags & EPOLL_KILLME) {
+		/* Remove process from death row. */
+		mutex_lock(&deathrow_mutex);
+		current->se.on_deathrow = 0;
+		list_del(&current->se.deathrow);
+		deathrow_len--;
+		mutex_unlock(&deathrow_mutex);
+	}
+
 error_fput:
 	fdput(f);
 	return error;
 }
 
+SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
+		int, maxevents, int, timeout)
+{
+	return sys_epoll_wait5(epfd, events, maxevents, timeout, 0);
+}
+
 /*
  * Implement the event wait interface for the eventpoll file. It is the kernel
  * part of the user space epoll_pwait(2).
@@ -2297,6 +2347,26 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
 }
 #endif
 
+/* Clean up after an EPOLL_KILLME process quits.
+ * Called by kernel/exit.c.
+ */
+int exit_killme(void)
+{
+	if (current->se.on_deathrow) {
+		mutex_lock(&deathrow_mutex);
+		current->se.on_deathrow = 0;
+		list_del(&current->se.deathrow);
+		mutex_unlock(&deathrow_mutex);
+	}
+
+	return 0;
+}
+
+struct list_head *eventpoll_deathrow_list(void)
+{
+	return &deathrow_q;
+}
+
 static int __init eventpoll_init(void)
 {
 	struct sysinfo si;
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index 2f14ac73d01d..f1e28d468de5 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -20,6 +20,8 @@
 
 /* Forward declarations to avoid compiler errors */
 struct file;
+int exit_killme(void);
+struct list_head *eventpoll_deathrow_list(void);
 
 #ifdef CONFIG_EPOLL
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 26a7df4e558c..66462bf27a29 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -380,6 +380,9 @@ struct sched_entity {
 	struct list_head		group_node;
 	unsigned int			on_rq;
 
+	unsigned			on_deathrow:1;
+	struct list_head		deathrow;
+
 	u64				exec_start;
 	u64				sum_exec_runtime;
 	u64				vruntime;
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 061185a5eb51..843553a39388 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -893,8 +893,11 @@ __SYSCALL(__NR_fork, sys_fork)
 __SYSCALL(__NR_fork, sys_ni_syscall)
 #endif /* CONFIG_MMU */
 
+#define __NR_epoll_wait5 1080
+__SYSCALL(__NR_epoll_wait5, sys_epoll_wait5)
+
 #undef __NR_syscalls
-#define __NR_syscalls (__NR_fork+1)
+#define __NR_syscalls (__NR_fork+2)
 
 #endif /* __ARCH_WANT_SYSCALL_DEPRECATED */
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index f4d5c998cc2b..ce150a3e7248 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -21,6 +21,9 @@
 /* Flags for epoll_create1. */
 #define EPOLL_CLOEXEC O_CLOEXEC
 
+/* Flags for epoll_wait5. */
+#define EPOLL_KILLME 0x00000001
+
 /* Valid opcodes to issue to sys_epoll_ctl() */
 #define EPOLL_CTL_ADD 1
 #define EPOLL_CTL_DEL 2
diff --git a/kernel/exit.c b/kernel/exit.c
index f6cad39f35df..cd089bdc5b17 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -62,6 +62,7 @@
 #include <linux/random.h>
 #include <linux/rcuwait.h>
 #include <linux/compat.h>
+#include <linux/eventpoll.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -917,6 +918,7 @@ void __noreturn do_exit(long code)
 		__this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied);
 	exit_rcu();
 	exit_tasks_rcu_finish();
+	exit_killme();
 
 	lockdep_free_task(tsk);
 	do_task_dead();
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dee0f75c3013..d6252772d593 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -41,6 +41,7 @@
 #include <linux/kthread.h>
 #include <linux/init.h>
 #include <linux/mmu_notifier.h>
+#include <linux/eventpoll.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -1029,6 +1030,22 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
+	/*
+	 * Check death row.
+	 */
+	if (!list_empty(eventpoll_deathrow_list())) {
+		struct list_head *l = eventpoll_deathrow_list();
+		struct task_struct *ts = list_first_entry(l,
+				struct task_struct, se.deathrow);
+
+		pr_debug("Killing pid %u from EPOLL_KILLME death row.",
+				ts->pid);
+
+		/* We use SIGKILL so as to cleanly interrupt ep_poll() */
+		kill_pid(task_pid(ts), SIGKILL, 1);
+		return true;
+	}
+
 	/*
 	 * The OOM killer does not compensate for IO-less reclaim.
 	 * pagefault_out_of_memory lost its gfp context so we have to
-- 
2.15.0.rc2
* Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall)
From: Matthew Wilcox @ 2017-11-01 14:04 UTC
To: Shawn Landden; Cc: linux-kernel, linux-fsdevel, linux-mm

On Tue, Oct 31, 2017 at 10:32:44PM -0700, Shawn Landden wrote:
> It is common for services to be stateless around their main event loop.
> If a process passes the EPOLL_KILLME flag to epoll_wait5() then it
> signals to the kernel that epoll_wait5() may not complete, and the kernel
> may send SIGKILL if resources get tight.
>
> See my systemd patch: https://github.com/shawnl/systemd/tree/killme
>
> Android uses this memory model for all programs, and having it in the
> kernel will enable integration with the page cache (not in this
> series).

I'm not taking a position on whether this is a good feature to have, but
your implementation could do with some improvement.

> +static LIST_HEAD(deathrow_q);
> +static long deathrow_len __read_mostly;

In what sense is this __read_mostly when it's modified by every call that
has EPOLL_KILLME set?  Also, why do you think this is a useful statistic
to gather in the kernel and expose to userspace?

> +/* TODO: Can this lock be removed by using atomic instructions to update
> + * queue?
> + */
> +static DEFINE_MUTEX(deathrow_mutex);

This doesn't need to be a mutex; you don't do anything that sleeps while
holding it.  It should be a spinlock instead (but see below).

> @@ -380,6 +380,9 @@ struct sched_entity {
>  	struct list_head		group_node;
>  	unsigned int			on_rq;
>
> +	unsigned			on_deathrow:1;
> +	struct list_head		deathrow;
> +
>  	u64				exec_start;
>  	u64				sum_exec_runtime;
>  	u64				vruntime;

You're adding an extra 16 bytes to each task to implement this feature.
I don't like that, and I think you can avoid it.  Turn 'deathrow' into a
wait_queue_head_t.  Declare the wait_queue_entry on the stack.

While you're at it, I don't think 'deathrow' is an epoll concept.  I think
it's an OOM killer concept which happens to be only accessible through
epoll today (but we could consider allowing other system calls to place
tasks on it in the future).  So the central place for all this is in
oom_kill.c, and epoll only calls into it.  Maybe we have 'deathrow_enroll()'
and 'deathrow_remove()' APIs in the OOM killer.  And I don't like the name
'deathrow'.  How about oom_target?
* Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall)
From: Colin Walters @ 2017-11-01 15:16 UTC
To: Shawn Landden; Cc: linux-kernel, linux-fsdevel, linux-mm

On Wed, Nov 1, 2017, at 01:32 AM, Shawn Landden wrote:
> It is common for services to be stateless around their main event loop.
> If a process passes the EPOLL_KILLME flag to epoll_wait5() then it
> signals to the kernel that epoll_wait5() may not complete, and the kernel
> may send SIGKILL if resources get tight.

I've thought about something like this in the past too and would love to
see it land.  Bigger picture, this also comes up in (server) container
environments; see e.g.:
https://docs.openshift.com/container-platform/3.3/admin_guide/idling_applications.html

There's going to be a long slog getting apps to actually make use of this,
but I suspect that if it gets wrapped up nicely in some "framework"
libraries for C/C++, and bound in language ecosystems like Go, we could
see a fair amount of adoption on the order of a year or two.

However, while I understand why it feels natural to tie this to epoll:
as the maintainer of glib2, which is used by a *lot* of things, I'm not
sure we're going to port to epoll anytime soon.  Why not just make this a
prctl()?  It's not like it's really any less racy to do:

prctl(PR_SET_IDLE)
epoll()

and this also allows:

prctl(PR_SET_IDLE)
poll()

And as this is most often just going to be an optional hint, it's easier
to e.g. just ignore EINVAL from the prctl().
* Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall) 2017-11-01 15:16 ` Colin Walters @ 2017-11-01 15:22 ` Colin Walters -1 siblings, 0 replies; 58+ messages in thread From: Colin Walters @ 2017-11-01 15:22 UTC (permalink / raw) To: Shawn Landden; +Cc: linux-kernel, linux-fsdevel, linux-mm On Wed, Nov 1, 2017, at 11:16 AM, Colin Walters wrote: > > as the maintainer of glib2 which is used by a *lot* of things; I'm not (I meant to say "a" maintainer) Also, while I'm not an expert in Android, I think the "what to kill" logic there lives in userspace, right? So it feels like we should expose this state in e.g. /proc and allow userspace daemons (e.g. systemd, kubelet) to perform idle collection too, even if the system isn't actually low on resources from the kernel's perspective. And doing that requires some sort of kill(pid, SIGKILL_IF_IDLE) or so? ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall) 2017-11-01 15:22 ` Colin Walters (?) @ 2017-11-03 9:22 ` peter enderborg -1 siblings, 0 replies; 58+ messages in thread From: peter enderborg @ 2017-11-03 9:22 UTC (permalink / raw) To: Colin Walters, Shawn Landden; +Cc: linux-kernel, linux-fsdevel, linux-mm On 11/01/2017 04:22 PM, Colin Walters wrote: > > On Wed, Nov 1, 2017, at 11:16 AM, Colin Walters wrote: >> as the maintainer of glib2 which is used by a *lot* of things; I'm not > (I meant to say "a" maintainer) > > Also, while I'm not an expert in Android, I think the "what to kill" logic > there lives in userspace, right? So it feels like we should expose this > state in e.g. /proc and allow userspace daemons (e.g. systemd, kubelet) to perform > idle collection too, even if the system isn't actually low on resources > from the kernel's perspective. > > And doing that requires some sort of kill(pid, SIGKILL_IF_IDLE) or so? > You are right: in Android it is the activity manager that performs these tasks. And if a service dies without talking to the activity manager, the service is restarted, unless it is at the highest oom score. Another problem is that a lot of communication in Android goes over binder, not epoll. And a signal that cannot be caught is not that good. But a "warn" signal of userspace's choice, delivered in a context similar to ulimit's SIGXFSZ/SIGXCPU, which the process could pick up and use to notify the activity manager, might work. However, in Android this is already solved with onTrimMemory, a message sent from the activity manager to applications and services when the system needs memory back. ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall) 2017-11-01 15:16 ` Colin Walters (?) (?) @ 2017-11-01 19:02 ` Shawn Landden 2017-11-01 19:37 ` Colin Walters -1 siblings, 1 reply; 58+ messages in thread From: Shawn Landden @ 2017-11-01 19:02 UTC (permalink / raw) To: Colin Walters; +Cc: linux-kernel, linux-fsdevel, linux-mm [-- Attachment #1: Type: text/plain, Size: 1642 bytes --] On Wed, Nov 1, 2017 at 8:16 AM, Colin Walters <walters@verbum.org> wrote: > > > On Wed, Nov 1, 2017, at 01:32 AM, Shawn Landden wrote: > > It is common for services to be stateless around their main event loop. > > If a process passes the EPOLL_KILLME flag to epoll_wait5() then it > > signals to the kernel that epoll_wait5() may not complete, and the kernel > > may send SIGKILL if resources get tight. > > > > I've thought about something like this in the past too and would love > to see it land. Bigger picture, this also comes up in (server) container > environments, see e.g.: > > https://docs.openshift.com/container-platform/3.3/admin_ > guide/idling_applications.html > > There's going to be a long slog getting apps to actually make use > of this, but I suspect if it gets wrapped up nicely in some "framework" > libraries for C/C++, and be bound in the language ecosystems like golang > we could see a fair amount of adoption on the order of a year or two. > > However, while I understand why it feels natural to tie this to epoll, > as the maintainer of glib2 which is used by a *lot* of things; I'm not > sure we're going to port to epoll anytime soon. > > Why not just make this a prctl()? It's not like it's really any less racy > to do: > > prctl(PR_SET_IDLE) > epoll() > > and this also allows: > > prctl(PR_SET_IDLE) > poll() > > And as this is most often just going to be an optional hint it's easier to > e.g. just ignore EINVAL > from the prctl(). > This solves the fact that epoll_pwait() already is a 6 argument (maximum allowed) syscall. 
But what if the process has multiple epoll() instances in multiple threads? [-- Attachment #2: Type: text/html, Size: 2277 bytes --] ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall) 2017-11-01 19:02 ` Shawn Landden @ 2017-11-01 19:37 ` Colin Walters 0 siblings, 0 replies; 58+ messages in thread From: Colin Walters @ 2017-11-01 19:37 UTC (permalink / raw) To: Shawn Landden; +Cc: linux-kernel, linux-fsdevel, linux-mm On Wed, Nov 1, 2017, at 03:02 PM, Shawn Landden wrote: > > This solves the fact that epoll_pwait() already is a 6 argument (maximum allowed) syscall. But what if the process has multiple epoll() instances in multiple threads? Well, that's a subset of the general question of - what is the interaction of this system call and threading? It looks like you've prototyped this out in userspace with systemd, but from a quick glance at the current git, systemd's threading is limited doing sync()/fsync() and gethostbyname() async. But languages with a GC tend to at least use a background thread for that, and of course lots of modern userspace makes heavy use of multithreading (or variants like goroutines). A common pattern though is to have a "main thread" that acts as a control point and runs the mainloop (particularly for anything with a GUI). That's going to be the thing calling prctl(SET_IDLE) - but I think its idle state should implicitly affect the whole process, since for a lot of apps those other threads are going to just be "background". It'd probably then be an error to use prctl(SET_IDLE) in more than one thread ever? (Although that might break in golang due to the way goroutines can be migrated across threads) That'd probably be a good "generality test" - what would it take to have this system call be used for a simple golang webserver app that's e.g. socket activated by systemd, or a Kubernetes service? Or another really interesting case would be qemu; make it easy to flag VMs as always having this state (most of my testing VMs are like this; it's OK if they get destroyed, I just reinitialize them from the gold state). 
Going back to threading - a tricky thing we should handle in general is when userspace libraries create threads that are unknown to the app; the "async gethostbyname()" is a good example. To be conservative we'd likely need to "fail non-idle", but figure out some way tell the kernel for e.g. GC threads that they're still idle. ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall) 2017-11-01 19:37 ` Colin Walters (?) @ 2017-11-01 19:43 ` Shawn Landden -1 siblings, 0 replies; 58+ messages in thread From: Shawn Landden @ 2017-11-01 19:43 UTC (permalink / raw) To: Colin Walters; +Cc: linux-kernel, linux-fsdevel, linux-mm [-- Attachment #1: Type: text/plain, Size: 2427 bytes --] On Wed, Nov 1, 2017 at 12:37 PM, Colin Walters <walters@verbum.org> wrote: > On Wed, Nov 1, 2017, at 03:02 PM, Shawn Landden wrote: > > > > This solves the fact that epoll_pwait() already is a 6 argument (maximum > allowed) syscall. But what if the process has multiple epoll() instances in > multiple threads? > > Well, that's a subset of the general question of - what is the interaction > of this system call and threading? It looks like you've prototyped this > out in userspace with systemd, but from a quick glance at the current git, > systemd's threading is limited doing sync()/fsync() and gethostbyname() > async. > > But languages with a GC tend to at least use a background thread for that, > and of course lots of modern userspace makes heavy use of multithreading > (or variants like goroutines). > > A common pattern though is to have a "main thread" that acts as a control > point and runs the mainloop (particularly for anything with a GUI). > That's > going to be the thing calling prctl(SET_IDLE) - but I think its idle state > should implicitly > affect the whole process, since for a lot of apps those other threads are > going to > just be "background". > > It'd probably then be an error to use prctl(SET_IDLE) in more than one > thread > ever? (Although that might break in golang due to the way goroutines can > be migrated across threads) > > That'd probably be a good "generality test" - what would it take to have > this system call be used for a simple golang webserver app that's e.g. > socket activated by systemd, or a Kubernetes service? 
Or another > really interesting case would be qemu; make it easy to flag VMs as always > having this state (most of my testing VMs are like this; it's OK if they > get > destroyed, I just reinitialize them from the gold state). > I think just setting it globally will work for 99.99% of cases, where there is only one event loop, but I'd like to handle 100% of cases. Unfortunately, epoll_pwait() is one of those cases, and that only will work through a prctl() because of limited support for 7 arguments. > > Going back to threading - a tricky thing we should handle in general > is when userspace libraries create threads that are unknown to the app; > the "async gethostbyname()" is a good example. To be conservative we'd > likely need to "fail non-idle", but figure out some way tell the kernel > for e.g. GC threads that they're still idle. > [-- Attachment #2: Type: text/html, Size: 3067 bytes --] ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall) 2017-11-01 19:37 ` Colin Walters (?) (?) @ 2017-11-01 20:54 ` Shawn Landden -1 siblings, 0 replies; 58+ messages in thread From: Shawn Landden @ 2017-11-01 20:54 UTC (permalink / raw) To: Colin Walters; +Cc: linux-kernel, linux-fsdevel, linux-mm [-- Attachment #1: Type: text/plain, Size: 2376 bytes --] On Wed, Nov 1, 2017 at 12:37 PM, Colin Walters <walters@verbum.org> wrote: > On Wed, Nov 1, 2017, at 03:02 PM, Shawn Landden wrote: > > > > This solves the fact that epoll_pwait() already is a 6 argument (maximum > allowed) syscall. But what if the process has multiple epoll() instances in > multiple threads? > > Well, that's a subset of the general question of - what is the interaction > of this system call and threading? It looks like you've prototyped this > out in userspace with systemd, but from a quick glance at the current git, > systemd's threading is limited doing sync()/fsync() and gethostbyname() > async. > > But languages with a GC tend to at least use a background thread for that, > and of course lots of modern userspace makes heavy use of multithreading > (or variants like goroutines). > > A common pattern though is to have a "main thread" that acts as a control > point and runs the mainloop (particularly for anything with a GUI). > That's > going to be the thing calling prctl(SET_IDLE) - but I think its idle state > should implicitly > affect the whole process, since for a lot of apps those other threads are > going to > just be "background". > > It'd probably then be an error to use prctl(SET_IDLE) in more than one > thread > ever? (Although that might break in golang due to the way goroutines can > be migrated across threads) > > That'd probably be a good "generality test" - what would it take to have > this system call be used for a simple golang webserver app that's e.g. > socket activated by systemd, or a Kubernetes service? 
Or another > really interesting case would be qemu; make it easy to flag VMs as always > having this state (most of my testing VMs are like this; it's OK if they > get > destroyed, I just reinitialize them from the gold state). > > Going back to threading - a tricky thing we should handle in general > is when userspace libraries create threads that are unknown to the app; > the "async gethostbyname()" is a good example. To be conservative we'd > likely need to "fail non-idle", but figure out some way tell the kernel > for e.g. GC threads that they're still idle. > prctl() still seems like it wouldn't work with threads. How about fcntl(F_SETFD, FD_KILLME)? Attaching it only to epoll fds would be my preference, but allowing it to be attached to all fds would allow poll() and select() to work. [-- Attachment #2: Type: text/html, Size: 2973 bytes --] ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall) 2017-11-01 19:37 ` Colin Walters @ 2017-11-02 15:24 ` Shawn Paul Landden -1 siblings, 0 replies; 58+ messages in thread From: Shawn Paul Landden @ 2017-11-02 15:24 UTC (permalink / raw) To: Colin Walters; +Cc: linux-kernel, linux-fsdevel, linux-mm On Wed, 2017-11-01 at 15:37 -0400, Colin Walters wrote: > threading is limited doing sync()/fsync() and gethostbyname() async. > > But languages with a GC tend to at least use a background thread for > that, > and of course lots of modern userspace makes heavy use of > multithreading > (or variants like goroutines). > > A common pattern though is to have a "main thread" that acts as a > control > point and runs the mainloop (particularly for anything with a GUI). > That's > going to be the thing calling prctl(SET_IDLE) - but I think its idle > state should implicitly > affect the whole process, since for a lot of apps those other threads > are going to > just be "background". > > It'd probably then be an error to use prctl(SET_IDLE) in more than > one thread > ever? (Although that might break in golang due to the way goroutines > can > be migrated across threads) > > That'd probably be a good "generality test" - what would it take to > have > this system call be used for a simple golang webserver app that's > e.g. > socket activated by systemd, or a Kubernetes service? Or another > really interesting case would be qemu; make it easy to flag VMs as > always > having this state (most of my testing VMs are like this; it's OK if > they get > destroyed, I just reinitialize them from the gold state). > > Going back to threading - a tricky thing we should handle in general > is when userspace libraries create threads that are unknown to the > app; > the "async gethostbyname()" is a good example. To be conservative > we'd > likely need to "fail non-idle", but figure out some way tell the > kernel > for e.g. 
GC threads that they're still idle. I realize none of this is a problem, because when prctl(PR_SET_IDLE, PR_IDLE_MODE_KILLME) is set the *entire* process has declared itself stateless and ready to be killed. ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall) 2017-11-01 5:32 ` Shawn Landden @ 2017-11-01 22:10 ` Tetsuo Handa -1 siblings, 0 replies; 58+ messages in thread From: Tetsuo Handa @ 2017-11-01 22:10 UTC (permalink / raw) To: Shawn Landden; +Cc: linux-kernel, linux-fsdevel, linux-mm On 2017/11/01 14:32, Shawn Landden wrote: > @@ -1029,6 +1030,22 @@ bool out_of_memory(struct oom_control *oc) > return true; > } > > + /* > + * Check death row. > + */ > + if (!list_empty(eventpoll_deathrow_list())) { > + struct list_head *l = eventpoll_deathrow_list(); Unsafe traversal. List can become empty at this moment. > + struct task_struct *ts = list_first_entry(l, > + struct task_struct, se.deathrow); > + > + pr_debug("Killing pid %u from EPOLL_KILLME death row.", > + ts->pid); > + > + /* We use SIGKILL so as to cleanly interrupt ep_poll() */ > + kill_pid(task_pid(ts), SIGKILL, 1); send_sig() ? > + return true; > + } > + > /* > * The OOM killer does not compensate for IO-less reclaim. > * pagefault_out_of_memory lost its gfp context so we have to > And why is static int oom_fd = open("/proc/self/oom_score_adj", O_WRONLY); and then toggling between write(fd, "1000", 4); and write(fd, "0", 1); not sufficient? Adding prctl() that do this might be handy though. ^ permalink raw reply [flat|nested] 58+ messages in thread
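The existing-interface alternative suggested above can be sketched as follows: no new syscall, just toggle /proc/self/oom_score_adj between 1000 (preferred OOM victim) while parked in the event loop and 0 while doing real work. This is a minimal sketch assuming Linux /proc and a process that started at oom_score_adj 0; moving the value back down is permitted for an unprivileged task as long as it never goes below the lowest value the task has had (its oom_score_adj_min), so 0 -> 1000 -> 0 needs no capabilities.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write a value such as "1000" or "0" to this process's oom_score_adj.
 * Returns 0 on success, -1 if /proc is unavailable or the write fails
 * (e.g. lowering below oom_score_adj_min without CAP_SYS_RESOURCE). */
static int set_oom_score_adj(const char *val)
{
    int fd = open("/proc/self/oom_score_adj", O_WRONLY);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = write(fd, val, strlen(val));
    close(fd);
    return n == (ssize_t)strlen(val) ? 0 : -1;
}
```

An event loop would call `set_oom_score_adj("1000")` immediately before blocking in `epoll_wait()` and `set_oom_score_adj("0")` on wakeup, which is the two-write toggle being proposed here; the remaining objection in the thread is that this conveys "please kill me first", not "I am provably stateless".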
* Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall)
  2017-11-01 22:10 ` Tetsuo Handa
@ 2017-11-02  7:36 ` Shawn Landden
  0 siblings, 0 replies; 58+ messages in thread
From: Shawn Landden @ 2017-11-02 7:36 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: linux-kernel, linux-fsdevel, linux-mm

On Wed, Nov 1, 2017 at 3:10 PM, Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:
> On 2017/11/01 14:32, Shawn Landden wrote:
> > @@ -1029,6 +1030,22 @@ bool out_of_memory(struct oom_control *oc)
> > 		return true;
> > 	}
> >
> > +	/*
> > +	 * Check death row.
> > +	 */
> > +	if (!list_empty(eventpoll_deathrow_list())) {
> > +		struct list_head *l = eventpoll_deathrow_list();
>
> Unsafe traversal. The list can become empty at this moment.
>
> > +		struct task_struct *ts = list_first_entry(l,
> > +			struct task_struct, se.deathrow);
> > +
> > +		pr_debug("Killing pid %u from EPOLL_KILLME death row.",
> > +			ts->pid);
> > +
> > +		/* We use SIGKILL so as to cleanly interrupt ep_poll() */
> > +		kill_pid(task_pid(ts), SIGKILL, 1);
>
> send_sig() ?
>
> > +		return true;
> > +	}
> > +
> > 	/*
> > 	 * The OOM killer does not compensate for IO-less reclaim.
> > 	 * pagefault_out_of_memory lost its gfp context so we have to
> >
>
> And why is
>
> 	static int oom_fd = open("/proc/self/oom_score_adj", O_WRONLY);
>
> and then toggling between
>
> 	write(oom_fd, "1000", 4);
>
> and
>
> 	write(oom_fd, "0", 1);
>
> not sufficient? Adding a prctl() that does this might be handy though.

I want to do special process accounting. Also, in Android this type of memory management is mandatory, and to enforce that, other processes would have to make delivery of their messages (such as a wake-up for user input) contingent on the recipient having set this. An oom_score_adj of 1000 could be taught all of this special handling, however.

^ permalink raw reply	[flat|nested] 58+ messages in thread
* Re: [RFC] EPOLL_KILLME: New flag to epoll_wait() that subscribes process to death row (new syscall)
  2017-11-01  5:32 ` Shawn Landden
@ 2017-11-02 15:45 ` Michal Hocko
  0 siblings, 0 replies; 58+ messages in thread
From: Michal Hocko @ 2017-11-02 15:45 UTC (permalink / raw)
To: Shawn Landden; +Cc: linux-kernel, linux-fsdevel, linux-mm

[Always cc the linux-api mailing list when proposing user-visible API changes]

On Tue 31-10-17 22:32:44, Shawn Landden wrote:
> It is common for services to be stateless around their main event loop.
> If a process passes the EPOLL_KILLME flag to epoll_wait5() then it
> signals to the kernel that epoll_wait5() may not complete, and the kernel
> may send SIGKILL if resources get tight.
>
> See my systemd patch: https://github.com/shawnl/systemd/tree/killme
>
> Android uses this memory model for all programs, and having it in the
> kernel will enable integration with the page cache (not in this
> series).

I have to say I completely hate the idea. You are abusing epoll_wait5 for out-of-memory handling. Why is this syscall any different from any other one which sleeps and waits idle for an event? We do have per-task oom_score_adj for that purpose. Besides that, the patch is simply wrong, because
[...]
> @@ -1029,6 +1030,22 @@ bool out_of_memory(struct oom_control *oc)
> 		return true;
> 	}
>
> +	/*
> +	 * Check death row.
> +	 */
> +	if (!list_empty(eventpoll_deathrow_list())) {
> +		struct list_head *l = eventpoll_deathrow_list();
> +		struct task_struct *ts = list_first_entry(l,
> +			struct task_struct, se.deathrow);
> +
> +		pr_debug("Killing pid %u from EPOLL_KILLME death row.",
> +			ts->pid);
> +
> +		/* We use SIGKILL so as to cleanly interrupt ep_poll() */
> +		kill_pid(task_pid(ts), SIGKILL, 1);
> +		return true;
> +	}
> +

this doesn't reflect the OOM domain (is this a memcg, mempolicy, or cpuset constrained OOM?). You might be killing tasks which are not in the target domain.

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 58+ messages in thread
* [RFC v2] prctl: prctl(PR_SET_IDLE, PR_IDLE_MODE_KILLME), for stateless idle loops
  2017-11-01  5:32 ` Shawn Landden
@ 2017-11-03  6:35 ` Shawn Landden
  0 siblings, 0 replies; 58+ messages in thread
From: Shawn Landden @ 2017-11-03 6:35 UTC (permalink / raw)
Cc: linux-kernel, linux-fsdevel, linux-mm, linux-api, Shawn Landden

It is common for services to be stateless around their main event loop. If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME, it signals to the kernel that epoll_wait() and friends may not complete, and the kernel may send SIGKILL if resources get tight.

See my systemd patch: https://github.com/shawnl/systemd/tree/prctl

Android uses this memory model for all programs, and having it in the kernel will enable integration with the page cache (not in this series).

16 bytes per process is kinda spendy, but I want to keep LRU behavior, which oom_score_adj does not allow. When a supervisor (like Android's user-input tracking) keeps track itself, this can be done in user space. The state could be pulled out of task_struct if an additional cross-indexing red-black tree were added to support pid-based lookup.
v2 switch to prctl, memcg support --- fs/eventpoll.c | 17 +++++++++++++ fs/proc/array.c | 7 ++++++ include/linux/memcontrol.h | 3 +++ include/linux/oom.h | 4 ++++ include/linux/sched.h | 4 ++++ include/uapi/linux/prctl.h | 4 ++++ kernel/cgroup/cgroup.c | 12 ++++++++++ kernel/exit.c | 2 ++ kernel/sys.c | 9 +++++++ mm/memcontrol.c | 4 ++++ mm/oom_kill.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++ 11 files changed, 126 insertions(+) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 2fabd19cdeea..04011fca038b 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -43,6 +43,7 @@ #include <linux/compat.h> #include <linux/rculist.h> #include <net/busy_poll.h> +#include <linux/oom.h> /* * LOCKING: @@ -1762,6 +1763,14 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, wait_queue_entry_t wait; ktime_t expires, *to = NULL; + if (current->oom_target) { + spin_lock(oom_target_get_spinlock(current)); + list_add(¤t->se.oom_target_queue, + oom_target_get_queue(current)); + current->se.oom_target_on_queue = 1; + spin_unlock(oom_target_get_spinlock(current)); + } + if (timeout > 0) { struct timespec64 end_time = ep_set_mstimeout(timeout); @@ -1783,6 +1792,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, if (!ep_events_available(ep)) ep_busy_loop(ep, timed_out); + spin_lock_irqsave(&ep->lock, flags); if (!ep_events_available(ep)) { @@ -1850,6 +1860,13 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, !(res = ep_send_events(ep, events, maxevents)) && !timed_out) goto fetch_events; + if (current->oom_target) { + spin_lock(oom_target_get_spinlock(current)); + list_del(¤t->se.oom_target_queue); + current->se.oom_target_on_queue = 0; + spin_unlock(oom_target_get_spinlock(current)); + } + return res; } diff --git a/fs/proc/array.c b/fs/proc/array.c index 77a8eacbe032..cab009727a7f 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -349,6 +349,12 @@ static inline void task_seccomp(struct 
seq_file *m, struct task_struct *p) seq_putc(m, '\n'); } +static inline void task_idle(struct seq_file *m, struct task_struct *p) +{ + seq_put_decimal_ull(m, "Idle:\t", p->oom_target); + seq_putc(m, '\n'); +} + static inline void task_context_switch_counts(struct seq_file *m, struct task_struct *p) { @@ -380,6 +386,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns, task_sig(m, task); task_cap(m, task); task_seccomp(m, task); + task_idle(m, task); task_cpus_allowed(m, task); cpuset_task_status_allowed(m, task); task_context_switch_counts(m, task); diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 69966c461d1c..40a2db8ae522 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -261,6 +261,9 @@ struct mem_cgroup { struct list_head event_list; spinlock_t event_list_lock; + struct list_head oom_target_queue; + spinlock_t oom_target_spinlock; + struct mem_cgroup_per_node *nodeinfo[0]; /* WARNING: nodeinfo must be the last member here */ }; diff --git a/include/linux/oom.h b/include/linux/oom.h index 76aac4ce39bc..a5d16eb05297 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -101,6 +101,10 @@ extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); +extern void exit_oom_target(void); +struct list_head *oom_target_get_queue(struct task_struct *ts); +spinlock_t *oom_target_get_spinlock(struct task_struct *ts); + /* sysctls */ extern int sysctl_oom_dump_tasks; extern int sysctl_oom_kill_allocating_task; diff --git a/include/linux/sched.h b/include/linux/sched.h index 26a7df4e558c..2b110c4d7357 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -380,6 +380,9 @@ struct sched_entity { struct list_head group_node; unsigned int on_rq; + unsigned oom_target_on_queue:1; + struct list_head oom_target_queue; + u64 exec_start; u64 sum_exec_runtime; u64 vruntime; @@ -651,6 +654,7 @@ struct task_struct { /* disallow userland-initiated cgroup 
migration */ unsigned no_cgroup_migration:1; #endif + unsigned oom_target:1; unsigned long atomic_flags; /* Flags requiring atomic access. */ diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index a8d0759a9e40..eba3c3c8375b 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -197,4 +197,8 @@ struct prctl_mm_map { # define PR_CAP_AMBIENT_LOWER 3 # define PR_CAP_AMBIENT_CLEAR_ALL 4 +#define PR_SET_IDLE 48 +#define PR_GET_IDLE 49 +# define PR_IDLE_MODE_KILLME 1 + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 44857278eb8a..bd48b84d9565 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -55,6 +55,7 @@ #include <linux/nsproxy.h> #include <linux/file.h> #include <net/sock.h> +#include <linux/oom.h> #define CREATE_TRACE_POINTS #include <trace/events/cgroup.h> @@ -779,6 +780,11 @@ static void css_set_move_task(struct task_struct *task, css_task_iter_advance(it); list_del_init(&task->cg_list); + if (task->se.oom_target_on_queue) { + spin_lock(oom_target_get_spinlock(task)); + list_del_init(&task->se.oom_target_queue); + spin_unlock(oom_target_get_spinlock(task)); + } if (!css_set_populated(from_cset)) css_set_update_populated(from_cset, false); } else { @@ -797,6 +803,12 @@ static void css_set_move_task(struct task_struct *task, rcu_assign_pointer(task->cgroups, to_cset); list_add_tail(&task->cg_list, use_mg_tasks ? 
&to_cset->mg_tasks : &to_cset->tasks); + if (task->se.oom_target_on_queue) { + spin_lock(oom_target_get_spinlock(task)); + list_add_tail(&task->se.oom_target_queue, + oom_target_get_queue(task)); + spin_unlock(oom_target_get_spinlock(task)); + } } } diff --git a/kernel/exit.c b/kernel/exit.c index f6cad39f35df..bb13a359b5e7 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -62,6 +62,7 @@ #include <linux/random.h> #include <linux/rcuwait.h> #include <linux/compat.h> +#include <linux/eventpoll.h> #include <linux/uaccess.h> #include <asm/unistd.h> @@ -917,6 +918,7 @@ void __noreturn do_exit(long code) __this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied); exit_rcu(); exit_tasks_rcu_finish(); + exit_oom_target(); lockdep_free_task(tsk); do_task_dead(); diff --git a/kernel/sys.c b/kernel/sys.c index 9aebc2935013..f949b193f126 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2385,6 +2385,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; + case PR_SET_IDLE: + if (!((arg2 == 0) || (arg2 == PR_IDLE_MODE_KILLME))) + return -EINVAL; + me->oom_target = arg2; + error = 0; + break; + case PR_GET_IDLE: + error = me->oom_target; + break; default: error = -EINVAL; break; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 661f046ad318..f6ea5adac586 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4300,6 +4300,10 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) memory_cgrp_subsys.broken_hierarchy = true; } + INIT_LIST_HEAD(&memcg->oom_target_queue); + memcg->oom_target_spinlock = __SPIN_LOCK_UNLOCKED( + &memcg->oom_target_spinlock); + /* The following stuff does not apply to the root */ if (!parent) { root_mem_cgroup = memcg; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index dee0f75c3013..05394f0bd6ab 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -41,6 +41,7 @@ #include <linux/kthread.h> #include <linux/init.h> #include <linux/mmu_notifier.h> +#include <linux/eventpoll.h> 
#include <asm/tlb.h> #include "internal.h" @@ -54,6 +55,46 @@ int sysctl_oom_dump_tasks = 1; DEFINE_MUTEX(oom_lock); +static DEFINE_SPINLOCK(oom_target_spinlock); +static LIST_HEAD(oom_target_global_queue); + +/* Clean up after a EPOLL_KILLME process quits. + * Called by kernel/exit.c. + */ +void exit_oom_target(void) +{ + if (current->se.oom_target_on_queue) { + spin_lock(&oom_target_spinlock); + current->se.oom_target_on_queue = 0; + list_del(¤t->se.oom_target_queue); + spin_unlock(&oom_target_spinlock); + } +} + +inline struct list_head *oom_target_get_queue(struct task_struct *ts) +{ +#ifdef CONFIG_MEMCG + struct mem_cgroup *mcg; + + mcg = mem_cgroup_from_task(ts); + if (mcg) + return &mcg->oom_target_queue; +#endif + return &oom_target_global_queue; +} + +inline spinlock_t *oom_target_get_spinlock(struct task_struct *ts) +{ +#ifdef CONFIG_MEMCG + struct mem_cgroup *mcg; + + mcg = mem_cgroup_from_task(ts); + if (mcg) + return &mcg->oom_target_spinlock; +#endif + return &oom_target_spinlock; +} + #ifdef CONFIG_NUMA /** * has_intersects_mems_allowed() - check task eligiblity for kill @@ -1007,6 +1048,7 @@ bool out_of_memory(struct oom_control *oc) { unsigned long freed = 0; enum oom_constraint constraint = CONSTRAINT_NONE; + struct list_head *l; if (oom_killer_disabled) return false; @@ -1018,6 +1060,24 @@ bool out_of_memory(struct oom_control *oc) return true; } + /* + * Check death row for current memcg or global. + */ + l = oom_target_get_queue(current); + if (!list_empty(l)) { + struct task_struct *ts = list_first_entry(l, + struct task_struct, se.oom_target_queue); + + pr_debug("Killing pid %u from EPOLL_KILLME death row.", + ts->pid); + + /* We use SIGKILL instead of the oom killer + * so as to cleanly interrupt ep_poll() + */ + send_sig(SIGKILL, ts, 1); + return true; + } + /* * If current has a pending SIGKILL or is exiting, then automatically * select it. 
The goal is to allow it to allocate so that it may -- 2.15.0.rc2 ^ permalink raw reply related [flat|nested] 58+ messages in thread
#include <asm/tlb.h> #include "internal.h" @@ -54,6 +55,46 @@ int sysctl_oom_dump_tasks = 1; DEFINE_MUTEX(oom_lock); +static DEFINE_SPINLOCK(oom_target_spinlock); +static LIST_HEAD(oom_target_global_queue); + +/* Clean up after a EPOLL_KILLME process quits. + * Called by kernel/exit.c. + */ +void exit_oom_target(void) +{ + if (current->se.oom_target_on_queue) { + spin_lock(&oom_target_spinlock); + current->se.oom_target_on_queue = 0; + list_del(¤t->se.oom_target_queue); + spin_unlock(&oom_target_spinlock); + } +} + +inline struct list_head *oom_target_get_queue(struct task_struct *ts) +{ +#ifdef CONFIG_MEMCG + struct mem_cgroup *mcg; + + mcg = mem_cgroup_from_task(ts); + if (mcg) + return &mcg->oom_target_queue; +#endif + return &oom_target_global_queue; +} + +inline spinlock_t *oom_target_get_spinlock(struct task_struct *ts) +{ +#ifdef CONFIG_MEMCG + struct mem_cgroup *mcg; + + mcg = mem_cgroup_from_task(ts); + if (mcg) + return &mcg->oom_target_spinlock; +#endif + return &oom_target_spinlock; +} + #ifdef CONFIG_NUMA /** * has_intersects_mems_allowed() - check task eligiblity for kill @@ -1007,6 +1048,7 @@ bool out_of_memory(struct oom_control *oc) { unsigned long freed = 0; enum oom_constraint constraint = CONSTRAINT_NONE; + struct list_head *l; if (oom_killer_disabled) return false; @@ -1018,6 +1060,24 @@ bool out_of_memory(struct oom_control *oc) return true; } + /* + * Check death row for current memcg or global. + */ + l = oom_target_get_queue(current); + if (!list_empty(l)) { + struct task_struct *ts = list_first_entry(l, + struct task_struct, se.oom_target_queue); + + pr_debug("Killing pid %u from EPOLL_KILLME death row.", + ts->pid); + + /* We use SIGKILL instead of the oom killer + * so as to cleanly interrupt ep_poll() + */ + send_sig(SIGKILL, ts, 1); + return true; + } + /* * If current has a pending SIGKILL or is exiting, then automatically * select it. 
The goal is to allow it to allocate so that it may -- 2.15.0.rc2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 58+ messages in thread
* Re: [RFC v2] prctl: prctl(PR_SET_IDLE, PR_IDLE_MODE_KILLME), for stateless idle loops 2017-11-03 6:35 ` Shawn Landden @ 2017-11-03 9:09 ` Michal Hocko -1 siblings, 0 replies; 58+ messages in thread From: Michal Hocko @ 2017-11-03 9:09 UTC (permalink / raw) To: Shawn Landden; +Cc: linux-kernel, linux-fsdevel, linux-mm, linux-api On Thu 02-11-17 23:35:44, Shawn Landden wrote: > It is common for services to be stateless around their main event loop. > If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it > signals to the kernel that epoll_wait() and friends may not complete, > and the kernel may send SIGKILL if resources get tight. > > See my systemd patch: https://github.com/shawnl/systemd/tree/prctl > > Android uses this memory model for all programs, and having it in the > kernel will enable integration with the page cache (not in this > series). > > 16 bytes per process is kinda spendy, but I want to keep > lru behavior, which mem_score_adj does not allow. When a supervisor, > like Android's user input is keeping track this can be done in user-space. > It could be pulled out of task_struct if an cross-indexing additional > red-black tree is added to support pid-based lookup. This is still an abuse and the patch is wrong. We really do have an API to use I fail to see why you do not use it. [...] > @@ -1018,6 +1060,24 @@ bool out_of_memory(struct oom_control *oc) > return true; > } > > + /* > + * Check death row for current memcg or global. > + */ > + l = oom_target_get_queue(current); > + if (!list_empty(l)) { > + struct task_struct *ts = list_first_entry(l, > + struct task_struct, se.oom_target_queue); > + > + pr_debug("Killing pid %u from EPOLL_KILLME death row.", > + ts->pid); > + > + /* We use SIGKILL instead of the oom killer > + * so as to cleanly interrupt ep_poll() > + */ > + send_sig(SIGKILL, ts, 1); > + return true; > + } Still not NUMA aware and completely backwards. If this is a memcg OOM then it is _memcg_ to evaluate not the current. 
The oom might happen up the hierarchy due to hard limit. But still, you should be very clear _why_ the existing oom tuning is not appropriate and we can think of a way to handle it better but cramming the oom selection this way is simply not acceptable. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [RFC v2] prctl: prctl(PR_SET_IDLE, PR_IDLE_MODE_KILLME), for stateless idle loops 2017-11-03 9:09 ` Michal Hocko (?) @ 2017-11-18 4:45 ` Shawn Landden 2017-11-19 4:19 ` Matthew Wilcox 2017-11-20 8:35 ` Michal Hocko -1 siblings, 2 replies; 58+ messages in thread From: Shawn Landden @ 2017-11-18 4:45 UTC (permalink / raw) To: Michal Hocko; +Cc: linux-kernel, linux-fsdevel, linux-mm, linux-api [-- Attachment #1: Type: text/plain, Size: 2369 bytes --] On Fri, Nov 3, 2017 at 2:09 AM, Michal Hocko <mhocko@kernel.org> wrote: > On Thu 02-11-17 23:35:44, Shawn Landden wrote: > > It is common for services to be stateless around their main event loop. > > If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it > > signals to the kernel that epoll_wait() and friends may not complete, > > and the kernel may send SIGKILL if resources get tight. > > > > See my systemd patch: https://github.com/shawnl/systemd/tree/prctl > > > > Android uses this memory model for all programs, and having it in the > > kernel will enable integration with the page cache (not in this > > series). > > > > 16 bytes per process is kinda spendy, but I want to keep > > lru behavior, which mem_score_adj does not allow. When a supervisor, > > like Android's user input is keeping track this can be done in > user-space. > > It could be pulled out of task_struct if an cross-indexing additional > > red-black tree is added to support pid-based lookup. > > This is still an abuse and the patch is wrong. We really do have an API > to use I fail to see why you do not use it. > When I looked at wait_queue_head_t it was 20 bytes. > > [...] > > @@ -1018,6 +1060,24 @@ bool out_of_memory(struct oom_control *oc) > > return true; > > } > > > > + /* > > + * Check death row for current memcg or global.
> > + */ > > + l = oom_target_get_queue(current); > > + if (!list_empty(l)) { > > + struct task_struct *ts = list_first_entry(l, > > + struct task_struct, se.oom_target_queue); > > + > > + pr_debug("Killing pid %u from EPOLL_KILLME death row.", > > + ts->pid); > > + > > + /* We use SIGKILL instead of the oom killer > > + * so as to cleanly interrupt ep_poll() > > + */ > > + send_sig(SIGKILL, ts, 1); > > + return true; > > + } > > Still not NUMA aware and completely backwards. If this is a memcg OOM > then it is _memcg_ to evaluate not the current. The oom might happen up > the hierarchy due to hard limit. > > But still, you should be very clear _why_ the existing oom tuning is not > appropropriate and we can think of a way to hanle it better but cramming > the oom selection this way is simply not acceptable. > -- > Michal Hocko > SUSE Labs > [-- Attachment #2: Type: text/html, Size: 3382 bytes --] ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [RFC v2] prctl: prctl(PR_SET_IDLE, PR_IDLE_MODE_KILLME), for stateless idle loops 2017-11-18 4:45 ` Shawn Landden 2017-11-19 4:19 ` Matthew Wilcox @ 2017-11-19 4:19 ` Matthew Wilcox 1 sibling, 0 replies; 58+ messages in thread From: Matthew Wilcox @ 2017-11-19 4:19 UTC (permalink / raw) To: Shawn Landden Cc: Michal Hocko, linux-kernel, linux-fsdevel, linux-mm, linux-api On Fri, Nov 17, 2017 at 08:45:03PM -0800, Shawn Landden wrote: > On Fri, Nov 3, 2017 at 2:09 AM, Michal Hocko <mhocko@kernel.org> wrote: > > On Thu 02-11-17 23:35:44, Shawn Landden wrote: > > > 16 bytes per process is kinda spendy, but I want to keep > > > lru behavior, which mem_score_adj does not allow. When a supervisor, > > > like Android's user input is keeping track this can be done in > > user-space. > > > It could be pulled out of task_struct if an cross-indexing additional > > > red-black tree is added to support pid-based lookup. > > > > This is still an abuse and the patch is wrong. We really do have an API > > to use I fail to see why you do not use it. > > > When I looked at wait_queue_head_t it was 20 byes. 24 bytes actually; the compiler will add 4 bytes of padding between the spinlock and the list_head. But there's one for the entire system. Then you add a 40 byte structure (wait_queue_entry) on the stack for each sleeping process. There's no per-process cost. ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [RFC v2] prctl: prctl(PR_SET_IDLE, PR_IDLE_MODE_KILLME), for stateless idle loops 2017-11-18 4:45 ` Shawn Landden @ 2017-11-20 8:35 ` Michal Hocko 2017-11-20 8:35 ` Michal Hocko 1 sibling, 0 replies; 58+ messages in thread From: Michal Hocko @ 2017-11-20 8:35 UTC (permalink / raw) To: Shawn Landden; +Cc: linux-kernel, linux-fsdevel, linux-mm, linux-api On Fri 17-11-17 20:45:03, Shawn Landden wrote: > On Fri, Nov 3, 2017 at 2:09 AM, Michal Hocko <mhocko@kernel.org> wrote: > > > On Thu 02-11-17 23:35:44, Shawn Landden wrote: > > > It is common for services to be stateless around their main event loop. > > > If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it > > > signals to the kernel that epoll_wait() and friends may not complete, > > > and the kernel may send SIGKILL if resources get tight. > > > > > > See my systemd patch: https://github.com/shawnl/systemd/tree/prctl > > > > > > Android uses this memory model for all programs, and having it in the > > > kernel will enable integration with the page cache (not in this > > > series). > > > > > > 16 bytes per process is kinda spendy, but I want to keep > > > lru behavior, which mem_score_adj does not allow. When a supervisor, > > > like Android's user input is keeping track this can be done in > > user-space. > > > It could be pulled out of task_struct if an cross-indexing additional > > > red-black tree is added to support pid-based lookup. > > > > This is still an abuse and the patch is wrong. We really do have an API > > to use I fail to see why you do not use it. > > > When I looked at wait_queue_head_t it was 20 byes. I do not understand. What I meant to say is that we do have a proper user api to hint OOM killer decisions. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [RFC v2] prctl: prctl(PR_SET_IDLE, PR_IDLE_MODE_KILLME), for stateless idle loops 2017-11-20 8:35 ` Michal Hocko @ 2017-11-21 4:48 ` Shawn Landden -1 siblings, 0 replies; 58+ messages in thread From: Shawn Landden @ 2017-11-21 4:48 UTC (permalink / raw) To: Michal Hocko; +Cc: linux-kernel, linux-fsdevel, linux-mm, linux-api On Mon, Nov 20, 2017 at 12:35 AM, Michal Hocko <mhocko@kernel.org> wrote: > On Fri 17-11-17 20:45:03, Shawn Landden wrote: >> On Fri, Nov 3, 2017 at 2:09 AM, Michal Hocko <mhocko@kernel.org> wrote: >> >> > On Thu 02-11-17 23:35:44, Shawn Landden wrote: >> > > It is common for services to be stateless around their main event loop. >> > > If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it >> > > signals to the kernel that epoll_wait() and friends may not complete, >> > > and the kernel may send SIGKILL if resources get tight. >> > > >> > > See my systemd patch: https://github.com/shawnl/systemd/tree/prctl >> > > >> > > Android uses this memory model for all programs, and having it in the >> > > kernel will enable integration with the page cache (not in this >> > > series). >> > > >> > > 16 bytes per process is kinda spendy, but I want to keep >> > > lru behavior, which mem_score_adj does not allow. When a supervisor, >> > > like Android's user input is keeping track this can be done in >> > user-space. >> > > It could be pulled out of task_struct if an cross-indexing additional >> > > red-black tree is added to support pid-based lookup. >> > >> > This is still an abuse and the patch is wrong. We really do have an API >> > to use I fail to see why you do not use it. >> > >> When I looked at wait_queue_head_t it was 20 byes. > > I do not understand. What I meant to say is that we do have a proper > user api to hint OOM killer decisions. This is a FIFO queue, rather than a heuristic, which is all you get with the current API. > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [RFC v2] prctl: prctl(PR_SET_IDLE, PR_IDLE_MODE_KILLME), for stateless idle loops 2017-11-21 4:48 ` Shawn Landden @ 2017-11-21 7:05 ` Michal Hocko -1 siblings, 0 replies; 58+ messages in thread From: Michal Hocko @ 2017-11-21 7:05 UTC (permalink / raw) To: Shawn Landden; +Cc: linux-kernel, linux-fsdevel, linux-mm, linux-api On Mon 20-11-17 20:48:10, Shawn Landden wrote: > On Mon, Nov 20, 2017 at 12:35 AM, Michal Hocko <mhocko@kernel.org> wrote: > > On Fri 17-11-17 20:45:03, Shawn Landden wrote: > >> On Fri, Nov 3, 2017 at 2:09 AM, Michal Hocko <mhocko@kernel.org> wrote: > >> > >> > On Thu 02-11-17 23:35:44, Shawn Landden wrote: > >> > > It is common for services to be stateless around their main event loop. > >> > > If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it > >> > > signals to the kernel that epoll_wait() and friends may not complete, > >> > > and the kernel may send SIGKILL if resources get tight. > >> > > > >> > > See my systemd patch: https://github.com/shawnl/systemd/tree/prctl > >> > > > >> > > Android uses this memory model for all programs, and having it in the > >> > > kernel will enable integration with the page cache (not in this > >> > > series). > >> > > > >> > > 16 bytes per process is kinda spendy, but I want to keep > >> > > lru behavior, which mem_score_adj does not allow. When a supervisor, > >> > > like Android's user input is keeping track this can be done in > >> > user-space. > >> > > It could be pulled out of task_struct if an cross-indexing additional > >> > > red-black tree is added to support pid-based lookup. > >> > > >> > This is still an abuse and the patch is wrong. We really do have an API > >> > to use I fail to see why you do not use it. > >> > > >> When I looked at wait_queue_head_t it was 20 byes. > > > > I do not understand. What I meant to say is that we do have a proper > > user api to hint OOM killer decisions. > This is a FIFO queue, rather than a heuristic, which is all you get > with the current API. 
Yes I can read the code. All I am saying is that we already have an API to achieve what you want or at least very similar. Let me be explicit. Nacked-by: Michal Hocko <mhocko@suse.com> until it is sufficiently explained that the oom_score_adj is not suitable and there are no other means to achieve what you need. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [RFC v2] prctl: prctl(PR_SET_IDLE, PR_IDLE_MODE_KILLME), for stateless idle loops 2017-11-03 9:09 ` Michal Hocko @ 2017-11-18 20:33 ` Shawn Landden -1 siblings, 0 replies; 58+ messages in thread From: Shawn Landden @ 2017-11-18 20:33 UTC (permalink / raw) To: Michal Hocko; +Cc: linux-kernel, linux-fsdevel, linux-mm, linux-api On Fri, Nov 3, 2017 at 2:09 AM, Michal Hocko <mhocko@kernel.org> wrote: > On Thu 02-11-17 23:35:44, Shawn Landden wrote: >> 16 bytes per process is kinda spendy, but I want to keep >> lru behavior, which mem_score_adj does not allow. When a supervisor, >> like Android's user input is keeping track this can be done in user-space. >> It could be pulled out of task_struct if an cross-indexing additional >> red-black tree is added to support pid-based lookup. > > This is still an abuse and the patch is wrong. We really do have an API > to use I fail to see why you do not use it. When I looked at wait_queue_head_t it was 20 bytes. ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [RFC v2] prctl: prctl(PR_SET_IDLE, PR_IDLE_MODE_KILLME), for stateless idle loops 2017-11-03 6:35 ` Shawn Landden ` (2 preceding siblings ...) (?) @ 2017-11-15 21:11 ` Pavel Machek 0 siblings, 0 replies; 58+ messages in thread From: Pavel Machek @ 2017-11-15 21:11 UTC (permalink / raw) To: Shawn Landden; +Cc: kernel list Hi! > It is common for services to be stateless around their main event loop. > If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it > signals to the kernel that epoll_wait() and friends may not complete, > and the kernel may send SIGKILL if resources get tight. > > See my systemd patch: https://github.com/shawnl/systemd/tree/prctl > > Android uses this memory model for all programs, and having it in the > kernel will enable integration with the page cache (not in this > series). > > 16 bytes per process is kinda spendy, but I want to keep > LRU behavior, which oom_score_adj does not allow. When a supervisor, > like Android's user input, is keeping track, this can be done in user-space. > It could be pulled out of task_struct if an additional cross-indexing > red-black tree is added to support pid-based lookup. Having an Android-like system for low-memory killing might be interesting... but rather than throwing around patches, maybe there should first be a discussion on lkml about how the interface should look? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 58+ messages in thread
* [RFC v3] It is common for services to be stateless around their main event loop. If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it signals to the kernel that epoll_wait() and friends may not complete, and the kernel may send SIGKILL if resources get tight. 2017-11-03 6:35 ` Shawn Landden (?) @ 2017-11-21 4:49 ` Shawn Landden -1 siblings, 0 replies; 58+ messages in thread From: Shawn Landden @ 2017-11-21 4:49 UTC (permalink / raw) Cc: linux-kernel, linux-fsdevel, linux-mm, linux-api, mhocko, willy, Shawn Landden See my systemd patch: https://github.com/shawnl/systemd/tree/prctl Android uses this memory model for all programs, and having it in the kernel will enable integration with the page cache (not in this series). v2 switch to prctl, memcg support v3 use <linux/wait.h> put OOM after constraint checking --- fs/eventpoll.c | 27 ++++++++++++++++++++ fs/proc/array.c | 7 ++++++ include/linux/memcontrol.h | 3 +++ include/linux/oom.h | 4 +++ include/linux/sched.h | 1 + include/uapi/linux/prctl.h | 4 +++ kernel/cgroup/cgroup.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++ kernel/exit.c | 1 + kernel/sys.c | 9 +++++++ mm/memcontrol.c | 2 ++ mm/oom_kill.c | 47 +++++++++++++++++++++++++++++++++++ 11 files changed, 166 insertions(+) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 2fabd19cdeea..745662f9a7e1 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -43,6 +43,8 @@ #include <linux/compat.h> #include <linux/rculist.h> #include <net/busy_poll.h> +#include <linux/memcontrol.h> +#include <linux/oom.h> /* * LOCKING: @@ -1761,6 +1763,19 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, u64 slack = 0; wait_queue_entry_t wait; ktime_t expires, *to = NULL; + DEFINE_WAIT_FUNC(oom_target_wait, oom_target_callback); + DEFINE_WAIT_FUNC(oom_target_wait_mcg, oom_target_callback); + + if (current->oom_target) { +#ifdef CONFIG_MEMCG + struct mem_cgroup *mcg; + + mcg = mem_cgroup_from_task(current); + if (mcg) + 
add_wait_queue(&mcg->oom_target, &oom_target_wait_mcg); +#endif + add_wait_queue(oom_target_get_wait(), &oom_target_wait); + } if (timeout > 0) { struct timespec64 end_time = ep_set_mstimeout(timeout); @@ -1850,6 +1865,18 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, !(res = ep_send_events(ep, events, maxevents)) && !timed_out) goto fetch_events; + if (current->oom_target) { +#ifdef CONFIG_MEMCG + struct mem_cgroup *mcg; + + mcg = mem_cgroup_from_task(current); + if (mcg) + remove_wait_queue(&mcg->oom_target, + &oom_target_wait_mcg); +#endif + remove_wait_queue(oom_target_get_wait(), &oom_target_wait); + } + return res; } diff --git a/fs/proc/array.c b/fs/proc/array.c index 9390032a11e1..1954ae87cb88 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -350,6 +350,12 @@ static inline void task_seccomp(struct seq_file *m, struct task_struct *p) seq_putc(m, '\n'); } +static inline void task_idle(struct seq_file *m, struct task_struct *p) +{ + seq_put_decimal_ull(m, "Idle:\t", p->oom_target); + seq_putc(m, '\n'); +} + static inline void task_context_switch_counts(struct seq_file *m, struct task_struct *p) { @@ -381,6 +387,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns, task_sig(m, task); task_cap(m, task); task_seccomp(m, task); + task_idle(m, task); task_cpus_allowed(m, task); cpuset_task_status_allowed(m, task); task_context_switch_counts(m, task); diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 69966c461d1c..02eb92e7eff5 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -30,6 +30,7 @@ #include <linux/vmstat.h> #include <linux/writeback.h> #include <linux/page-flags.h> +#include <linux/wait.h> struct mem_cgroup; struct page; @@ -261,6 +262,8 @@ struct mem_cgroup { struct list_head event_list; spinlock_t event_list_lock; + wait_queue_head_t oom_target; + struct mem_cgroup_per_node *nodeinfo[0]; /* WARNING: nodeinfo must be the last member here */ }; diff 
--git a/include/linux/oom.h b/include/linux/oom.h index 01c91d874a57..88acea9e0a59 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -102,6 +102,10 @@ extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); +extern void exit_oom_target(void); +struct wait_queue_head *oom_target_get_wait(void); +int oom_target_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key); + /* sysctls */ extern int sysctl_oom_dump_tasks; extern int sysctl_oom_kill_allocating_task; diff --git a/include/linux/sched.h b/include/linux/sched.h index fdf74f27acf1..51b0e5987e8c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -652,6 +652,7 @@ struct task_struct { /* disallow userland-initiated cgroup migration */ unsigned no_cgroup_migration:1; #endif + unsigned oom_target:1; unsigned long atomic_flags; /* Flags requiring atomic access. */ diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index b640071421f7..94868317c6f2 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -198,4 +198,8 @@ struct prctl_mm_map { # define PR_CAP_AMBIENT_LOWER 3 # define PR_CAP_AMBIENT_CLEAR_ALL 4 +#define PR_SET_IDLE 48 +#define PR_GET_IDLE 49 +# define PR_IDLE_MODE_KILLME 1 + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 44857278eb8a..081bcd84a8d0 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -55,6 +55,8 @@ #include <linux/nsproxy.h> #include <linux/file.h> #include <net/sock.h> +#include <linux/oom.h> +#include <linux/memcontrol.h> #define CREATE_TRACE_POINTS #include <trace/events/cgroup.h> @@ -756,6 +758,9 @@ static void css_set_move_task(struct task_struct *task, struct css_set *from_cset, struct css_set *to_cset, bool use_mg_tasks) { +#ifdef CONFIG_MEMCG + struct mem_cgroup *mcg; +#endif lockdep_assert_held(&css_set_lock); if (to_cset && !css_set_populated(to_cset)) @@ -779,6 +784,35 @@ static void 
css_set_move_task(struct task_struct *task, css_task_iter_advance(it); list_del_init(&task->cg_list); +#ifdef CONFIG_MEMCG + /* dequeue from memcg->oom_target + * TODO: this is O(n), add rb-tree to make it O(logn) + */ + mcg = mem_cgroup_from_task(task); + if (mcg) { + struct wait_queue_entry *wait; + + spin_lock(&mcg->oom_target.lock); + if (!waitqueue_active(&mcg->oom_target)) + goto empty_from; + wait = list_first_entry(&mcg->oom_target.head, + wait_queue_entry_t, entry); + do { + struct list_head *list; + + if (wait->private == task) + __remove_wait_queue(&mcg->oom_target, + wait); + list = wait->entry.next; + if (list_is_last(list, &mcg->oom_target.head)) + break; + wait = list_entry(list, + struct wait_queue_entry, entry); + } while (1); +empty_from: + spin_unlock(&mcg->oom_target.lock); + } +#endif if (!css_set_populated(from_cset)) css_set_update_populated(from_cset, false); } else { @@ -797,6 +831,33 @@ static void css_set_move_task(struct task_struct *task, rcu_assign_pointer(task->cgroups, to_cset); list_add_tail(&task->cg_list, use_mg_tasks ? 
&to_cset->mg_tasks : &to_cset->tasks); +#ifdef CONFIG_MEMCG + /* enqueue onto the new memcg's oom_target */ + mcg = mem_cgroup_from_task(task); + if (mcg) { + struct wait_queue_entry *wait; + + spin_lock(&mcg->oom_target.lock); + if (!waitqueue_active(&mcg->oom_target)) + goto empty_to; + wait = list_first_entry(&mcg->oom_target.head, + wait_queue_entry_t, entry); + do { + struct list_head *list; + + if (wait->private == task) + __add_wait_queue(&mcg->oom_target, + wait); + list = wait->entry.next; + if (list_is_last(list, &mcg->oom_target.head)) + break; + wait = list_entry(list, + struct wait_queue_entry, entry); + } while (1); +empty_to: + spin_unlock(&mcg->oom_target.lock); + } +#endif } } diff --git a/kernel/exit.c b/kernel/exit.c index f6cad39f35df..2788fbdae267 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -62,6 +62,7 @@ #include <linux/random.h> #include <linux/rcuwait.h> #include <linux/compat.h> +#include <linux/eventpoll.h> #include <linux/uaccess.h> #include <asm/unistd.h> diff --git a/kernel/sys.c b/kernel/sys.c index 524a4cb9bbe2..e1eb049a85e6 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2386,6 +2386,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; + case PR_SET_IDLE: + if (!((arg2 == 0) || (arg2 == PR_IDLE_MODE_KILLME))) + return -EINVAL; + me->oom_target = arg2; + error = 0; + break; + case PR_GET_IDLE: + error = me->oom_target; + break; default: error = -EINVAL; break; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 661f046ad318..a4e3b93aeccd 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4300,6 +4300,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) memory_cgrp_subsys.broken_hierarchy = true; } + init_waitqueue_head(&memcg->oom_target); + /* The following stuff does not apply to the root */ if (!parent) { root_mem_cgroup = memcg; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index dee0f75c3013..c5d8f5a716bc 100644 --- a/mm/oom_kill.c 
+++ b/mm/oom_kill.c @@ -41,6 +41,9 @@ #include <linux/kthread.h> #include <linux/init.h> #include <linux/mmu_notifier.h> +#include <linux/eventpoll.h> +#include <linux/wait.h> +#include <linux/memcontrol.h> #include <asm/tlb.h> #include "internal.h" @@ -54,6 +57,23 @@ int sysctl_oom_dump_tasks = 1; DEFINE_MUTEX(oom_lock); +static DECLARE_WAIT_QUEUE_HEAD(oom_target); + +/* Clean up after a PR_SET_IDLE process quits. + * Called by kernel/exit.c. + */ +void exit_oom_target(void) +{ + DECLARE_WAITQUEUE(wait, current); + + remove_wait_queue(&oom_target, &wait); +} + +inline struct wait_queue_head *oom_target_get_wait() +{ + return &oom_target; +} + #ifdef CONFIG_NUMA /** * has_intersects_mems_allowed() - check task eligiblity for kill @@ -994,6 +1014,18 @@ int unregister_oom_notifier(struct notifier_block *nb) } EXPORT_SYMBOL_GPL(unregister_oom_notifier); +int oom_target_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key) +{ + struct task_struct *ts = wait->private; + + /* We use SIGKILL instead of the oom killer + * so as to cleanly interrupt ep_poll() + */ + pr_info("Killing pid %u from prctl(PR_SET_IDLE) death row.\n", ts->pid); + send_sig(SIGKILL, ts, 1); + return 0; +} + /** * out_of_memory - kill the "best" process when we run out of memory * @oc: pointer to struct oom_control @@ -1007,6 +1039,7 @@ bool out_of_memory(struct oom_control *oc) { unsigned long freed = 0; enum oom_constraint constraint = CONSTRAINT_NONE; + wait_queue_head_t *w; if (oom_killer_disabled) return false; @@ -1056,6 +1089,20 @@ bool out_of_memory(struct oom_control *oc) return true; } + /* + * Check death row for current memcg or global. + */ +#ifdef CONFIG_MEMCG + if (is_memcg_oom(oc)) + w = &oc->memcg->oom_target; + else +#endif + w = oom_target_get_wait(); + if (waitqueue_active(w)) { + wake_up(w); + return true; + } + select_bad_process(oc); /* Found nothing?!?! Either we hang forever, or we panic. 
*/ if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { -- 2.14.1 ^ permalink raw reply related [flat|nested] 58+ messages in thread
* [RFC v3] It is common for services to be stateless around their main event loop. If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it signals to the kernel that epoll_wait() and friends may not complete, and the kernel may send SIGKILL if resources get tight. @ 2017-11-21 4:49 ` Shawn Landden 0 siblings, 0 replies; 58+ messages in thread From: Shawn Landden @ 2017-11-21 4:49 UTC (permalink / raw) Cc: linux-kernel, linux-fsdevel, linux-mm, linux-api, mhocko, willy, Shawn Landden See my systemd patch: https://github.com/shawnl/systemd/tree/prctl Android uses this memory model for all programs, and having it in the kernel will enable integration with the page cache (not in this series). v2 switch to prctl, memcg support v3 use <linux/wait.h> put OOM after constraint checking --- fs/eventpoll.c | 27 ++++++++++++++++++++ fs/proc/array.c | 7 ++++++ include/linux/memcontrol.h | 3 +++ include/linux/oom.h | 4 +++ include/linux/sched.h | 1 + include/uapi/linux/prctl.h | 4 +++ kernel/cgroup/cgroup.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++ kernel/exit.c | 1 + kernel/sys.c | 9 +++++++ mm/memcontrol.c | 2 ++ mm/oom_kill.c | 47 +++++++++++++++++++++++++++++++++++ 11 files changed, 166 insertions(+) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 2fabd19cdeea..745662f9a7e1 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -43,6 +43,8 @@ #include <linux/compat.h> #include <linux/rculist.h> #include <net/busy_poll.h> +#include <linux/memcontrol.h> +#include <linux/oom.h> /* * LOCKING: @@ -1761,6 +1763,19 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, u64 slack = 0; wait_queue_entry_t wait; ktime_t expires, *to = NULL; + DEFINE_WAIT_FUNC(oom_target_wait, oom_target_callback); + DEFINE_WAIT_FUNC(oom_target_wait_mcg, oom_target_callback); + + if (current->oom_target) { +#ifdef CONFIG_MEMCG + struct mem_cgroup *mcg; + + mcg = mem_cgroup_from_task(current); + if (mcg) + add_wait_queue(&mcg->oom_target, &oom_target_wait_mcg); 
+#endif + add_wait_queue(oom_target_get_wait(), &oom_target_wait); + } if (timeout > 0) { struct timespec64 end_time = ep_set_mstimeout(timeout); @@ -1850,6 +1865,18 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, !(res = ep_send_events(ep, events, maxevents)) && !timed_out) goto fetch_events; + if (current->oom_target) { +#ifdef CONFIG_MEMCG + struct mem_cgroup *mcg; + + mcg = mem_cgroup_from_task(current); + if (mcg) + remove_wait_queue(&mcg->oom_target, + &oom_target_wait_mcg); +#endif + remove_wait_queue(oom_target_get_wait(), &oom_target_wait); + } + return res; } diff --git a/fs/proc/array.c b/fs/proc/array.c index 9390032a11e1..1954ae87cb88 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -350,6 +350,12 @@ static inline void task_seccomp(struct seq_file *m, struct task_struct *p) seq_putc(m, '\n'); } +static inline void task_idle(struct seq_file *m, struct task_struct *p) +{ + seq_put_decimal_ull(m, "Idle:\t", p->oom_target); + seq_putc(m, '\n'); +} + static inline void task_context_switch_counts(struct seq_file *m, struct task_struct *p) { @@ -381,6 +387,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns, task_sig(m, task); task_cap(m, task); task_seccomp(m, task); + task_idle(m, task); task_cpus_allowed(m, task); cpuset_task_status_allowed(m, task); task_context_switch_counts(m, task); diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 69966c461d1c..02eb92e7eff5 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -30,6 +30,7 @@ #include <linux/vmstat.h> #include <linux/writeback.h> #include <linux/page-flags.h> +#include <linux/wait.h> struct mem_cgroup; struct page; @@ -261,6 +262,8 @@ struct mem_cgroup { struct list_head event_list; spinlock_t event_list_lock; + wait_queue_head_t oom_target; + struct mem_cgroup_per_node *nodeinfo[0]; /* WARNING: nodeinfo must be the last member here */ }; diff --git a/include/linux/oom.h b/include/linux/oom.h index 
01c91d874a57..88acea9e0a59 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -102,6 +102,10 @@ extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); +extern void exit_oom_target(void); +struct wait_queue_head *oom_target_get_wait(void); +int oom_target_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key); + /* sysctls */ extern int sysctl_oom_dump_tasks; extern int sysctl_oom_kill_allocating_task; diff --git a/include/linux/sched.h b/include/linux/sched.h index fdf74f27acf1..51b0e5987e8c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -652,6 +652,7 @@ struct task_struct { /* disallow userland-initiated cgroup migration */ unsigned no_cgroup_migration:1; #endif + unsigned oom_target:1; unsigned long atomic_flags; /* Flags requiring atomic access. */ diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index b640071421f7..94868317c6f2 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -198,4 +198,8 @@ struct prctl_mm_map { # define PR_CAP_AMBIENT_LOWER 3 # define PR_CAP_AMBIENT_CLEAR_ALL 4 +#define PR_SET_IDLE 48 +#define PR_GET_IDLE 49 +# define PR_IDLE_MODE_KILLME 1 + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 44857278eb8a..081bcd84a8d0 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -55,6 +55,8 @@ #include <linux/nsproxy.h> #include <linux/file.h> #include <net/sock.h> +#include <linux/oom.h> +#include <linux/memcontrol.h> #define CREATE_TRACE_POINTS #include <trace/events/cgroup.h> @@ -756,6 +758,9 @@ static void css_set_move_task(struct task_struct *task, struct css_set *from_cset, struct css_set *to_cset, bool use_mg_tasks) { +#ifdef CONFIG_MEMCG + struct mem_cgroup *mcg; +#endif lockdep_assert_held(&css_set_lock); if (to_cset && !css_set_populated(to_cset)) @@ -779,6 +784,35 @@ static void css_set_move_task(struct task_struct *task, 
css_task_iter_advance(it); list_del_init(&task->cg_list); +#ifdef CONFIG_MEMCG + /* dequeue from memcg->oom_target + * TODO: this is O(n), add rb-tree to make it O(logn) + */ + mcg = mem_cgroup_from_task(task); + if (mcg) { + struct wait_queue_entry *wait; + + spin_lock(&mcg->oom_target.lock); + if (!waitqueue_active(&mcg->oom_target)) + goto empty_from; + wait = list_first_entry(&mcg->oom_target.head, + wait_queue_entry_t, entry); + do { + struct list_head *list; + + if (wait->private == task) + __remove_wait_queue(&mcg->oom_target, + wait); + list = wait->entry.next; + if (list_is_last(list, &mcg->oom_target.head)) + break; + wait = list_entry(list, + struct wait_queue_entry, entry); + } while (1); +empty_from: + spin_unlock(&mcg->oom_target.lock); + } +#endif if (!css_set_populated(from_cset)) css_set_update_populated(from_cset, false); } else { @@ -797,6 +831,33 @@ static void css_set_move_task(struct task_struct *task, rcu_assign_pointer(task->cgroups, to_cset); list_add_tail(&task->cg_list, use_mg_tasks ? 
&to_cset->mg_tasks : &to_cset->tasks); +#ifdef CONFIG_MEMCG + /* dequeue from memcg->oom_target */ + mcg = mem_cgroup_from_task(task); + if (mcg) { + struct wait_queue_entry *wait; + + spin_lock(&mcg->oom_target.lock); + if (!waitqueue_active(&mcg->oom_target)) + goto empty_to; + wait = list_first_entry(&mcg->oom_target.head, + wait_queue_entry_t, entry); + do { + struct list_head *list; + + if (wait->private == task) + __add_wait_queue(&mcg->oom_target, + wait); + list = wait->entry.next; + if (list_is_last(list, &mcg->oom_target.head)) + break; + wait = list_entry(list, + struct wait_queue_entry, entry); + } while (1); +empty_to: + spin_unlock(&mcg->oom_target.lock); + } +#endif } } diff --git a/kernel/exit.c b/kernel/exit.c index f6cad39f35df..2788fbdae267 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -62,6 +62,7 @@ #include <linux/random.h> #include <linux/rcuwait.h> #include <linux/compat.h> +#include <linux/eventpoll.h> #include <linux/uaccess.h> #include <asm/unistd.h> diff --git a/kernel/sys.c b/kernel/sys.c index 524a4cb9bbe2..e1eb049a85e6 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2386,6 +2386,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; + case PR_SET_IDLE: + if (!((arg2 == 0) || (arg2 == PR_IDLE_MODE_KILLME))) + return -EINVAL; + me->oom_target = arg2; + error = 0; + break; + case PR_GET_IDLE: + error = me->oom_target; + break; default: error = -EINVAL; break; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 661f046ad318..a4e3b93aeccd 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4300,6 +4300,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) memory_cgrp_subsys.broken_hierarchy = true; } + init_waitqueue_head(&memcg->oom_target); + /* The following stuff does not apply to the root */ if (!parent) { root_mem_cgroup = memcg; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index dee0f75c3013..c5d8f5a716bc 100644 --- a/mm/oom_kill.c 
+++ b/mm/oom_kill.c @@ -41,6 +41,9 @@ #include <linux/kthread.h> #include <linux/init.h> #include <linux/mmu_notifier.h> +#include <linux/eventpoll.h> +#include <linux/wait.h> +#include <linux/memcontrol.h> #include <asm/tlb.h> #include "internal.h" @@ -54,6 +57,23 @@ int sysctl_oom_dump_tasks = 1; DEFINE_MUTEX(oom_lock); +static DECLARE_WAIT_QUEUE_HEAD(oom_target); + +/* Clean up after a EPOLL_KILLME process quits. + * Called by kernel/exit.c. + */ +void exit_oom_target(void) +{ + DECLARE_WAITQUEUE(wait, current); + + remove_wait_queue(&oom_target, &wait); +} + +inline struct wait_queue_head *oom_target_get_wait() +{ + return &oom_target; +} + #ifdef CONFIG_NUMA /** * has_intersects_mems_allowed() - check task eligiblity for kill @@ -994,6 +1014,18 @@ int unregister_oom_notifier(struct notifier_block *nb) } EXPORT_SYMBOL_GPL(unregister_oom_notifier); +int oom_target_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key) +{ + struct task_struct *ts = wait->private; + + /* We use SIGKILL instead of the oom killer + * so as to cleanly interrupt ep_poll() + */ + pr_info("Killing pid %u from prctl(PR_SET_IDLE) death row.\n", ts->pid); + send_sig(SIGKILL, ts, 1); + return 0; +} + /** * out_of_memory - kill the "best" process when we run out of memory * @oc: pointer to struct oom_control @@ -1007,6 +1039,7 @@ bool out_of_memory(struct oom_control *oc) { unsigned long freed = 0; enum oom_constraint constraint = CONSTRAINT_NONE; + wait_queue_head_t *w; if (oom_killer_disabled) return false; @@ -1056,6 +1089,20 @@ bool out_of_memory(struct oom_control *oc) return true; } + /* + * Check death row for current memcg or global. + */ +#ifdef CONFIG_MEMCG + if (is_memcg_oom(oc)) + w = &oc->memcg->oom_target; + else +#endif + w = oom_target_get_wait(); + if (waitqueue_active(w)) { + wake_up(w); + return true; + } + select_bad_process(oc); /* Found nothing?!?! Either we hang forever, or we panic. 
*/ if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { -- 2.14.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 58+ messages in thread
* [RFC v3] It is common for services to be stateless around their main event loop. If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it signals to the kernel that epoll_wait() and friends may not complete, and the kernel may send SIGKILL if resources get tight. @ 2017-11-21 4:49 ` Shawn Landden 0 siblings, 0 replies; 58+ messages in thread From: Shawn Landden @ 2017-11-21 4:49 UTC (permalink / raw) Cc: linux-kernel, linux-fsdevel, linux-mm, linux-api, mhocko, willy, Shawn Landden See my systemd patch: https://github.com/shawnl/systemd/tree/prctl Android uses this memory model for all programs, and having it in the kernel will enable integration with the page cache (not in this series). v2 switch to prctl, memcg support v3 use <linux/wait.h> put OOM after constraint checking --- fs/eventpoll.c | 27 ++++++++++++++++++++ fs/proc/array.c | 7 ++++++ include/linux/memcontrol.h | 3 +++ include/linux/oom.h | 4 +++ include/linux/sched.h | 1 + include/uapi/linux/prctl.h | 4 +++ kernel/cgroup/cgroup.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++ kernel/exit.c | 1 + kernel/sys.c | 9 +++++++ mm/memcontrol.c | 2 ++ mm/oom_kill.c | 47 +++++++++++++++++++++++++++++++++++ 11 files changed, 166 insertions(+) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 2fabd19cdeea..745662f9a7e1 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -43,6 +43,8 @@ #include <linux/compat.h> #include <linux/rculist.h> #include <net/busy_poll.h> +#include <linux/memcontrol.h> +#include <linux/oom.h> /* * LOCKING: @@ -1761,6 +1763,19 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, u64 slack = 0; wait_queue_entry_t wait; ktime_t expires, *to = NULL; + DEFINE_WAIT_FUNC(oom_target_wait, oom_target_callback); + DEFINE_WAIT_FUNC(oom_target_wait_mcg, oom_target_callback); + + if (current->oom_target) { +#ifdef CONFIG_MEMCG + struct mem_cgroup *mcg; + + mcg = mem_cgroup_from_task(current); + if (mcg) + add_wait_queue(&mcg->oom_target, &oom_target_wait_mcg); 
+#endif + add_wait_queue(oom_target_get_wait(), &oom_target_wait); + } if (timeout > 0) { struct timespec64 end_time = ep_set_mstimeout(timeout); @@ -1850,6 +1865,18 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, !(res = ep_send_events(ep, events, maxevents)) && !timed_out) goto fetch_events; + if (current->oom_target) { +#ifdef CONFIG_MEMCG + struct mem_cgroup *mcg; + + mcg = mem_cgroup_from_task(current); + if (mcg) + remove_wait_queue(&mcg->oom_target, + &oom_target_wait_mcg); +#endif + remove_wait_queue(oom_target_get_wait(), &oom_target_wait); + } + return res; } diff --git a/fs/proc/array.c b/fs/proc/array.c index 9390032a11e1..1954ae87cb88 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -350,6 +350,12 @@ static inline void task_seccomp(struct seq_file *m, struct task_struct *p) seq_putc(m, '\n'); } +static inline void task_idle(struct seq_file *m, struct task_struct *p) +{ + seq_put_decimal_ull(m, "Idle:\t", p->oom_target); + seq_putc(m, '\n'); +} + static inline void task_context_switch_counts(struct seq_file *m, struct task_struct *p) { @@ -381,6 +387,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns, task_sig(m, task); task_cap(m, task); task_seccomp(m, task); + task_idle(m, task); task_cpus_allowed(m, task); cpuset_task_status_allowed(m, task); task_context_switch_counts(m, task); diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 69966c461d1c..02eb92e7eff5 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -30,6 +30,7 @@ #include <linux/vmstat.h> #include <linux/writeback.h> #include <linux/page-flags.h> +#include <linux/wait.h> struct mem_cgroup; struct page; @@ -261,6 +262,8 @@ struct mem_cgroup { struct list_head event_list; spinlock_t event_list_lock; + wait_queue_head_t oom_target; + struct mem_cgroup_per_node *nodeinfo[0]; /* WARNING: nodeinfo must be the last member here */ }; diff --git a/include/linux/oom.h b/include/linux/oom.h index 
01c91d874a57..88acea9e0a59 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -102,6 +102,10 @@ extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); +extern void exit_oom_target(void); +struct wait_queue_head *oom_target_get_wait(void); +int oom_target_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key); + /* sysctls */ extern int sysctl_oom_dump_tasks; extern int sysctl_oom_kill_allocating_task; diff --git a/include/linux/sched.h b/include/linux/sched.h index fdf74f27acf1..51b0e5987e8c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -652,6 +652,7 @@ struct task_struct { /* disallow userland-initiated cgroup migration */ unsigned no_cgroup_migration:1; #endif + unsigned oom_target:1; unsigned long atomic_flags; /* Flags requiring atomic access. */ diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index b640071421f7..94868317c6f2 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -198,4 +198,8 @@ struct prctl_mm_map { # define PR_CAP_AMBIENT_LOWER 3 # define PR_CAP_AMBIENT_CLEAR_ALL 4 +#define PR_SET_IDLE 48 +#define PR_GET_IDLE 49 +# define PR_IDLE_MODE_KILLME 1 + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 44857278eb8a..081bcd84a8d0 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -55,6 +55,8 @@ #include <linux/nsproxy.h> #include <linux/file.h> #include <net/sock.h> +#include <linux/oom.h> +#include <linux/memcontrol.h> #define CREATE_TRACE_POINTS #include <trace/events/cgroup.h> @@ -756,6 +758,9 @@ static void css_set_move_task(struct task_struct *task, struct css_set *from_cset, struct css_set *to_cset, bool use_mg_tasks) { +#ifdef CONFIG_MEMCG + struct mem_cgroup *mcg; +#endif lockdep_assert_held(&css_set_lock); if (to_cset && !css_set_populated(to_cset)) @@ -779,6 +784,35 @@ static void css_set_move_task(struct task_struct *task, 
css_task_iter_advance(it); list_del_init(&task->cg_list); +#ifdef CONFIG_MEMCG + /* dequeue from memcg->oom_target + * TODO: this is O(n), add rb-tree to make it O(logn) + */ + mcg = mem_cgroup_from_task(task); + if (mcg) { + struct wait_queue_entry *wait; + + spin_lock(&mcg->oom_target.lock); + if (!waitqueue_active(&mcg->oom_target)) + goto empty_from; + wait = list_first_entry(&mcg->oom_target.head, + wait_queue_entry_t, entry); + do { + struct list_head *list; + + if (wait->private == task) + __remove_wait_queue(&mcg->oom_target, + wait); + list = wait->entry.next; + if (list_is_last(list, &mcg->oom_target.head)) + break; + wait = list_entry(list, + struct wait_queue_entry, entry); + } while (1); +empty_from: + spin_unlock(&mcg->oom_target.lock); + } +#endif if (!css_set_populated(from_cset)) css_set_update_populated(from_cset, false); } else { @@ -797,6 +831,33 @@ static void css_set_move_task(struct task_struct *task, rcu_assign_pointer(task->cgroups, to_cset); list_add_tail(&task->cg_list, use_mg_tasks ? 
&to_cset->mg_tasks : &to_cset->tasks); +#ifdef CONFIG_MEMCG + /* dequeue from memcg->oom_target */ + mcg = mem_cgroup_from_task(task); + if (mcg) { + struct wait_queue_entry *wait; + + spin_lock(&mcg->oom_target.lock); + if (!waitqueue_active(&mcg->oom_target)) + goto empty_to; + wait = list_first_entry(&mcg->oom_target.head, + wait_queue_entry_t, entry); + do { + struct list_head *list; + + if (wait->private == task) + __add_wait_queue(&mcg->oom_target, + wait); + list = wait->entry.next; + if (list_is_last(list, &mcg->oom_target.head)) + break; + wait = list_entry(list, + struct wait_queue_entry, entry); + } while (1); +empty_to: + spin_unlock(&mcg->oom_target.lock); + } +#endif } } diff --git a/kernel/exit.c b/kernel/exit.c index f6cad39f35df..2788fbdae267 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -62,6 +62,7 @@ #include <linux/random.h> #include <linux/rcuwait.h> #include <linux/compat.h> +#include <linux/eventpoll.h> #include <linux/uaccess.h> #include <asm/unistd.h> diff --git a/kernel/sys.c b/kernel/sys.c index 524a4cb9bbe2..e1eb049a85e6 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2386,6 +2386,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; + case PR_SET_IDLE: + if (!((arg2 == 0) || (arg2 == PR_IDLE_MODE_KILLME))) + return -EINVAL; + me->oom_target = arg2; + error = 0; + break; + case PR_GET_IDLE: + error = me->oom_target; + break; default: error = -EINVAL; break; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 661f046ad318..a4e3b93aeccd 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4300,6 +4300,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) memory_cgrp_subsys.broken_hierarchy = true; } + init_waitqueue_head(&memcg->oom_target); + /* The following stuff does not apply to the root */ if (!parent) { root_mem_cgroup = memcg; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index dee0f75c3013..c5d8f5a716bc 100644 --- a/mm/oom_kill.c 
+++ b/mm/oom_kill.c @@ -41,6 +41,9 @@ #include <linux/kthread.h> #include <linux/init.h> #include <linux/mmu_notifier.h> +#include <linux/eventpoll.h> +#include <linux/wait.h> +#include <linux/memcontrol.h> #include <asm/tlb.h> #include "internal.h" @@ -54,6 +57,23 @@ int sysctl_oom_dump_tasks = 1; DEFINE_MUTEX(oom_lock); +static DECLARE_WAIT_QUEUE_HEAD(oom_target); + +/* Clean up after a EPOLL_KILLME process quits. + * Called by kernel/exit.c. + */ +void exit_oom_target(void) +{ + DECLARE_WAITQUEUE(wait, current); + + remove_wait_queue(&oom_target, &wait); +} + +inline struct wait_queue_head *oom_target_get_wait() +{ + return &oom_target; +} + #ifdef CONFIG_NUMA /** * has_intersects_mems_allowed() - check task eligiblity for kill @@ -994,6 +1014,18 @@ int unregister_oom_notifier(struct notifier_block *nb) } EXPORT_SYMBOL_GPL(unregister_oom_notifier); +int oom_target_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key) +{ + struct task_struct *ts = wait->private; + + /* We use SIGKILL instead of the oom killer + * so as to cleanly interrupt ep_poll() + */ + pr_info("Killing pid %u from prctl(PR_SET_IDLE) death row.\n", ts->pid); + send_sig(SIGKILL, ts, 1); + return 0; +} + /** * out_of_memory - kill the "best" process when we run out of memory * @oc: pointer to struct oom_control @@ -1007,6 +1039,7 @@ bool out_of_memory(struct oom_control *oc) { unsigned long freed = 0; enum oom_constraint constraint = CONSTRAINT_NONE; + wait_queue_head_t *w; if (oom_killer_disabled) return false; @@ -1056,6 +1089,20 @@ bool out_of_memory(struct oom_control *oc) return true; } + /* + * Check death row for current memcg or global. + */ +#ifdef CONFIG_MEMCG + if (is_memcg_oom(oc)) + w = &oc->memcg->oom_target; + else +#endif + w = oom_target_get_wait(); + if (waitqueue_active(w)) { + wake_up(w); + return true; + } + select_bad_process(oc); /* Found nothing?!?! Either we hang forever, or we panic. 
*/ if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { -- 2.14.1
* Re: [RFC v3] It is common for services to be stateless around their main event loop. If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it signals to the kernel that epoll_wait() and friends may not complete, and the kernel may send SIGKILL if resources get tight. 2017-11-21 4:49 ` Shawn Landden @ 2017-11-21 4:56 ` Shawn Landden -1 siblings, 0 replies; 58+ messages in thread From: Shawn Landden @ 2017-11-21 4:56 UTC (permalink / raw) To: Shawn Landden Cc: linux-kernel, linux-fsdevel, linux-mm, linux-api, Michal Hocko, willy On Mon, Nov 20, 2017 at 8:49 PM, Shawn Landden <slandden@gmail.com> wrote: > See my systemd patch: https://github.com/shawnl/systemd/tree/prctl > > Android uses this memory model for all programs, and having it in the > kernel will enable integration with the page cache (not in this > series). > > v2 > switch to prctl, memcg support > > v3 > use <linux/wait.h> > put OOM after constraint checking > --- > fs/eventpoll.c | 27 ++++++++++++++++++++ > fs/proc/array.c | 7 ++++++ > include/linux/memcontrol.h | 3 +++ > include/linux/oom.h | 4 +++ > include/linux/sched.h | 1 + > include/uapi/linux/prctl.h | 4 +++ > kernel/cgroup/cgroup.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++ > kernel/exit.c | 1 + > kernel/sys.c | 9 +++++++ > mm/memcontrol.c | 2 ++ > mm/oom_kill.c | 47 +++++++++++++++++++++++++++++++++++ > 11 files changed, 166 insertions(+) > > diff --git a/fs/eventpoll.c b/fs/eventpoll.c > index 2fabd19cdeea..745662f9a7e1 100644 > --- a/fs/eventpoll.c > +++ b/fs/eventpoll.c > @@ -43,6 +43,8 @@ > #include <linux/compat.h> > #include <linux/rculist.h> > #include <net/busy_poll.h> > +#include <linux/memcontrol.h> > +#include <linux/oom.h> > > /* > * LOCKING: > @@ -1761,6 +1763,19 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, > u64 slack = 0; > wait_queue_entry_t wait; > ktime_t expires, *to = NULL; > + DEFINE_WAIT_FUNC(oom_target_wait, oom_target_callback); > + DEFINE_WAIT_FUNC(oom_target_wait_mcg, 
oom_target_callback); > + > + if (current->oom_target) { > +#ifdef CONFIG_MEMCG > + struct mem_cgroup *mcg; > + > + mcg = mem_cgroup_from_task(current); > + if (mcg) > + add_wait_queue(&mcg->oom_target, &oom_target_wait_mcg); > +#endif > + add_wait_queue(oom_target_get_wait(), &oom_target_wait); > + } > > if (timeout > 0) { > struct timespec64 end_time = ep_set_mstimeout(timeout); > @@ -1850,6 +1865,18 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, > !(res = ep_send_events(ep, events, maxevents)) && !timed_out) > goto fetch_events; > > + if (current->oom_target) { > +#ifdef CONFIG_MEMCG > + struct mem_cgroup *mcg; > + > + mcg = mem_cgroup_from_task(current); > + if (mcg) > + remove_wait_queue(&mcg->oom_target, > + &oom_target_wait_mcg); > +#endif > + remove_wait_queue(oom_target_get_wait(), &oom_target_wait); > + } > + > return res; > } > > diff --git a/fs/proc/array.c b/fs/proc/array.c > index 9390032a11e1..1954ae87cb88 100644 > --- a/fs/proc/array.c > +++ b/fs/proc/array.c > @@ -350,6 +350,12 @@ static inline void task_seccomp(struct seq_file *m, struct task_struct *p) > seq_putc(m, '\n'); > } > > +static inline void task_idle(struct seq_file *m, struct task_struct *p) > +{ > + seq_put_decimal_ull(m, "Idle:\t", p->oom_target); > + seq_putc(m, '\n'); > +} > + > static inline void task_context_switch_counts(struct seq_file *m, > struct task_struct *p) > { > @@ -381,6 +387,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns, > task_sig(m, task); > task_cap(m, task); > task_seccomp(m, task); > + task_idle(m, task); > task_cpus_allowed(m, task); > cpuset_task_status_allowed(m, task); > task_context_switch_counts(m, task); > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 69966c461d1c..02eb92e7eff5 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -30,6 +30,7 @@ > #include <linux/vmstat.h> > #include <linux/writeback.h> > #include <linux/page-flags.h> > 
+#include <linux/wait.h> > > struct mem_cgroup; > struct page; > @@ -261,6 +262,8 @@ struct mem_cgroup { > struct list_head event_list; > spinlock_t event_list_lock; > > + wait_queue_head_t oom_target; > + > struct mem_cgroup_per_node *nodeinfo[0]; > /* WARNING: nodeinfo must be the last member here */ > }; > diff --git a/include/linux/oom.h b/include/linux/oom.h > index 01c91d874a57..88acea9e0a59 100644 > --- a/include/linux/oom.h > +++ b/include/linux/oom.h > @@ -102,6 +102,10 @@ extern void oom_killer_enable(void); > > extern struct task_struct *find_lock_task_mm(struct task_struct *p); > > +extern void exit_oom_target(void); > +struct wait_queue_head *oom_target_get_wait(void); > +int oom_target_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key); > + > /* sysctls */ > extern int sysctl_oom_dump_tasks; > extern int sysctl_oom_kill_allocating_task; > diff --git a/include/linux/sched.h b/include/linux/sched.h > index fdf74f27acf1..51b0e5987e8c 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -652,6 +652,7 @@ struct task_struct { > /* disallow userland-initiated cgroup migration */ > unsigned no_cgroup_migration:1; > #endif > + unsigned oom_target:1; > > unsigned long atomic_flags; /* Flags requiring atomic access. 
*/ > > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h > index b640071421f7..94868317c6f2 100644 > --- a/include/uapi/linux/prctl.h > +++ b/include/uapi/linux/prctl.h > @@ -198,4 +198,8 @@ struct prctl_mm_map { > # define PR_CAP_AMBIENT_LOWER 3 > # define PR_CAP_AMBIENT_CLEAR_ALL 4 > > +#define PR_SET_IDLE 48 > +#define PR_GET_IDLE 49 > +# define PR_IDLE_MODE_KILLME 1 > + > #endif /* _LINUX_PRCTL_H */ > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index 44857278eb8a..081bcd84a8d0 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -55,6 +55,8 @@ > #include <linux/nsproxy.h> > #include <linux/file.h> > #include <net/sock.h> > +#include <linux/oom.h> > +#include <linux/memcontrol.h> > > #define CREATE_TRACE_POINTS > #include <trace/events/cgroup.h> > @@ -756,6 +758,9 @@ static void css_set_move_task(struct task_struct *task, > struct css_set *from_cset, struct css_set *to_cset, > bool use_mg_tasks) > { > +#ifdef CONFIG_MEMCG > + struct mem_cgroup *mcg; > +#endif > lockdep_assert_held(&css_set_lock); > > if (to_cset && !css_set_populated(to_cset)) > @@ -779,6 +784,35 @@ static void css_set_move_task(struct task_struct *task, > css_task_iter_advance(it); > > list_del_init(&task->cg_list); > +#ifdef CONFIG_MEMCG > + /* dequeue from memcg->oom_target Ahh this is all shitty here. Sorry for the noise of this shit. 
> + * TODO: this is O(n), add rb-tree to make it O(logn) > + */ > + mcg = mem_cgroup_from_task(task); > + if (mcg) { > + struct wait_queue_entry *wait; > + > + spin_lock(&mcg->oom_target.lock); > + if (!waitqueue_active(&mcg->oom_target)) > + goto empty_from; > + wait = list_first_entry(&mcg->oom_target.head, > + wait_queue_entry_t, entry); > + do { > + struct list_head *list; > + > + if (wait->private == task) > + __remove_wait_queue(&mcg->oom_target, > + wait); > + list = wait->entry.next; > + if (list_is_last(list, &mcg->oom_target.head)) > + break; > + wait = list_entry(list, > + struct wait_queue_entry, entry); > + } while (1); > +empty_from: > + spin_unlock(&mcg->oom_target.lock); > + } > +#endif > if (!css_set_populated(from_cset)) > css_set_update_populated(from_cset, false); > } else { > @@ -797,6 +831,33 @@ static void css_set_move_task(struct task_struct *task, > rcu_assign_pointer(task->cgroups, to_cset); > list_add_tail(&task->cg_list, use_mg_tasks ? &to_cset->mg_tasks : > &to_cset->tasks); > +#ifdef CONFIG_MEMCG > + /* dequeue from memcg->oom_target */ > + mcg = mem_cgroup_from_task(task); > + if (mcg) { > + struct wait_queue_entry *wait; > + > + spin_lock(&mcg->oom_target.lock); > + if (!waitqueue_active(&mcg->oom_target)) > + goto empty_to; > + wait = list_first_entry(&mcg->oom_target.head, > + wait_queue_entry_t, entry); > + do { > + struct list_head *list; > + > + if (wait->private == task) > + __add_wait_queue(&mcg->oom_target, > + wait); > + list = wait->entry.next; > + if (list_is_last(list, &mcg->oom_target.head)) > + break; > + wait = list_entry(list, > + struct wait_queue_entry, entry); > + } while (1); > +empty_to: > + spin_unlock(&mcg->oom_target.lock); > + } > +#endif > } > } > > diff --git a/kernel/exit.c b/kernel/exit.c > index f6cad39f35df..2788fbdae267 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -62,6 +62,7 @@ > #include <linux/random.h> > #include <linux/rcuwait.h> > #include <linux/compat.h> > +#include 
<linux/eventpoll.h> > > #include <linux/uaccess.h> > #include <asm/unistd.h> > diff --git a/kernel/sys.c b/kernel/sys.c > index 524a4cb9bbe2..e1eb049a85e6 100644 > --- a/kernel/sys.c > +++ b/kernel/sys.c > @@ -2386,6 +2386,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, > case PR_GET_FP_MODE: > error = GET_FP_MODE(me); > break; > + case PR_SET_IDLE: > + if (!((arg2 == 0) || (arg2 == PR_IDLE_MODE_KILLME))) > + return -EINVAL; > + me->oom_target = arg2; > + error = 0; > + break; > + case PR_GET_IDLE: > + error = me->oom_target; > + break; > default: > error = -EINVAL; > break; > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 661f046ad318..a4e3b93aeccd 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -4300,6 +4300,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) > memory_cgrp_subsys.broken_hierarchy = true; > } > > + init_waitqueue_head(&memcg->oom_target); > + > /* The following stuff does not apply to the root */ > if (!parent) { > root_mem_cgroup = memcg; > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index dee0f75c3013..c5d8f5a716bc 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -41,6 +41,9 @@ > #include <linux/kthread.h> > #include <linux/init.h> > #include <linux/mmu_notifier.h> > +#include <linux/eventpoll.h> > +#include <linux/wait.h> > +#include <linux/memcontrol.h> > > #include <asm/tlb.h> > #include "internal.h" > @@ -54,6 +57,23 @@ int sysctl_oom_dump_tasks = 1; > > DEFINE_MUTEX(oom_lock); > > +static DECLARE_WAIT_QUEUE_HEAD(oom_target); > + > +/* Clean up after a EPOLL_KILLME process quits. > + * Called by kernel/exit.c. 
> + */ > +void exit_oom_target(void) > +{ > + DECLARE_WAITQUEUE(wait, current); > + > + remove_wait_queue(&oom_target, &wait); > +} > + > +inline struct wait_queue_head *oom_target_get_wait() > +{ > + return &oom_target; > +} > + > #ifdef CONFIG_NUMA > /** > * has_intersects_mems_allowed() - check task eligiblity for kill > @@ -994,6 +1014,18 @@ int unregister_oom_notifier(struct notifier_block *nb) > } > EXPORT_SYMBOL_GPL(unregister_oom_notifier); > > +int oom_target_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key) > +{ > + struct task_struct *ts = wait->private; > + > + /* We use SIGKILL instead of the oom killer > + * so as to cleanly interrupt ep_poll() > + */ > + pr_info("Killing pid %u from prctl(PR_SET_IDLE) death row.\n", ts->pid); > + send_sig(SIGKILL, ts, 1); > + return 0; > +} > + > /** > * out_of_memory - kill the "best" process when we run out of memory > * @oc: pointer to struct oom_control > @@ -1007,6 +1039,7 @@ bool out_of_memory(struct oom_control *oc) > { > unsigned long freed = 0; > enum oom_constraint constraint = CONSTRAINT_NONE; > + wait_queue_head_t *w; > > if (oom_killer_disabled) > return false; > @@ -1056,6 +1089,20 @@ bool out_of_memory(struct oom_control *oc) > return true; > } > > + /* > + * Check death row for current memcg or global. > + */ > +#ifdef CONFIG_MEMCG > + if (is_memcg_oom(oc)) > + w = &oc->memcg->oom_target; > + else > +#endif > + w = oom_target_get_wait(); > + if (waitqueue_active(w)) { > + wake_up(w); > + return true; > + } > + > select_bad_process(oc); > /* Found nothing?!?! Either we hang forever, or we panic. */ > if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { > -- > 2.14.1 > ^ permalink raw reply [flat|nested] 58+ messages in thread
* [RFC v4] It is common for services to be stateless around their main event loop. If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it signals to the kernel that epoll_wait() and friends may not complete, and the kernel may send SIGKILL if resources get tight. 2017-11-21 4:49 ` Shawn Landden (?) @ 2017-11-21 5:16 ` Shawn Landden -1 siblings, 0 replies; 58+ messages in thread From: Shawn Landden @ 2017-11-21 5:16 UTC (permalink / raw) Cc: linux-kernel, linux-fsdevel, linux-mm, linux-api, mhocko, willy, Shawn Landden See my systemd patch: https://github.com/shawnl/systemd/tree/prctl Android uses this memory model for all programs, and having it in the kernel will enable integration with the page cache (not in this series). v2 switch to prctl, memcg support v3 use <linux/wait.h> put OOM after constraint checking v4 ignore memcg OOMs as should have been all along (sry for the noise) --- fs/eventpoll.c | 9 +++++++++ fs/proc/array.c | 7 +++++++ include/linux/memcontrol.h | 1 + include/linux/oom.h | 4 ++++ include/linux/sched.h | 1 + include/uapi/linux/prctl.h | 4 ++++ kernel/exit.c | 1 + kernel/sys.c | 9 +++++++++ mm/oom_kill.c | 43 +++++++++++++++++++++++++++++++++++++++++++ 9 files changed, 79 insertions(+) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 2fabd19cdeea..5b3f084b22d5 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -43,6 +43,8 @@ #include <linux/compat.h> #include <linux/rculist.h> #include <net/busy_poll.h> +#include <linux/memcontrol.h> +#include <linux/oom.h> /* * LOCKING: @@ -1761,6 +1763,10 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, u64 slack = 0; wait_queue_entry_t wait; ktime_t expires, *to = NULL; + DEFINE_WAIT_FUNC(oom_target_wait, oom_target_callback); + + if (current->oom_target) + add_wait_queue(oom_target_get_wait(), &oom_target_wait); if (timeout > 0) { struct timespec64 end_time = ep_set_mstimeout(timeout); @@ -1850,6 +1856,9 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event 
__user *events, !(res = ep_send_events(ep, events, maxevents)) && !timed_out) goto fetch_events; + if (current->oom_target) + remove_wait_queue(oom_target_get_wait(), &oom_target_wait); + return res; } diff --git a/fs/proc/array.c b/fs/proc/array.c index 9390032a11e1..1954ae87cb88 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -350,6 +350,12 @@ static inline void task_seccomp(struct seq_file *m, struct task_struct *p) seq_putc(m, '\n'); } +static inline void task_idle(struct seq_file *m, struct task_struct *p) +{ + seq_put_decimal_ull(m, "Idle:\t", p->oom_target); + seq_putc(m, '\n'); +} + static inline void task_context_switch_counts(struct seq_file *m, struct task_struct *p) { @@ -381,6 +387,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns, task_sig(m, task); task_cap(m, task); task_seccomp(m, task); + task_idle(m, task); task_cpus_allowed(m, task); cpuset_task_status_allowed(m, task); task_context_switch_counts(m, task); diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 69966c461d1c..471d1d52ae72 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -30,6 +30,7 @@ #include <linux/vmstat.h> #include <linux/writeback.h> #include <linux/page-flags.h> +#include <linux/wait.h> struct mem_cgroup; struct page; diff --git a/include/linux/oom.h b/include/linux/oom.h index 01c91d874a57..88acea9e0a59 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -102,6 +102,10 @@ extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); +extern void exit_oom_target(void); +struct wait_queue_head *oom_target_get_wait(void); +int oom_target_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key); + /* sysctls */ extern int sysctl_oom_dump_tasks; extern int sysctl_oom_kill_allocating_task; diff --git a/include/linux/sched.h b/include/linux/sched.h index fdf74f27acf1..51b0e5987e8c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h 
@@ -652,6 +652,7 @@ struct task_struct { /* disallow userland-initiated cgroup migration */ unsigned no_cgroup_migration:1; #endif + unsigned oom_target:1; unsigned long atomic_flags; /* Flags requiring atomic access. */ diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index b640071421f7..94868317c6f2 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -198,4 +198,8 @@ struct prctl_mm_map { # define PR_CAP_AMBIENT_LOWER 3 # define PR_CAP_AMBIENT_CLEAR_ALL 4 +#define PR_SET_IDLE 48 +#define PR_GET_IDLE 49 +# define PR_IDLE_MODE_KILLME 1 + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/exit.c b/kernel/exit.c index f6cad39f35df..2788fbdae267 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -62,6 +62,7 @@ #include <linux/random.h> #include <linux/rcuwait.h> #include <linux/compat.h> +#include <linux/eventpoll.h> #include <linux/uaccess.h> #include <asm/unistd.h> diff --git a/kernel/sys.c b/kernel/sys.c index 524a4cb9bbe2..e1eb049a85e6 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2386,6 +2386,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; + case PR_SET_IDLE: + if (!((arg2 == 0) || (arg2 == PR_IDLE_MODE_KILLME))) + return -EINVAL; + me->oom_target = arg2; + error = 0; + break; + case PR_GET_IDLE: + error = me->oom_target; + break; default: error = -EINVAL; break; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index dee0f75c3013..73ad7ee47c8e 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -41,6 +41,8 @@ #include <linux/kthread.h> #include <linux/init.h> #include <linux/mmu_notifier.h> +#include <linux/eventpoll.h> +#include <linux/wait.h> #include <asm/tlb.h> #include "internal.h" @@ -54,6 +56,23 @@ int sysctl_oom_dump_tasks = 1; DEFINE_MUTEX(oom_lock); +static DECLARE_WAIT_QUEUE_HEAD(oom_target); + +/* Clean up after a EPOLL_KILLME process quits. + * Called by kernel/exit.c. 
+ */ +void exit_oom_target(void) +{ + DECLARE_WAITQUEUE(wait, current); + + remove_wait_queue(&oom_target, &wait); +} + +inline struct wait_queue_head *oom_target_get_wait() +{ + return &oom_target; +} + #ifdef CONFIG_NUMA /** * has_intersects_mems_allowed() - check task eligiblity for kill @@ -994,6 +1013,18 @@ int unregister_oom_notifier(struct notifier_block *nb) } EXPORT_SYMBOL_GPL(unregister_oom_notifier); +int oom_target_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key) +{ + struct task_struct *ts = wait->private; + + /* We use SIGKILL instead of the oom killer + * so as to cleanly interrupt ep_poll() + */ + pr_debug("Killing pid %u from prctl(PR_SET_IDLE) death row.\n", ts->pid); + send_sig(SIGKILL, ts, 1); + return 0; +} + /** * out_of_memory - kill the "best" process when we run out of memory * @oc: pointer to struct oom_control @@ -1007,6 +1038,7 @@ bool out_of_memory(struct oom_control *oc) { unsigned long freed = 0; enum oom_constraint constraint = CONSTRAINT_NONE; + wait_queue_head_t *w; if (oom_killer_disabled) return false; @@ -1056,6 +1088,17 @@ bool out_of_memory(struct oom_control *oc) return true; } + /* + * Check death row for current memcg or global. + */ + if (!is_memcg_oom(oc)) { + w = oom_target_get_wait(); + if (waitqueue_active(w)) { + wake_up(w); + return true; + } + } + select_bad_process(oc); /* Found nothing?!?! Either we hang forever, or we panic. */ if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { -- 2.14.1 ^ permalink raw reply related [flat|nested] 58+ messages in thread
* [RFC v4] It is common for services to be stateless around their main event loop. If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it signals to the kernel that epoll_wait() and friends may not complete, and the kernel may send SIGKILL if resources get tight. @ 2017-11-21 5:16 ` Shawn Landden 0 siblings, 0 replies; 58+ messages in thread From: Shawn Landden @ 2017-11-21 5:16 UTC (permalink / raw) Cc: linux-kernel, linux-fsdevel, linux-mm, linux-api, mhocko, willy, Shawn Landden See my systemd patch: https://github.com/shawnl/systemd/tree/prctl Android uses this memory model for all programs, and having it in the kernel will enable integration with the page cache (not in this series). v2 switch to prctl, memcg support v3 use <linux/wait.h> put OOM after constraint checking v4 ignore memcg OOMs as should have been all along (sry for the noise) --- fs/eventpoll.c | 9 +++++++++ fs/proc/array.c | 7 +++++++ include/linux/memcontrol.h | 1 + include/linux/oom.h | 4 ++++ include/linux/sched.h | 1 + include/uapi/linux/prctl.h | 4 ++++ kernel/exit.c | 1 + kernel/sys.c | 9 +++++++++ mm/oom_kill.c | 43 +++++++++++++++++++++++++++++++++++++++++++ 9 files changed, 79 insertions(+) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 2fabd19cdeea..5b3f084b22d5 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -43,6 +43,8 @@ #include <linux/compat.h> #include <linux/rculist.h> #include <net/busy_poll.h> +#include <linux/memcontrol.h> +#include <linux/oom.h> /* * LOCKING: @@ -1761,6 +1763,10 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, u64 slack = 0; wait_queue_entry_t wait; ktime_t expires, *to = NULL; + DEFINE_WAIT_FUNC(oom_target_wait, oom_target_callback); + + if (current->oom_target) + add_wait_queue(oom_target_get_wait(), &oom_target_wait); if (timeout > 0) { struct timespec64 end_time = ep_set_mstimeout(timeout); @@ -1850,6 +1856,9 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, !(res = 
ep_send_events(ep, events, maxevents)) && !timed_out) goto fetch_events; + if (current->oom_target) + remove_wait_queue(oom_target_get_wait(), &oom_target_wait); + return res; } diff --git a/fs/proc/array.c b/fs/proc/array.c index 9390032a11e1..1954ae87cb88 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -350,6 +350,12 @@ static inline void task_seccomp(struct seq_file *m, struct task_struct *p) seq_putc(m, '\n'); } +static inline void task_idle(struct seq_file *m, struct task_struct *p) +{ + seq_put_decimal_ull(m, "Idle:\t", p->oom_target); + seq_putc(m, '\n'); +} + static inline void task_context_switch_counts(struct seq_file *m, struct task_struct *p) { @@ -381,6 +387,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns, task_sig(m, task); task_cap(m, task); task_seccomp(m, task); + task_idle(m, task); task_cpus_allowed(m, task); cpuset_task_status_allowed(m, task); task_context_switch_counts(m, task); diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 69966c461d1c..471d1d52ae72 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -30,6 +30,7 @@ #include <linux/vmstat.h> #include <linux/writeback.h> #include <linux/page-flags.h> +#include <linux/wait.h> struct mem_cgroup; struct page; diff --git a/include/linux/oom.h b/include/linux/oom.h index 01c91d874a57..88acea9e0a59 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -102,6 +102,10 @@ extern void oom_killer_enable(void); extern struct task_struct *find_lock_task_mm(struct task_struct *p); +extern void exit_oom_target(void); +struct wait_queue_head *oom_target_get_wait(void); +int oom_target_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key); + /* sysctls */ extern int sysctl_oom_dump_tasks; extern int sysctl_oom_kill_allocating_task; diff --git a/include/linux/sched.h b/include/linux/sched.h index fdf74f27acf1..51b0e5987e8c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -652,6 +652,7 @@ 
struct task_struct { /* disallow userland-initiated cgroup migration */ unsigned no_cgroup_migration:1; #endif + unsigned oom_target:1; unsigned long atomic_flags; /* Flags requiring atomic access. */ diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h index b640071421f7..94868317c6f2 100644 --- a/include/uapi/linux/prctl.h +++ b/include/uapi/linux/prctl.h @@ -198,4 +198,8 @@ struct prctl_mm_map { # define PR_CAP_AMBIENT_LOWER 3 # define PR_CAP_AMBIENT_CLEAR_ALL 4 +#define PR_SET_IDLE 48 +#define PR_GET_IDLE 49 +# define PR_IDLE_MODE_KILLME 1 + #endif /* _LINUX_PRCTL_H */ diff --git a/kernel/exit.c b/kernel/exit.c index f6cad39f35df..2788fbdae267 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -62,6 +62,7 @@ #include <linux/random.h> #include <linux/rcuwait.h> #include <linux/compat.h> +#include <linux/eventpoll.h> #include <linux/uaccess.h> #include <asm/unistd.h> diff --git a/kernel/sys.c b/kernel/sys.c index 524a4cb9bbe2..e1eb049a85e6 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2386,6 +2386,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, case PR_GET_FP_MODE: error = GET_FP_MODE(me); break; + case PR_SET_IDLE: + if (!((arg2 == 0) || (arg2 == PR_IDLE_MODE_KILLME))) + return -EINVAL; + me->oom_target = arg2; + error = 0; + break; + case PR_GET_IDLE: + error = me->oom_target; + break; default: error = -EINVAL; break; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index dee0f75c3013..73ad7ee47c8e 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -41,6 +41,8 @@ #include <linux/kthread.h> #include <linux/init.h> #include <linux/mmu_notifier.h> +#include <linux/eventpoll.h> +#include <linux/wait.h> #include <asm/tlb.h> #include "internal.h" @@ -54,6 +56,23 @@ int sysctl_oom_dump_tasks = 1; DEFINE_MUTEX(oom_lock); +static DECLARE_WAIT_QUEUE_HEAD(oom_target); + +/* Clean up after a EPOLL_KILLME process quits. + * Called by kernel/exit.c. 
+ */ +void exit_oom_target(void) +{ + DECLARE_WAITQUEUE(wait, current); + + remove_wait_queue(&oom_target, &wait); +} + +inline struct wait_queue_head *oom_target_get_wait() +{ + return &oom_target; +} + #ifdef CONFIG_NUMA /** * has_intersects_mems_allowed() - check task eligiblity for kill @@ -994,6 +1013,18 @@ int unregister_oom_notifier(struct notifier_block *nb) } EXPORT_SYMBOL_GPL(unregister_oom_notifier); +int oom_target_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key) +{ + struct task_struct *ts = wait->private; + + /* We use SIGKILL instead of the oom killer + * so as to cleanly interrupt ep_poll() + */ + pr_debug("Killing pid %u from prctl(PR_SET_IDLE) death row.\n", ts->pid); + send_sig(SIGKILL, ts, 1); + return 0; +} + /** * out_of_memory - kill the "best" process when we run out of memory * @oc: pointer to struct oom_control @@ -1007,6 +1038,7 @@ bool out_of_memory(struct oom_control *oc) { unsigned long freed = 0; enum oom_constraint constraint = CONSTRAINT_NONE; + wait_queue_head_t *w; if (oom_killer_disabled) return false; @@ -1056,6 +1088,17 @@ bool out_of_memory(struct oom_control *oc) return true; } + /* + * Check death row for current memcg or global. + */ + if (!is_memcg_oom(oc)) { + w = oom_target_get_wait(); + if (waitqueue_active(w)) { + wake_up(w); + return true; + } + } + select_bad_process(oc); /* Found nothing?!?! Either we hang forever, or we panic. */ if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) { -- 2.14.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 58+ messages in thread
* Re: [RFC v4] It is common for services to be stateless around their main event loop. If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it signals to the kernel that epoll_wait() and friends may not complete, and the kernel may send SIGKILL if resources get tight. 2017-11-21 5:16 ` Shawn Landden @ 2017-11-21 5:26 ` Shawn Landden -1 siblings, 0 replies; 58+ messages in thread From: Shawn Landden @ 2017-11-21 5:26 UTC (permalink / raw) To: Shawn Landden Cc: linux-kernel, linux-fsdevel, linux-mm, linux-api, Michal Hocko, willy On Mon, Nov 20, 2017 at 9:16 PM, Shawn Landden <slandden@gmail.com> wrote: > See my systemd patch: https://github.com/shawnl/systemd/tree/prctl > > Android uses this memory model for all programs, and having it in the > kernel will enable integration with the page cache (not in this > series). What about having a dedicated way to kill this type of process, instead of overloading the OOM killer? This was suggested by Colin Walters <walters@verbum.org> ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [RFC v4] It is common for services to be stateless around their main event loop. If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it signals to the kernel that epoll_wait() and friends may not complete, and the kernel may send SIGKILL if resources get tight. 2017-11-21 5:16 ` Shawn Landden @ 2017-11-21 9:14 ` Thomas Gleixner -1 siblings, 0 replies; 58+ messages in thread From: Thomas Gleixner @ 2017-11-21 9:14 UTC (permalink / raw) To: Shawn Landden Cc: linux-kernel, linux-fsdevel, linux-mm, linux-api, mhocko, willy On Mon, 20 Nov 2017, Shawn Landden wrote: Please use a short and comprehensible subject line and do not pack a full sentence into it. The sentence wants to be in the change log body. > +static DECLARE_WAIT_QUEUE_HEAD(oom_target); > + > +/* Clean up after a EPOLL_KILLME process quits. > + * Called by kernel/exit.c. It's hardly called by kernel/exit.c and aside of that multi line comments are formatted like this: /* * .... * .... */ > + */ > +void exit_oom_target(void) > +{ > + DECLARE_WAITQUEUE(wait, current); > + > + remove_wait_queue(&oom_target, &wait); This is completely pointless, really. It does: INIT_LIST_HEAD(&wait.entry); spin_lock_irqsave(&oom_target->lock, flags); list_del(&wait->entry); spin_lock_irqrestore(&oom_target->lock, flags); IOW. It's a NOOP. What are you trying to achieve? > +} > + > +inline struct wait_queue_head *oom_target_get_wait() > +{ > + return &oom_target; This wrapper is useless. > +} > + > #ifdef CONFIG_NUMA > /** > * has_intersects_mems_allowed() - check task eligiblity for kill > @@ -994,6 +1013,18 @@ int unregister_oom_notifier(struct notifier_block *nb) > } > EXPORT_SYMBOL_GPL(unregister_oom_notifier); > > +int oom_target_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key) > +{ > + struct task_struct *ts = wait->private; > + > + /* We use SIGKILL instead of the oom killer > + * so as to cleanly interrupt ep_poll() Huch? oom_killer uses SIGKILL as well, it just does it correctly. 
> + */ > + pr_debug("Killing pid %u from prctl(PR_SET_IDLE) death row.\n", ts->pid); > + send_sig(SIGKILL, ts, 1); > + return 0; > +} > + > /** > * out_of_memory - kill the "best" process when we run out of memory > * @oc: pointer to struct oom_control > @@ -1007,6 +1038,7 @@ bool out_of_memory(struct oom_control *oc) > { > unsigned long freed = 0; > enum oom_constraint constraint = CONSTRAINT_NONE; > + wait_queue_head_t *w; > > if (oom_killer_disabled) > return false; > @@ -1056,6 +1088,17 @@ bool out_of_memory(struct oom_control *oc) > return true; > } > > + /* > + * Check death row for current memcg or global. > + */ > + if (!is_memcg_oom(oc)) { > + w = oom_target_get_wait(); > + if (waitqueue_active(w)) { > + wake_up(w); > + return true; > + } > + } Why on earth do you need that extra wait_queue magic? You completely fail to explain in your empty changelog why the existing oom hinting infrastructure is not sufficient. If you can explain why, then there is no reason to have this side channel. Extend/fix the current hinting mechanism and be done with it. Thanks, tglx ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [RFC v2] prctl: prctl(PR_SET_IDLE, PR_IDLE_MODE_KILLME), for stateless idle loops 2017-11-03 6:35 ` Shawn Landden (?) @ 2017-11-22 10:29 ` peter enderborg -1 siblings, 0 replies; 58+ messages in thread From: peter enderborg @ 2017-11-22 10:29 UTC (permalink / raw) To: Shawn Landden; +Cc: linux-kernel, linux-fsdevel, linux-mm, linux-api On 11/03/2017 07:35 AM, Shawn Landden wrote: > It is common for services to be stateless around their main event loop. > If a process sets PR_SET_IDLE to PR_IDLE_MODE_KILLME then it > signals to the kernel that epoll_wait() and friends may not complete, > and the kernel may send SIGKILL if resources get tight. > > See my systemd patch: https://github.com/shawnl/systemd/tree/prctl > > Android uses this memory model for all programs, and having it in the > kernel will enable integration with the page cache (not in this > series). > > 16 bytes per process is kinda spendy, but I want to keep > lru behavior, which mem_score_adj does not allow. When a supervisor, > like Android's user input is keeping track this can be done in user-space. > It could be pulled out of task_struct if an cross-indexing additional > red-black tree is added to support pid-based lookup. What Android version is using systemd? In Android there is an onTrimMemory() callback, sent by the ActivityManager, that you can listen for to make a clean exit. ^ permalink raw reply [flat|nested] 58+ messages in thread