* [PATCH v2 0/4] seccomp trap to userspace @ 2018-05-17 15:12 Tycho Andersen 2018-05-17 15:12 ` [PATCH v2 1/4] seccomp: add a return code to " Tycho Andersen ` (4 more replies) 0 siblings, 5 replies; 20+ messages in thread From: Tycho Andersen @ 2018-05-17 15:12 UTC (permalink / raw) To: linux-kernel, containers Cc: Kees Cook, Andy Lutomirski, Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda, Tobin C . Harding, Tycho Andersen Hi, After a while focusing on other things, I finally managed ot get a v2 of this series prepared. I believe I've addressed all the feedback from v1, except for one major point: switching the communication protocol over the fd to nlattr. I looked into doing this, but the kernel stuff for dealing with nlattr seems to require an skb (via nlmsg_{new,put} and netlink_unicast), which means we need to deal with the netlink sequence numbers, portids, and create a socket protocol. I can do this if we still think nlattr is necessary, but based on looking at it, it seems like a lot of extra code for no real benefit. I've also added support for passing fds. The code itself is simple, but the API could/should probably be different, see patch 4 for discussion. Tycho Tycho Andersen (4): seccomp: add a return code to trap to userspace seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE seccomp: add a way to get a listener fd from ptrace seccomp: add support for passing fds via USER_NOTIF arch/Kconfig | 7 + include/linux/seccomp.h | 14 +- include/uapi/linux/ptrace.h | 2 + include/uapi/linux/seccomp.h | 20 +- kernel/ptrace.c | 4 + kernel/seccomp.c | 480 +++++++++++++++++- tools/testing/selftests/seccomp/seccomp_bpf.c | 359 ++++++++++++- 7 files changed, 878 insertions(+), 8 deletions(-) -- 2.17.0 ^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH v2 1/4] seccomp: add a return code to trap to userspace 2018-05-17 15:12 [PATCH v2 0/4] seccomp trap to userspace Tycho Andersen @ 2018-05-17 15:12 ` Tycho Andersen 2018-05-17 15:33 ` Oleg Nesterov ` (3 more replies) 2018-05-17 15:12 ` [PATCH v2 2/4] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE Tycho Andersen ` (3 subsequent siblings) 4 siblings, 4 replies; 20+ messages in thread From: Tycho Andersen @ 2018-05-17 15:12 UTC (permalink / raw) To: linux-kernel, containers Cc: Kees Cook, Andy Lutomirski, Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda, Tobin C . Harding, Tycho Andersen This patch introduces a means for syscalls matched in seccomp to notify some other task that a particular filter has been triggered. The motivation for this is primarily for use with containers. For example, if a container does an init_module(), we obviously don't want to load this untrusted code, which may be compiled for the wrong version of the kernel anyway. Instead, we could parse the module image, figure out which module the container is trying to load and load it on the host. As another example, containers cannot mknod(), since this checks capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or /dev/zero should be ok for containers to mknod, but we'd like to avoid hard coding some whitelist in the kernel. Another example is mount(), which has many security restrictions for good reason, but configuration or runtime knowledge could potentially be used to relax these restrictions. This patch adds functionality that is already possible via at least two other means that I know about, both of which involve ptrace(): first, one could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL. Unfortunately this is slow, so a faster version would be to install a filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP. Since ptrace allows only one tracer, if the container runtime is that tracer, users inside the container (or outside) trying to debug it will not be able to use ptrace, which is annoying. It also means that older distributions based on Upstart cannot boot inside containers using ptrace, since upstart itself uses ptrace to start services. The actual implementation of this is fairly small, although getting the synchronization right was/is slightly complex. Finally, it's worth noting that the classic seccomp TOCTOU of reading memory data from the task still applies here, but can be avoided with careful design of the userspace handler: if the userspace handler reads all of the task memory that is necessary before applying its security policy, the tracee's subsequent memory edits will not be read by the tracer. v2: * make id a u64; the idea here being that it will never overflow, because 64 is huge (one syscall every nanosecond => wrap every 584 years) * prevent nesting of user notifications: if someone is already attached the tree in one place, nobody else can attach to the tree * notify the listener of signals the tracee receives as well * implement poll Signed-off-by: Tycho Andersen <tycho@tycho.ws> CC: Kees Cook <keescook@chromium.org> CC: Andy Lutomirski <luto@amacapital.net> CC: Oleg Nesterov <oleg@redhat.com> CC: Eric W. Biederman <ebiederm@xmission.com> CC: "Serge E. Hallyn" <serge@hallyn.com> CC: Christian Brauner <christian.brauner@ubuntu.com> CC: Tyler Hicks <tyhicks@canonical.com> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp> --- arch/Kconfig | 7 + include/linux/seccomp.h | 3 +- include/uapi/linux/seccomp.h | 18 +- kernel/seccomp.c | 402 +++++++++++++++++- tools/testing/selftests/seccomp/seccomp_bpf.c | 181 +++++++- 5 files changed, 605 insertions(+), 6 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 8e0d665c8d53..dd99eef3e049 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -401,6 +401,13 @@ config SECCOMP_FILTER See Documentation/prctl/seccomp_filter.txt for details. +config SECCOMP_USER_NOTIFICATION + bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action" + depends on SECCOMP_FILTER + help + Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp + programs to notify a userspace listener that a particular event happened. + config HAVE_GCC_PLUGINS bool help diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index c723a5c4e3ff..0fd3e0676a1c 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -5,7 +5,8 @@ #include <uapi/linux/seccomp.h> #define SECCOMP_FILTER_FLAG_MASK (SECCOMP_FILTER_FLAG_TSYNC | \ - SECCOMP_FILTER_FLAG_LOG) + SECCOMP_FILTER_FLAG_LOG | \ + SECCOMP_FILTER_FLAG_GET_LISTENER) #ifdef CONFIG_SECCOMP diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h index 2a0bd9dd104d..8160e6cad528 100644 --- a/include/uapi/linux/seccomp.h +++ b/include/uapi/linux/seccomp.h @@ -17,8 +17,9 @@ #define SECCOMP_GET_ACTION_AVAIL 2 /* Valid flags for SECCOMP_SET_MODE_FILTER */ -#define SECCOMP_FILTER_FLAG_TSYNC 1 -#define SECCOMP_FILTER_FLAG_LOG 2 +#define SECCOMP_FILTER_FLAG_TSYNC 1 +#define SECCOMP_FILTER_FLAG_LOG 2 +#define SECCOMP_FILTER_FLAG_GET_LISTENER 4 /* * All BPF programs must return a 32-bit value. @@ -34,6 +35,7 @@ #define SECCOMP_RET_KILL SECCOMP_RET_KILL_THREAD #define SECCOMP_RET_TRAP 0x00030000U /* disallow and force a SIGSYS */ #define SECCOMP_RET_ERRNO 0x00050000U /* returns an errno */ +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U /* notifies userspace */ #define SECCOMP_RET_TRACE 0x7ff00000U /* pass to a tracer or disallow */ #define SECCOMP_RET_LOG 0x7ffc0000U /* allow after logging */ #define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */ @@ -59,4 +61,16 @@ struct seccomp_data { __u64 args[6]; }; +struct seccomp_notif { + __u64 id; + pid_t pid; + struct seccomp_data data; +}; + +struct seccomp_notif_resp { + __u64 id; + __s32 error; + __s64 val; +}; + #endif /* _UAPI_LINUX_SECCOMP_H */ diff --git a/kernel/seccomp.c b/kernel/seccomp.c index dc77548167ef..a169a62cb78a 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -38,6 +38,53 @@ #include <linux/tracehook.h> #include <linux/uaccess.h> +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION +#include <linux/file.h> +#include <linux/anon_inodes.h> + +enum notify_state { + SECCOMP_NOTIFY_INIT, + SECCOMP_NOTIFY_SENT, + SECCOMP_NOTIFY_REPLIED, +}; + +struct seccomp_knotif { + /* The pid whose filter triggered the notification */ + pid_t pid; + + /* + * The "cookie" for this request; this is unique for this filter. + */ + u32 id; + + /* + * The seccomp data. This pointer is valid the entire time this + * notification is active, since it comes from __seccomp_filter which + * eclipses the entire lifecycle here. + */ + const struct seccomp_data *data; + + /* + * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a + * struct seccomp_knotif is created and starts out in INIT. Once the + * handler reads the notification off of an FD, it transitions to READ. + * If a signal is received the state transitions back to INIT and + * another message is sent. When the userspace handler replies, state + * transitions to REPLIED. + */ + enum notify_state state; + + /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */ + int error; + long val; + + /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */ + struct completion ready; + + struct list_head list; +}; +#endif + /** * struct seccomp_filter - container for seccomp BPF programs * @@ -64,6 +111,27 @@ struct seccomp_filter { bool log; struct seccomp_filter *prev; struct bpf_prog *prog; + +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION + /* + * A semaphore that users of this notification can wait on for + * changes. Actual reads and writes are still controlled with + * filter->notify_lock. + */ + struct semaphore request; + + /* A lock for all notification-related accesses. */ + struct mutex notify_lock; + + /* Is there currently an attached listener? */ + bool has_listener; + + /* The id of the next request. */ + u64 next_id; + + /* A list of struct seccomp_knotif elements. */ + struct list_head notifications; +#endif }; /* Limit any path through the tree to 256KB worth of instructions. */ @@ -383,6 +451,13 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) if (!sfilter) return ERR_PTR(-ENOMEM); +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION + mutex_init(&sfilter->notify_lock); + sema_init(&sfilter->request, 0); + INIT_LIST_HEAD(&sfilter->notifications); + sfilter->next_id = get_random_u64(); +#endif + ret = bpf_prog_create_from_user(&sfilter->prog, fprog, seccomp_check_filter, save_orig); if (ret < 0) { @@ -547,13 +622,15 @@ static void seccomp_send_sigsys(int syscall, int reason) #define SECCOMP_LOG_TRACE (1 << 4) #define SECCOMP_LOG_LOG (1 << 5) #define SECCOMP_LOG_ALLOW (1 << 6) +#define SECCOMP_LOG_USER_NOTIF (1 << 7) static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS | SECCOMP_LOG_KILL_THREAD | SECCOMP_LOG_TRAP | SECCOMP_LOG_ERRNO | SECCOMP_LOG_TRACE | - SECCOMP_LOG_LOG; + SECCOMP_LOG_LOG | + SECCOMP_LOG_USER_NOTIF; static inline void seccomp_log(unsigned long syscall, long signr, u32 action, bool requested) @@ -572,6 +649,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action, case SECCOMP_RET_TRACE: log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE; break; + case SECCOMP_RET_USER_NOTIF: + log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF; + break; case SECCOMP_RET_LOG: log = seccomp_actions_logged & SECCOMP_LOG_LOG; break; @@ -645,6 +725,91 @@ void secure_computing_strict(int this_syscall) } #else +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION +static u64 seccomp_next_notify_id(struct seccomp_filter *filter) +{ + u64 ret = filter->next_id; + + /* Note: overflow is ok here, the id just needs to be unique */ + filter->next_id++; + + return ret; +} + +static void seccomp_do_user_notification(int this_syscall, + struct seccomp_filter *match, + const struct seccomp_data *sd) +{ + int err; + long ret = 0; + struct seccomp_knotif n = {}; + + mutex_lock(&match->notify_lock); + if (!match->has_listener) { + err = -ENOSYS; + goto out; + } + + n.pid = current->pid; + n.state = SECCOMP_NOTIFY_INIT; + n.data = sd; + n.id = seccomp_next_notify_id(match); + init_completion(&n.ready); + + list_add(&n.list, &match->notifications); + + mutex_unlock(&match->notify_lock); + up(&match->request); + + err = wait_for_completion_interruptible(&n.ready); + mutex_lock(&match->notify_lock); + + /* + * Here it's possible we got a signal and then had to wait on the mutex + * while the reply was sent, so let's be sure there wasn't a response + * in the meantime. + */ + if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) { + /* + * We got a signal. Let's tell userspace about it (potentially + * again, if we had already notified them about the first one). + */ + if (n.state == SECCOMP_NOTIFY_SENT) { + n.state = SECCOMP_NOTIFY_INIT; + up(&match->request); + } + mutex_unlock(&match->notify_lock); + err = wait_for_completion_killable(&n.ready); + mutex_lock(&match->notify_lock); + if (err < 0) + goto remove_list; + } + + ret = n.val; + err = n.error; + + WARN(n.state != SECCOMP_NOTIFY_REPLIED, + "notified about write complete when state is not write"); + +remove_list: + list_del(&n.list); +out: + mutex_unlock(&match->notify_lock); + syscall_set_return_value(current, task_pt_regs(current), + err, ret); +} +#else +static void seccomp_do_user_notification(int this_syscall, + u32 action, + struct seccomp_filter *match, + const struct seccomp_data *sd) +{ + WARN(1, "user notification received, but disabled"); + seccomp_log(this_syscall, SIGSYS, action, true); + do_exit(SIGSYS); +} +#endif + #ifdef CONFIG_SECCOMP_FILTER static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, const bool recheck_after_trace) @@ -722,6 +887,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, return 0; + case SECCOMP_RET_USER_NOTIF: + seccomp_do_user_notification(this_syscall, match, sd); + goto skip; case SECCOMP_RET_LOG: seccomp_log(this_syscall, 0, action, true); return 0; @@ -828,6 +996,11 @@ static long seccomp_set_mode_strict(void) } #ifdef CONFIG_SECCOMP_FILTER +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION +static struct file *init_listener(struct task_struct *, + struct seccomp_filter *); +#endif + /** * seccomp_set_mode_filter: internal function for setting seccomp filter * @flags: flags to change filter behavior @@ -847,6 +1020,8 @@ static long seccomp_set_mode_filter(unsigned int flags, const unsigned long seccomp_mode = SECCOMP_MODE_FILTER; struct seccomp_filter *prepared = NULL; long ret = -EINVAL; + int listener = 0; + struct file *listener_f = NULL; /* Validate flags. */ if (flags & ~SECCOMP_FILTER_FLAG_MASK) @@ -857,13 +1032,28 @@ static long seccomp_set_mode_filter(unsigned int flags, if (IS_ERR(prepared)) return PTR_ERR(prepared); + if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) { + listener = get_unused_fd_flags(O_RDWR); + if (listener < 0) { + ret = listener; + goto out_free; + } + + listener_f = init_listener(current, prepared); + if (IS_ERR(listener_f)) { + put_unused_fd(listener); + ret = PTR_ERR(listener_f); + goto out_free; + } + } + /* * Make sure we cannot change seccomp or nnp state via TSYNC * while another thread is in the middle of calling exec. */ if (flags & SECCOMP_FILTER_FLAG_TSYNC && mutex_lock_killable(¤t->signal->cred_guard_mutex)) - goto out_free; + goto out_put_fd; spin_lock_irq(¤t->sighand->siglock); @@ -881,6 +1071,16 @@ static long seccomp_set_mode_filter(unsigned int flags, spin_unlock_irq(¤t->sighand->siglock); if (flags & SECCOMP_FILTER_FLAG_TSYNC) mutex_unlock(¤t->signal->cred_guard_mutex); +out_put_fd: + if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) { + if (ret < 0) { + fput(listener_f); + put_unused_fd(listener); + } else { + fd_install(listener, listener_f); + ret = listener; + } + } out_free: seccomp_filter_free(prepared); return ret; @@ -909,6 +1109,9 @@ static long seccomp_get_action_avail(const char __user *uaction) case SECCOMP_RET_LOG: case SECCOMP_RET_ALLOW: break; + case SECCOMP_RET_USER_NOTIF: + if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION)) + break; default: return -EOPNOTSUPP; } @@ -1105,6 +1308,7 @@ long seccomp_get_metadata(struct task_struct *task, #define SECCOMP_RET_KILL_THREAD_NAME "kill_thread" #define SECCOMP_RET_TRAP_NAME "trap" #define SECCOMP_RET_ERRNO_NAME "errno" +#define SECCOMP_RET_USER_NOTIF_NAME "user_notif" #define SECCOMP_RET_TRACE_NAME "trace" #define SECCOMP_RET_LOG_NAME "log" #define SECCOMP_RET_ALLOW_NAME "allow" @@ -1114,6 +1318,7 @@ static const char seccomp_actions_avail[] = SECCOMP_RET_KILL_THREAD_NAME " " SECCOMP_RET_TRAP_NAME " " SECCOMP_RET_ERRNO_NAME " " + SECCOMP_RET_USER_NOTIF_NAME " " SECCOMP_RET_TRACE_NAME " " SECCOMP_RET_LOG_NAME " " SECCOMP_RET_ALLOW_NAME; @@ -1131,6 +1336,7 @@ static const struct seccomp_log_name seccomp_log_names[] = { { SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME }, { SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME }, { SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME }, + { SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME }, { } }; @@ -1279,3 +1485,195 @@ static int __init seccomp_sysctl_init(void) device_initcall(seccomp_sysctl_init) #endif /* CONFIG_SYSCTL */ + +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION +static int seccomp_notify_release(struct inode *inode, struct file *file) +{ + struct seccomp_filter *filter = file->private_data; + struct seccomp_knotif *knotif; + + mutex_lock(&filter->notify_lock); + + /* + * If this file is being closed because e.g. the task who owned it + * died, let's wake everyone up who was waiting on us. + */ + list_for_each_entry(knotif, &filter->notifications, list) { + if (knotif->state == SECCOMP_NOTIFY_REPLIED) + continue; + + knotif->state = SECCOMP_NOTIFY_REPLIED; + knotif->error = -ENOSYS; + knotif->val = 0; + + complete(&knotif->ready); + } + + filter->has_listener = false; + mutex_unlock(&filter->notify_lock); + __put_seccomp_filter(filter); + return 0; +} + +static ssize_t seccomp_notify_read(struct file *f, char __user *buf, + size_t size, loff_t *ppos) +{ + struct seccomp_filter *filter = f->private_data; + struct seccomp_knotif *knotif = NULL, *cur; + struct seccomp_notif unotif; + ssize_t ret; + + /* No offset reads. */ + if (*ppos != 0) + return -EINVAL; + + ret = down_interruptible(&filter->request); + if (ret < 0) + return ret; + + mutex_lock(&filter->notify_lock); + list_for_each_entry(cur, &filter->notifications, list) { + if (cur->state == SECCOMP_NOTIFY_INIT) { + knotif = cur; + break; + } + } + + /* + * If we didn't find a notification, it could be that the task was + * interrupted between the time we were woken and when we were able to + * acquire the rw lock. Should we retry here or just -ENOENT? -ENOENT + * for now. + */ + if (!knotif) { + ret = -ENOENT; + goto out; + } + + unotif.id = knotif->id; + unotif.pid = knotif->pid; + unotif.data = *(knotif->data); + + size = min_t(size_t, size, sizeof(struct seccomp_notif)); + if (copy_to_user(buf, &unotif, size)) { + ret = -EFAULT; + goto out; + } + + ret = sizeof(unotif); + knotif->state = SECCOMP_NOTIFY_SENT; + +out: + mutex_unlock(&filter->notify_lock); + return ret; +} + +static ssize_t seccomp_notify_write(struct file *file, const char __user *buf, + size_t size, loff_t *ppos) +{ + struct seccomp_filter *filter = file->private_data; + struct seccomp_notif_resp resp = {}; + struct seccomp_knotif *knotif = NULL; + ssize_t ret = -EINVAL; + + /* No partial writes. */ + if (*ppos != 0) + return -EINVAL; + + size = min_t(size_t, size, sizeof(resp)); + if (copy_from_user(&resp, buf, size)) + return -EFAULT; + + ret = mutex_lock_interruptible(&filter->notify_lock); + if (ret < 0) + return ret; + + list_for_each_entry(knotif, &filter->notifications, list) { + if (knotif->id == resp.id) + break; + } + + if (!knotif || knotif->id != resp.id) { + ret = -EINVAL; + goto out; + } + + /* Allow exactly one reply. */ + if (knotif->state != SECCOMP_NOTIFY_SENT) { + ret = -EINVAL; + goto out; + } + + ret = size; + knotif->state = SECCOMP_NOTIFY_REPLIED; + knotif->error = resp.error; + knotif->val = resp.val; + complete(&knotif->ready); +out: + mutex_unlock(&filter->notify_lock); + return ret; +} + +static __poll_t seccomp_notify_poll(struct file *file, + struct poll_table_struct *poll_tab) +{ + struct seccomp_filter *filter = file->private_data; + __poll_t ret = 0; + struct seccomp_knotif *cur; + + ret = mutex_lock_interruptible(&filter->notify_lock); + if (ret < 0) + return ret; + + list_for_each_entry(cur, &filter->notifications, list) { + if (cur->state == SECCOMP_NOTIFY_INIT) + ret |= EPOLLIN | EPOLLRDNORM; + if (cur->state == SECCOMP_NOTIFY_SENT) + ret |= EPOLLOUT | EPOLLWRNORM; + } + + mutex_unlock(&filter->notify_lock); + + return ret; +} + +static const struct file_operations seccomp_notify_ops = { + .read = seccomp_notify_read, + .write = seccomp_notify_write, + .poll = seccomp_notify_poll, + .release = seccomp_notify_release, +}; + +static struct file *init_listener(struct task_struct *task, + struct seccomp_filter *filter) +{ + struct file *ret = ERR_PTR(-EBUSY); + struct seccomp_filter *cur; + bool have_listener = false; + + for (cur = task->seccomp.filter; cur; cur = cur->prev) { + mutex_lock(&cur->notify_lock); + if (cur->has_listener) + have_listener = true; + } + + if (have_listener) + goto out; + + ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops, + filter, O_RDWR); + if (IS_ERR(ret)) + goto out; + + + /* The file has a reference to it now */ + __get_seccomp_filter(filter); + filter->has_listener = true; + +out: + for (cur = task->seccomp.filter; cur; cur = cur->prev) + mutex_unlock(&cur->notify_lock); + + return ret; +} +#endif diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c index 168c66d74fc5..bb96df66222f 100644 --- a/tools/testing/selftests/seccomp/seccomp_bpf.c +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c @@ -40,10 +40,12 @@ #include <sys/fcntl.h> #include <sys/mman.h> #include <sys/times.h> +#include <sys/socket.h> #define _GNU_SOURCE #include <unistd.h> #include <sys/syscall.h> +#include <poll.h> #include "../kselftest_harness.h" @@ -150,6 +152,24 @@ struct seccomp_metadata { }; #endif +#ifndef SECCOMP_FILTER_FLAG_GET_LISTENER +#define SECCOMP_FILTER_FLAG_GET_LISTENER 4 + +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U + +struct seccomp_notif { + __u64 id; + pid_t pid; + struct seccomp_data data; +}; + +struct seccomp_notif_resp { + __u64 id; + __s32 error; + __s64 val; +}; +#endif + #ifndef seccomp int seccomp(unsigned int op, unsigned int flags, void *args) { @@ -2072,7 +2092,8 @@ TEST(seccomp_syscall_mode_lock) TEST(detect_seccomp_filter_flags) { unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC, - SECCOMP_FILTER_FLAG_LOG }; + SECCOMP_FILTER_FLAG_LOG, + SECCOMP_FILTER_FLAG_GET_LISTENER }; unsigned int flag, all_flags; int i; long ret; @@ -2917,6 +2938,164 @@ TEST(get_metadata) ASSERT_EQ(0, kill(pid, SIGKILL)); } +static int user_trap_syscall(int nr, unsigned int flags) +{ + struct sock_filter filter[] = { + BPF_STMT(BPF_LD+BPF_W+BPF_ABS, + offsetof(struct seccomp_data, nr)), + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1), + BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF), + BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW), + }; + + struct sock_fprog prog = { + .len = (unsigned short)ARRAY_SIZE(filter), + .filter = filter, + }; + + return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog); +} + +static int read_notif(int listener, struct seccomp_notif *req) +{ + int ret; + + do { + errno = 0; + ret = read(listener, req, sizeof(*req)); + } while (ret == -1 && errno == ENOENT); + return ret; +} + +static void signal_handler(int signal) +{ +} + +#define USER_NOTIF_MAGIC 116983961184613L +TEST(get_user_notification_syscall) +{ + pid_t pid; + long ret; + int status, listener; + struct seccomp_notif req = {}; + struct seccomp_notif_resp resp = {}; + struct pollfd pollfd; + + pid = fork(); + ASSERT_GE(pid, 0); + + /* Check that we get -ENOSYS with no listener attached */ + if (pid == 0) { + if (user_trap_syscall(__NR_getpid, 0) < 0) + exit(1); + ret = syscall(__NR_getpid); + exit(ret >= 0 || errno != ENOSYS); + } + + EXPECT_EQ(waitpid(pid, &status, 0), pid); + EXPECT_EQ(true, WIFEXITED(status)); + EXPECT_EQ(0, WEXITSTATUS(status)); + + /* Check that the basic notification machinery works */ + listener = user_trap_syscall(__NR_getpid, + SECCOMP_FILTER_FLAG_GET_LISTENER); + EXPECT_GE(listener, 0); + + /* Installing a second listener in the chain should EBUSY */ + EXPECT_EQ(user_trap_syscall(__NR_getpid, + SECCOMP_FILTER_FLAG_GET_LISTENER), + -1); + EXPECT_EQ(errno, EBUSY); + + pid = fork(); + ASSERT_GE(pid, 0); + + if (pid == 0) { + ret = syscall(__NR_getpid); + exit(ret != USER_NOTIF_MAGIC); + } + + EXPECT_EQ(read(listener, &req, sizeof(req)), sizeof(req)); + + pollfd.fd = listener; + pollfd.events = POLLIN | POLLOUT; + + EXPECT_GT(poll(&pollfd, 1, -1), 0); + EXPECT_EQ(pollfd.revents, POLLOUT); + + EXPECT_EQ(req.data.nr, __NR_getpid); + + resp.id = req.id; + resp.error = 0; + resp.val = USER_NOTIF_MAGIC; + + EXPECT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp)); + + EXPECT_EQ(waitpid(pid, &status, 0), pid); + EXPECT_EQ(true, WIFEXITED(status)); + EXPECT_EQ(0, WEXITSTATUS(status)); + + /* + * Check that nothing bad happens when we kill the task in the middle + * of a syscall. + */ + pid = fork(); + ASSERT_GE(pid, 0); + + if (pid == 0) { + ret = syscall(__NR_getpid); + exit(ret != USER_NOTIF_MAGIC); + } + + ret = read(listener, &req, sizeof(req)); + EXPECT_EQ(ret, sizeof(req)); + + EXPECT_EQ(kill(pid, SIGKILL), 0); + EXPECT_EQ(waitpid(pid, NULL, 0), pid); + + resp.id = req.id; + ret = write(listener, &resp, sizeof(resp)); + EXPECT_EQ(ret, -1); + EXPECT_EQ(errno, EINVAL); + + /* + * Check that we get another notification about a signal in the middle + * of a syscall. + */ + pid = fork(); + ASSERT_GE(pid, 0); + + if (pid == 0) { + if (signal(SIGUSR1, signal_handler) == SIG_ERR) { + perror("signal"); + exit(1); + } + ret = syscall(__NR_getpid); + exit(ret != USER_NOTIF_MAGIC); + } + + ret = read_notif(listener, &req); + EXPECT_EQ(ret, sizeof(req)); + EXPECT_EQ(errno, 0); + + EXPECT_EQ(kill(pid, SIGUSR1), 0); + + ret = read_notif(listener, &req); + EXPECT_EQ(ret, sizeof(req)); + EXPECT_EQ(errno, 0); + + resp.id = req.id; + ret = write(listener, &resp, sizeof(resp)); + EXPECT_EQ(ret, sizeof(resp)); + EXPECT_EQ(errno, 0); + + EXPECT_EQ(waitpid(pid, &status, 0), pid); + EXPECT_EQ(true, WIFEXITED(status)); + EXPECT_EQ(0, WEXITSTATUS(status)); + + close(listener); +} + /* * TODO: * - add microbenchmarks -- 2.17.0 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/4] seccomp: add a return code to trap to userspace 2018-05-17 15:12 ` [PATCH v2 1/4] seccomp: add a return code to " Tycho Andersen @ 2018-05-17 15:33 ` Oleg Nesterov 2018-05-17 15:39 ` Tycho Andersen 2018-05-18 14:04 ` Christian Brauner ` (2 subsequent siblings) 3 siblings, 1 reply; 20+ messages in thread From: Oleg Nesterov @ 2018-05-17 15:33 UTC (permalink / raw) To: Tycho Andersen Cc: linux-kernel, containers, Kees Cook, Andy Lutomirski, Eric W . Biederman, Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda, Tobin C . Harding I didn't read this series yet, and I don't even understand what are you trying to do, just one question... On 05/17, Tycho Andersen wrote: > > +static struct file *init_listener(struct task_struct *task, > + struct seccomp_filter *filter) > +{ > + struct file *ret = ERR_PTR(-EBUSY); > + struct seccomp_filter *cur; > + bool have_listener = false; > + > + for (cur = task->seccomp.filter; cur; cur = cur->prev) { > + mutex_lock(&cur->notify_lock); Did you test this patch with CONFIG_LOCKDEP ? >From lockdep pov this loop tries to take the same lock twice or more, it shoul complain. Oleg. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/4] seccomp: add a return code to trap to userspace 2018-05-17 15:33 ` Oleg Nesterov @ 2018-05-17 15:39 ` Tycho Andersen 2018-05-17 15:46 ` Oleg Nesterov 0 siblings, 1 reply; 20+ messages in thread From: Tycho Andersen @ 2018-05-17 15:39 UTC (permalink / raw) To: Oleg Nesterov Cc: linux-kernel, containers, Kees Cook, Andy Lutomirski, Eric W . Biederman, Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda, Tobin C . Harding Hi Oleg, Thanks for taking a look! On Thu, May 17, 2018 at 05:33:24PM +0200, Oleg Nesterov wrote: > I didn't read this series yet, and I don't even understand what are you > trying to do, just one question... > > On 05/17, Tycho Andersen wrote: > > > > +static struct file *init_listener(struct task_struct *task, > > + struct seccomp_filter *filter) > > +{ > > + struct file *ret = ERR_PTR(-EBUSY); > > + struct seccomp_filter *cur; > > + bool have_listener = false; > > + > > + for (cur = task->seccomp.filter; cur; cur = cur->prev) { > > + mutex_lock(&cur->notify_lock); > > Did you test this patch with CONFIG_LOCKDEP ? Yes, with, CONFIG_LOCKDEP=y CONFIG_DEBUG_LOCKDEP=y CONFIG_DEBUG_ATOMIC_SLEEP=y > From lockdep pov this loop tries to take the same lock twice or more, it shoul > complain. I didn't, but I guess that's because it's not trying to take the same lock twice -- the pointer cur is changing in the loop. Unless I'm misunderstanding what you're saying. The idea behind this code is to lock the entire chain of filters up to the parent so that we can ensure none of them have a listener installed. This is based on a suggestion from Andy last review cycle to not allow two listeners, since nesting would be confusing. Tycho ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/4] seccomp: add a return code to trap to userspace 2018-05-17 15:39 ` Tycho Andersen @ 2018-05-17 15:46 ` Oleg Nesterov 2018-05-24 15:28 ` Tycho Andersen 0 siblings, 1 reply; 20+ messages in thread From: Oleg Nesterov @ 2018-05-17 15:46 UTC (permalink / raw) To: Tycho Andersen Cc: linux-kernel, containers, Kees Cook, Andy Lutomirski, Eric W . Biederman, Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda, Tobin C . Harding On 05/17, Tycho Andersen wrote: > > > From lockdep pov this loop tries to take the same lock twice or more, it shoul > > complain. > > I didn't, but I guess that's because it's not trying to take the same lock > twice -- the pointer cur is changing in the loop. Yes, I see. But this is the same lock for lockdep, it has the same class. Oleg. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/4] seccomp: add a return code to trap to userspace 2018-05-17 15:46 ` Oleg Nesterov @ 2018-05-24 15:28 ` Tycho Andersen 0 siblings, 0 replies; 20+ messages in thread From: Tycho Andersen @ 2018-05-24 15:28 UTC (permalink / raw) To: Oleg Nesterov Cc: linux-kernel, containers, Kees Cook, Andy Lutomirski, Eric W . Biederman, Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda, Tobin C . Harding Hi Oleg, On Thu, May 17, 2018 at 05:46:37PM +0200, Oleg Nesterov wrote: > On 05/17, Tycho Andersen wrote: > > > > > From lockdep pov this loop tries to take the same lock twice or more, it shoul > > > complain. > > > > I didn't, but I guess that's because it's not trying to take the same lock > > twice -- the pointer cur is changing in the loop. > > Yes, I see. But this is the same lock for lockdep, it has the same class. I finally figured this out, I needed CONFIG_PROVE_LOCKING=y too, anyway, I've added the nesting annotations for v3. Thanks! Tycho ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/4] seccomp: add a return code to trap to userspace 2018-05-17 15:12 ` [PATCH v2 1/4] seccomp: add a return code to " Tycho Andersen 2018-05-17 15:33 ` Oleg Nesterov @ 2018-05-18 14:04 ` Christian Brauner 2018-05-18 15:21 ` Tycho Andersen 2018-05-19 0:14 ` kbuild test robot 2018-05-19 5:01 ` kbuild test robot 3 siblings, 1 reply; 20+ messages in thread From: Christian Brauner @ 2018-05-18 14:04 UTC (permalink / raw) To: Tycho Andersen Cc: linux-kernel, containers, Tobin C . Harding, Kees Cook, Akihiro Suda, Oleg Nesterov, Andy Lutomirski, Eric W . Biederman, Christian Brauner, Tyler Hicks On Thu, May 17, 2018 at 09:12:15AM -0600, Tycho Andersen wrote: > This patch introduces a means for syscalls matched in seccomp to notify > some other task that a particular filter has been triggered. > > The motivation for this is primarily for use with containers. For example, > if a container does an init_module(), we obviously don't want to load this > untrusted code, which may be compiled for the wrong version of the kernel > anyway. Instead, we could parse the module image, figure out which module > the container is trying to load and load it on the host. > > As another example, containers cannot mknod(), since this checks > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard > coding some whitelist in the kernel. Another example is mount(), which has > many security restrictions for good reason, but configuration or runtime > knowledge could potentially be used to relax these restrictions. > > This patch adds functionality that is already possible via at least two > other means that I know about, both of which involve ptrace(): first, one > could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL. > Unfortunately this is slow, so a faster version would be to install a > filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP. > Since ptrace allows only one tracer, if the container runtime is that > tracer, users inside the container (or outside) trying to debug it will not > be able to use ptrace, which is annoying. It also means that older > distributions based on Upstart cannot boot inside containers using ptrace, > since upstart itself uses ptrace to start services. > > The actual implementation of this is fairly small, although getting the > synchronization right was/is slightly complex. > > Finally, it's worth noting that the classic seccomp TOCTOU of reading > memory data from the task still applies here, but can be avoided with > careful design of the userspace handler: if the userspace handler reads all > of the task memory that is necessary before applying its security policy, > the tracee's subsequent memory edits will not be read by the tracer. > > v2: * make id a u64; the idea here being that it will never overflow, > because 64 is huge (one syscall every nanosecond => wrap every 584 > years) > * prevent nesting of user notifications: if someone is already attached > the tree in one place, nobody else can attach to the tree > * notify the listener of signals the tracee receives as well > * implement poll > > Signed-off-by: Tycho Andersen <tycho@tycho.ws> > CC: Kees Cook <keescook@chromium.org> > CC: Andy Lutomirski <luto@amacapital.net> > CC: Oleg Nesterov <oleg@redhat.com> > CC: Eric W. Biederman <ebiederm@xmission.com> > CC: "Serge E. Hallyn" <serge@hallyn.com> > CC: Christian Brauner <christian.brauner@ubuntu.com> > CC: Tyler Hicks <tyhicks@canonical.com> > CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp> > --- > arch/Kconfig | 7 + > include/linux/seccomp.h | 3 +- > include/uapi/linux/seccomp.h | 18 +- > kernel/seccomp.c | 402 +++++++++++++++++- > tools/testing/selftests/seccomp/seccomp_bpf.c | 181 +++++++- > 5 files changed, 605 insertions(+), 6 deletions(-) > > diff --git a/arch/Kconfig b/arch/Kconfig > index 8e0d665c8d53..dd99eef3e049 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -401,6 +401,13 @@ config SECCOMP_FILTER > > See Documentation/prctl/seccomp_filter.txt for details. > > +config SECCOMP_USER_NOTIFICATION > + bool "Enable the SECCOMP_RET_USER_NOTIF seccomp action" > + depends on SECCOMP_FILTER > + help > + Enable SECCOMP_RET_USER_NOTIF, a return code which can be used by seccomp > + programs to notify a userspace listener that a particular event happened. > + > config HAVE_GCC_PLUGINS > bool > help > diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h > index c723a5c4e3ff..0fd3e0676a1c 100644 > --- a/include/linux/seccomp.h > +++ b/include/linux/seccomp.h > @@ -5,7 +5,8 @@ > #include <uapi/linux/seccomp.h> > > #define SECCOMP_FILTER_FLAG_MASK (SECCOMP_FILTER_FLAG_TSYNC | \ > - SECCOMP_FILTER_FLAG_LOG) > + SECCOMP_FILTER_FLAG_LOG | \ > + SECCOMP_FILTER_FLAG_GET_LISTENER) > > #ifdef CONFIG_SECCOMP > > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h > index 2a0bd9dd104d..8160e6cad528 100644 > --- a/include/uapi/linux/seccomp.h > +++ b/include/uapi/linux/seccomp.h > @@ -17,8 +17,9 @@ > #define SECCOMP_GET_ACTION_AVAIL 2 > > /* Valid flags for SECCOMP_SET_MODE_FILTER */ > -#define SECCOMP_FILTER_FLAG_TSYNC 1 > -#define SECCOMP_FILTER_FLAG_LOG 2 > +#define SECCOMP_FILTER_FLAG_TSYNC 1 > +#define SECCOMP_FILTER_FLAG_LOG 2 > +#define SECCOMP_FILTER_FLAG_GET_LISTENER 4 > > /* > * All BPF programs must return a 32-bit value. > @@ -34,6 +35,7 @@ > #define SECCOMP_RET_KILL SECCOMP_RET_KILL_THREAD > #define SECCOMP_RET_TRAP 0x00030000U /* disallow and force a SIGSYS */ > #define SECCOMP_RET_ERRNO 0x00050000U /* returns an errno */ > +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U /* notifies userspace */ > #define SECCOMP_RET_TRACE 0x7ff00000U /* pass to a tracer or disallow */ > #define SECCOMP_RET_LOG 0x7ffc0000U /* allow after logging */ > #define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */ > @@ -59,4 +61,16 @@ struct seccomp_data { > __u64 args[6]; > }; > > +struct seccomp_notif { > + __u64 id; > + pid_t pid; > + struct seccomp_data data; > +}; > + > +struct seccomp_notif_resp { > + __u64 id; > + __s32 error; > + __s64 val; > +}; > + > #endif /* _UAPI_LINUX_SECCOMP_H */ > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index dc77548167ef..a169a62cb78a 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -38,6 +38,53 @@ > #include <linux/tracehook.h> > #include <linux/uaccess.h> > > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION > +#include <linux/file.h> > +#include <linux/anon_inodes.h> > + > +enum notify_state { > + SECCOMP_NOTIFY_INIT, > + SECCOMP_NOTIFY_SENT, > + SECCOMP_NOTIFY_REPLIED, > +}; > + > +struct seccomp_knotif { > + /* The pid whose filter triggered the notification */ > + pid_t pid; > + > + /* > + * The "cookie" for this request; this is unique for this filter. > + */ > + u32 id; > + > + /* > + * The seccomp data. This pointer is valid the entire time this > + * notification is active, since it comes from __seccomp_filter which > + * eclipses the entire lifecycle here. > + */ > + const struct seccomp_data *data; > + > + /* > + * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a > + * struct seccomp_knotif is created and starts out in INIT. Once the > + * handler reads the notification off of an FD, it transitions to READ. > + * If a signal is received the state transitions back to INIT and > + * another message is sent. When the userspace handler replies, state > + * transitions to REPLIED. > + */ > + enum notify_state state; > + > + /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */ > + int error; > + long val; > + > + /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */ > + struct completion ready; > + > + struct list_head list; > +}; > +#endif > + > /** > * struct seccomp_filter - container for seccomp BPF programs > * > @@ -64,6 +111,27 @@ struct seccomp_filter { > bool log; > struct seccomp_filter *prev; > struct bpf_prog *prog; > + > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION > + /* > + * A semaphore that users of this notification can wait on for > + * changes. Actual reads and writes are still controlled with > + * filter->notify_lock. > + */ > + struct semaphore request; > + > + /* A lock for all notification-related accesses. */ > + struct mutex notify_lock; > + > + /* Is there currently an attached listener? */ > + bool has_listener; > + > + /* The id of the next request. */ > + u64 next_id; > + > + /* A list of struct seccomp_knotif elements. */ > + struct list_head notifications; > +#endif > }; > > /* Limit any path through the tree to 256KB worth of instructions. */ > @@ -383,6 +451,13 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) > if (!sfilter) > return ERR_PTR(-ENOMEM); > > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION > + mutex_init(&sfilter->notify_lock); > + sema_init(&sfilter->request, 0); > + INIT_LIST_HEAD(&sfilter->notifications); > + sfilter->next_id = get_random_u64(); > +#endif > + > ret = bpf_prog_create_from_user(&sfilter->prog, fprog, > seccomp_check_filter, save_orig); > if (ret < 0) { > @@ -547,13 +622,15 @@ static void seccomp_send_sigsys(int syscall, int reason) > #define SECCOMP_LOG_TRACE (1 << 4) > #define SECCOMP_LOG_LOG (1 << 5) > #define SECCOMP_LOG_ALLOW (1 << 6) > +#define SECCOMP_LOG_USER_NOTIF (1 << 7) > > static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS | > SECCOMP_LOG_KILL_THREAD | > SECCOMP_LOG_TRAP | > SECCOMP_LOG_ERRNO | > SECCOMP_LOG_TRACE | > - SECCOMP_LOG_LOG; > + SECCOMP_LOG_LOG | > + SECCOMP_LOG_USER_NOTIF; > > static inline void seccomp_log(unsigned long syscall, long signr, u32 action, > bool requested) > @@ -572,6 +649,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action, > case SECCOMP_RET_TRACE: > log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE; > break; > + case SECCOMP_RET_USER_NOTIF: > + log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF; > + break; > case SECCOMP_RET_LOG: > log = seccomp_actions_logged & SECCOMP_LOG_LOG; > break; > @@ -645,6 +725,91 @@ void secure_computing_strict(int this_syscall) > } > #else > > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION > +static u64 seccomp_next_notify_id(struct seccomp_filter *filter) > +{ > + u64 ret = filter->next_id; > + > + /* Note: overflow is ok here, the id just needs to be unique */ > + filter->next_id++; > + > + return ret; > +} Nit: Depending on how averse people are to relying on side-effects this could be simplified to: static inline u64 seccomp_next_notify_id(struct seccomp_filter *filter) { /* Note: Overflow is ok. The id just needs to be unique. */ return filter->next_id++; } > + > +static void seccomp_do_user_notification(int this_syscall, > + struct seccomp_filter *match, > + const struct seccomp_data *sd) > +{ > + int err; > + long ret = 0; > + struct seccomp_knotif n = {}; > + > + mutex_lock(&match->notify_lock); > + if (!match->has_listener) { > + err = -ENOSYS; > + goto out; > + } Nit: err = -ENOSYS; mutex_lock(&match->notify_lock); if (!match->has_listener) goto out; looks cleaner to me or you do the err initalization at the top of the function. :) > + > + n.pid = current->pid; > + n.state = SECCOMP_NOTIFY_INIT; > + n.data = sd; > + n.id = seccomp_next_notify_id(match); > + init_completion(&n.ready); > + > + list_add(&n.list, &match->notifications); > + > + mutex_unlock(&match->notify_lock); > + up(&match->request); > + > + err = wait_for_completion_interruptible(&n.ready); > + mutex_lock(&match->notify_lock); > + > + /* > + * Here it's possible we got a signal and then had to wait on the mutex > + * while the reply was sent, so let's be sure there wasn't a response > + * in the meantime. > + */ > + if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) { > + /* > + * We got a signal. Let's tell userspace about it (potentially > + * again, if we had already notified them about the first one). > + */ > + if (n.state == SECCOMP_NOTIFY_SENT) { > + n.state = SECCOMP_NOTIFY_INIT; > + up(&match->request); > + } > + mutex_unlock(&match->notify_lock); > + err = wait_for_completion_killable(&n.ready); > + mutex_lock(&match->notify_lock); > + if (err < 0) > + goto remove_list; > + } > + > + ret = n.val; > + err = n.error; > + > + WARN(n.state != SECCOMP_NOTIFY_REPLIED, > + "notified about write complete when state is not write"); Nit: That message seems a little cryptic. > + > +remove_list: > + list_del(&n.list); > +out: > + mutex_unlock(&match->notify_lock); > + syscall_set_return_value(current, task_pt_regs(current), > + err, ret); > +} > +#else > +static void seccomp_do_user_notification(int this_syscall, > + u32 action, > + struct seccomp_filter *match, > + const struct seccomp_data *sd) > +{ > + WARN(1, "user notification received, but disabled"); Nit: "received unexpected user notification" might be clearer > + seccomp_log(this_syscall, SIGSYS, action, true); > + do_exit(SIGSYS); > +} > +#endif > + > #ifdef CONFIG_SECCOMP_FILTER > static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, > const bool recheck_after_trace) > @@ -722,6 +887,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, > > return 0; > > + case SECCOMP_RET_USER_NOTIF: > + seccomp_do_user_notification(this_syscall, match, sd); > + goto skip; > case SECCOMP_RET_LOG: > seccomp_log(this_syscall, 0, action, true); > return 0; > @@ -828,6 +996,11 @@ static long seccomp_set_mode_strict(void) > } > > #ifdef CONFIG_SECCOMP_FILTER > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION > +static struct file *init_listener(struct task_struct *, > + struct seccomp_filter *); > +#endif > + > /** > * seccomp_set_mode_filter: internal function for setting seccomp filter > * @flags: flags to change filter behavior > @@ -847,6 +1020,8 @@ static long seccomp_set_mode_filter(unsigned int flags, > const unsigned long seccomp_mode = SECCOMP_MODE_FILTER; > struct seccomp_filter *prepared = NULL; > long ret = -EINVAL; > + int listener = 0; > + struct file *listener_f = NULL; > > /* Validate flags. */ > if (flags & ~SECCOMP_FILTER_FLAG_MASK) > @@ -857,13 +1032,28 @@ static long seccomp_set_mode_filter(unsigned int flags, > if (IS_ERR(prepared)) > return PTR_ERR(prepared); > > + if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) { > + listener = get_unused_fd_flags(O_RDWR); > + if (listener < 0) { > + ret = listener; > + goto out_free; > + } > + > + listener_f = init_listener(current, prepared); > + if (IS_ERR(listener_f)) { > + put_unused_fd(listener); > + ret = PTR_ERR(listener_f); > + goto out_free; > + } > + } > + > /* > * Make sure we cannot change seccomp or nnp state via TSYNC > * while another thread is in the middle of calling exec. > */ > if (flags & SECCOMP_FILTER_FLAG_TSYNC && > mutex_lock_killable(¤t->signal->cred_guard_mutex)) > - goto out_free; > + goto out_put_fd; > > spin_lock_irq(¤t->sighand->siglock); > > @@ -881,6 +1071,16 @@ static long seccomp_set_mode_filter(unsigned int flags, > spin_unlock_irq(¤t->sighand->siglock); > if (flags & SECCOMP_FILTER_FLAG_TSYNC) > mutex_unlock(¤t->signal->cred_guard_mutex); > +out_put_fd: > + if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) { > + if (ret < 0) { > + fput(listener_f); > + put_unused_fd(listener); > + } else { > + fd_install(listener, listener_f); > + ret = listener; > + } > + } > out_free: > seccomp_filter_free(prepared); > return ret; > @@ -909,6 +1109,9 @@ static long seccomp_get_action_avail(const char __user *uaction) > case SECCOMP_RET_LOG: > case SECCOMP_RET_ALLOW: > break; > + case SECCOMP_RET_USER_NOTIF: > + if (IS_ENABLED(CONFIG_SECCOMP_USER_NOTIFICATION)) > + break; > default: > return -EOPNOTSUPP; > } > @@ -1105,6 +1308,7 @@ long seccomp_get_metadata(struct task_struct *task, > #define SECCOMP_RET_KILL_THREAD_NAME "kill_thread" > #define SECCOMP_RET_TRAP_NAME "trap" > #define SECCOMP_RET_ERRNO_NAME "errno" > +#define SECCOMP_RET_USER_NOTIF_NAME "user_notif" > #define SECCOMP_RET_TRACE_NAME "trace" > #define SECCOMP_RET_LOG_NAME "log" > #define SECCOMP_RET_ALLOW_NAME "allow" > @@ -1114,6 +1318,7 @@ static const char seccomp_actions_avail[] = > SECCOMP_RET_KILL_THREAD_NAME " " > SECCOMP_RET_TRAP_NAME " " > SECCOMP_RET_ERRNO_NAME " " > + SECCOMP_RET_USER_NOTIF_NAME " " > SECCOMP_RET_TRACE_NAME " " > SECCOMP_RET_LOG_NAME " " > SECCOMP_RET_ALLOW_NAME; > @@ -1131,6 +1336,7 @@ static const struct seccomp_log_name seccomp_log_names[] = { > { SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME }, > { SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME }, > { SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME }, > + { SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME }, > { } > }; > > @@ -1279,3 +1485,195 @@ static int __init seccomp_sysctl_init(void) > device_initcall(seccomp_sysctl_init) > > #endif /* CONFIG_SYSCTL */ > + > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION > +static int seccomp_notify_release(struct inode *inode, struct file *file) > +{ > + struct seccomp_filter *filter = file->private_data; > + struct seccomp_knotif *knotif; > + > + mutex_lock(&filter->notify_lock); > + > + /* > + * If this file is being closed because e.g. the task who owned it > + * died, let's wake everyone up who was waiting on us. > + */ > + list_for_each_entry(knotif, &filter->notifications, list) { > + if (knotif->state == SECCOMP_NOTIFY_REPLIED) > + continue; > + > + knotif->state = SECCOMP_NOTIFY_REPLIED; > + knotif->error = -ENOSYS; > + knotif->val = 0; > + > + complete(&knotif->ready); > + } > + > + filter->has_listener = false; > + mutex_unlock(&filter->notify_lock); > + __put_seccomp_filter(filter); > + return 0; > +} > + > +static ssize_t seccomp_notify_read(struct file *f, char __user *buf, > + size_t size, loff_t *ppos) > +{ > + struct seccomp_filter *filter = f->private_data; > + struct seccomp_knotif *knotif = NULL, *cur; > + struct seccomp_notif unotif; > + ssize_t ret; > + > + /* No offset reads. */ > + if (*ppos != 0) > + return -EINVAL; > + > + ret = down_interruptible(&filter->request); > + if (ret < 0) > + return ret; > + > + mutex_lock(&filter->notify_lock); > + list_for_each_entry(cur, &filter->notifications, list) { > + if (cur->state == SECCOMP_NOTIFY_INIT) { > + knotif = cur; > + break; > + } > + } > + > + /* > + * If we didn't find a notification, it could be that the task was > + * interrupted between the time we were woken and when we were able to > + * acquire the rw lock. Should we retry here or just -ENOENT? -ENOENT > + * for now. > + */ > + if (!knotif) { > + ret = -ENOENT; > + goto out; > + } > + > + unotif.id = knotif->id; > + unotif.pid = knotif->pid; > + unotif.data = *(knotif->data); > + > + size = min_t(size_t, size, sizeof(struct seccomp_notif)); > + if (copy_to_user(buf, &unotif, size)) { > + ret = -EFAULT; > + goto out; > + } > + > + ret = sizeof(unotif); > + knotif->state = SECCOMP_NOTIFY_SENT; > + > +out: > + mutex_unlock(&filter->notify_lock); > + return ret; > +} > + > +static ssize_t seccomp_notify_write(struct file *file, const char __user *buf, > + size_t size, loff_t *ppos) > +{ > + struct seccomp_filter *filter = file->private_data; > + struct seccomp_notif_resp resp = {}; > + struct seccomp_knotif *knotif = NULL; > + ssize_t ret = -EINVAL; > + > + /* No partial writes. */ > + if (*ppos != 0) > + return -EINVAL; > + > + size = min_t(size_t, size, sizeof(resp)); > + if (copy_from_user(&resp, buf, size)) > + return -EFAULT; > + > + ret = mutex_lock_interruptible(&filter->notify_lock); > + if (ret < 0) > + return ret; > + > + list_for_each_entry(knotif, &filter->notifications, list) { > + if (knotif->id == resp.id) > + break; > + } > + > + if (!knotif || knotif->id != resp.id) { > + ret = -EINVAL; > + goto out; > + } > + > + /* Allow exactly one reply. */ > + if (knotif->state != SECCOMP_NOTIFY_SENT) { > + ret = -EINVAL; > + goto out; > + } > + > + ret = size; > + knotif->state = SECCOMP_NOTIFY_REPLIED; > + knotif->error = resp.error; > + knotif->val = resp.val; > + complete(&knotif->ready); > +out: > + mutex_unlock(&filter->notify_lock); > + return ret; > +} > + > +static __poll_t seccomp_notify_poll(struct file *file, > + struct poll_table_struct *poll_tab) > +{ > + struct seccomp_filter *filter = file->private_data; > + __poll_t ret = 0; > + struct seccomp_knotif *cur; > + > + ret = mutex_lock_interruptible(&filter->notify_lock); > + if (ret < 0) > + return ret; > + > + list_for_each_entry(cur, &filter->notifications, list) { > + if (cur->state == SECCOMP_NOTIFY_INIT) > + ret |= EPOLLIN | EPOLLRDNORM; > + if (cur->state == SECCOMP_NOTIFY_SENT) > + ret |= EPOLLOUT | EPOLLWRNORM; > + } > + > + mutex_unlock(&filter->notify_lock); > + > + return ret; > +} > + > +static const struct file_operations seccomp_notify_ops = { > + .read = seccomp_notify_read, > + .write = seccomp_notify_write, > + .poll = seccomp_notify_poll, > + .release = seccomp_notify_release, > +}; > + > +static struct file *init_listener(struct task_struct *task, > + struct seccomp_filter *filter) > +{ > + struct file *ret = ERR_PTR(-EBUSY); > + struct seccomp_filter *cur; > + bool have_listener = false; > + > + for (cur = task->seccomp.filter; cur; cur = cur->prev) { > + mutex_lock(&cur->notify_lock); > + if (cur->has_listener) > + have_listener = true; > + } > + > + if (have_listener) > + goto out; > + > + ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops, > + filter, O_RDWR); > + if (IS_ERR(ret)) > + goto out; > + > + > + /* The file has a reference to it now */ > + __get_seccomp_filter(filter); > + filter->has_listener = true; > + > +out: > + for (cur = task->seccomp.filter; cur; cur = cur->prev) > + mutex_unlock(&cur->notify_lock); > + > + return ret; > +} > +#endif > diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c > index 168c66d74fc5..bb96df66222f 100644 > --- a/tools/testing/selftests/seccomp/seccomp_bpf.c > +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c > @@ -40,10 +40,12 @@ > #include <sys/fcntl.h> > #include <sys/mman.h> > #include <sys/times.h> > +#include <sys/socket.h> > > #define _GNU_SOURCE > #include <unistd.h> > #include <sys/syscall.h> > +#include <poll.h> > > #include "../kselftest_harness.h" > > @@ -150,6 +152,24 @@ struct seccomp_metadata { > }; > #endif > > +#ifndef SECCOMP_FILTER_FLAG_GET_LISTENER > +#define SECCOMP_FILTER_FLAG_GET_LISTENER 4 > + > +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U > + > +struct seccomp_notif { > + __u64 id; > + pid_t pid; > + struct seccomp_data data; > +}; > + > +struct seccomp_notif_resp { > + __u64 id; > + __s32 error; > + __s64 val; > +}; > +#endif > + > #ifndef seccomp > int seccomp(unsigned int op, unsigned int flags, void *args) > { > @@ -2072,7 +2092,8 @@ TEST(seccomp_syscall_mode_lock) > TEST(detect_seccomp_filter_flags) > { > unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC, > - SECCOMP_FILTER_FLAG_LOG }; > + SECCOMP_FILTER_FLAG_LOG, > + SECCOMP_FILTER_FLAG_GET_LISTENER }; > unsigned int flag, all_flags; > int i; > long ret; > @@ -2917,6 +2938,164 @@ TEST(get_metadata) > ASSERT_EQ(0, kill(pid, SIGKILL)); > } > > +static int user_trap_syscall(int nr, unsigned int flags) > +{ > + struct sock_filter filter[] = { > + BPF_STMT(BPF_LD+BPF_W+BPF_ABS, > + offsetof(struct seccomp_data, nr)), > + BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1), > + BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF), > + BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW), > + }; > + > + struct sock_fprog prog = { > + .len = (unsigned short)ARRAY_SIZE(filter), > + .filter = filter, > + }; > + > + return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog); > +} > + > +static int read_notif(int listener, struct seccomp_notif *req) > +{ > + int ret; > + > + do { > + errno = 0; > + ret = read(listener, req, sizeof(*req)); > + } while (ret == -1 && errno == ENOENT); > + return ret; > +} > + > +static void signal_handler(int signal) > +{ > +} > + > +#define USER_NOTIF_MAGIC 116983961184613L > +TEST(get_user_notification_syscall) > +{ > + pid_t pid; > + long ret; > + int status, listener; > + struct seccomp_notif req = {}; > + struct seccomp_notif_resp resp = {}; > + struct pollfd pollfd; > + > + pid = fork(); > + ASSERT_GE(pid, 0); > + > + /* Check that we get -ENOSYS with no listener attached */ > + if (pid == 0) { > + if (user_trap_syscall(__NR_getpid, 0) < 0) > + exit(1); > + ret = syscall(__NR_getpid); > + exit(ret >= 0 || errno != ENOSYS); > + } > + > + EXPECT_EQ(waitpid(pid, &status, 0), pid); > + EXPECT_EQ(true, WIFEXITED(status)); > + EXPECT_EQ(0, WEXITSTATUS(status)); > + > + /* Check that the basic notification machinery works */ > + listener = user_trap_syscall(__NR_getpid, > + SECCOMP_FILTER_FLAG_GET_LISTENER); > + EXPECT_GE(listener, 0); > + > + /* Installing a second listener in the chain should EBUSY */ > + EXPECT_EQ(user_trap_syscall(__NR_getpid, > + SECCOMP_FILTER_FLAG_GET_LISTENER), > + -1); > + EXPECT_EQ(errno, EBUSY); > + > + pid = fork(); > + ASSERT_GE(pid, 0); > + > + if (pid == 0) { > + ret = syscall(__NR_getpid); > + exit(ret != USER_NOTIF_MAGIC); > + } > + > + EXPECT_EQ(read(listener, &req, sizeof(req)), sizeof(req)); > + > + pollfd.fd = listener; > + pollfd.events = POLLIN | POLLOUT; > + > + EXPECT_GT(poll(&pollfd, 1, -1), 0); > + EXPECT_EQ(pollfd.revents, POLLOUT); > + > + EXPECT_EQ(req.data.nr, __NR_getpid); > + > + resp.id = req.id; > + resp.error = 0; > + resp.val = USER_NOTIF_MAGIC; > + > + EXPECT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp)); > + > + EXPECT_EQ(waitpid(pid, &status, 0), pid); > + EXPECT_EQ(true, WIFEXITED(status)); > + EXPECT_EQ(0, WEXITSTATUS(status)); > + > + /* > + * Check that nothing bad happens when we kill the task in the middle > + * of a syscall. > + */ > + pid = fork(); > + ASSERT_GE(pid, 0); > + > + if (pid == 0) { > + ret = syscall(__NR_getpid); > + exit(ret != USER_NOTIF_MAGIC); > + } > + > + ret = read(listener, &req, sizeof(req)); > + EXPECT_EQ(ret, sizeof(req)); > + > + EXPECT_EQ(kill(pid, SIGKILL), 0); > + EXPECT_EQ(waitpid(pid, NULL, 0), pid); > + > + resp.id = req.id; > + ret = write(listener, &resp, sizeof(resp)); > + EXPECT_EQ(ret, -1); > + EXPECT_EQ(errno, EINVAL); > + > + /* > + * Check that we get another notification about a signal in the middle > + * of a syscall. > + */ > + pid = fork(); > + ASSERT_GE(pid, 0); > + > + if (pid == 0) { > + if (signal(SIGUSR1, signal_handler) == SIG_ERR) { > + perror("signal"); > + exit(1); > + } > + ret = syscall(__NR_getpid); > + exit(ret != USER_NOTIF_MAGIC); > + } > + > + ret = read_notif(listener, &req); > + EXPECT_EQ(ret, sizeof(req)); > + EXPECT_EQ(errno, 0); > + > + EXPECT_EQ(kill(pid, SIGUSR1), 0); > + > + ret = read_notif(listener, &req); > + EXPECT_EQ(ret, sizeof(req)); > + EXPECT_EQ(errno, 0); > + > + resp.id = req.id; > + ret = write(listener, &resp, sizeof(resp)); > + EXPECT_EQ(ret, sizeof(resp)); > + EXPECT_EQ(errno, 0); > + > + EXPECT_EQ(waitpid(pid, &status, 0), pid); > + EXPECT_EQ(true, WIFEXITED(status)); > + EXPECT_EQ(0, WEXITSTATUS(status)); > + > + close(listener); > +} > + > /* > * TODO: > * - add microbenchmarks > -- > 2.17.0 > > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/4] seccomp: add a return code to trap to userspace 2018-05-18 14:04 ` Christian Brauner @ 2018-05-18 15:21 ` Tycho Andersen 0 siblings, 0 replies; 20+ messages in thread From: Tycho Andersen @ 2018-05-18 15:21 UTC (permalink / raw) To: Christian Brauner Cc: linux-kernel, containers, Tobin C . Harding, Kees Cook, Akihiro Suda, Oleg Nesterov, Andy Lutomirski, Eric W . Biederman, Christian Brauner, Tyler Hicks On Fri, May 18, 2018 at 04:04:16PM +0200, Christian Brauner wrote: > On Thu, May 17, 2018 at 09:12:15AM -0600, Tycho Andersen wrote: > > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION > > +static u64 seccomp_next_notify_id(struct seccomp_filter *filter) > > +{ > > + u64 ret = filter->next_id; > > + > > + /* Note: overflow is ok here, the id just needs to be unique */ > > + filter->next_id++; > > + > > + return ret; > > +} > > Nit: Depending on how averse people are to relying on side-effects this > could be simplified to: > > static inline u64 seccomp_next_notify_id(struct seccomp_filter *filter) > { > /* Note: Overflow is ok. The id just needs to be unique. */ > return filter->next_id++; > } Oh, yes, definitely. I think this is leftover from when this function worked a different way. > > + > > +static void seccomp_do_user_notification(int this_syscall, > > + struct seccomp_filter *match, > > + const struct seccomp_data *sd) > > +{ > > + int err; > > + long ret = 0; > > + struct seccomp_knotif n = {}; > > + > > + mutex_lock(&match->notify_lock); > > + if (!match->has_listener) { > > + err = -ENOSYS; > > + goto out; > > + } > > Nit: > > err = -ENOSYS; > mutex_lock(&match->notify_lock); > if (!match->has_listener) > goto out; > > looks cleaner to me or you do the err initalization at the top of the > function. :) Ok :) > > + > > + n.pid = current->pid; > > + n.state = SECCOMP_NOTIFY_INIT; > > + n.data = sd; > > + n.id = seccomp_next_notify_id(match); > > + init_completion(&n.ready); > > + > > + list_add(&n.list, &match->notifications); > > + > > + mutex_unlock(&match->notify_lock); > > + up(&match->request); > > + > > + err = wait_for_completion_interruptible(&n.ready); > > + mutex_lock(&match->notify_lock); > > + > > + /* > > + * Here it's possible we got a signal and then had to wait on the mutex > > + * while the reply was sent, so let's be sure there wasn't a response > > + * in the meantime. > > + */ > > + if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) { > > + /* > > + * We got a signal. Let's tell userspace about it (potentially > > + * again, if we had already notified them about the first one). > > + */ > > + if (n.state == SECCOMP_NOTIFY_SENT) { > > + n.state = SECCOMP_NOTIFY_INIT; > > + up(&match->request); > > + } > > + mutex_unlock(&match->notify_lock); > > + err = wait_for_completion_killable(&n.ready); > > + mutex_lock(&match->notify_lock); > > + if (err < 0) > > + goto remove_list; > > + } > > + > > + ret = n.val; > > + err = n.error; > > + > > + WARN(n.state != SECCOMP_NOTIFY_REPLIED, > > + "notified about write complete when state is not write"); > > Nit: That message seems a little cryptic. Perhaps we can just drop it. It's just a sanity check, but given the tests above, it doesn't seem likely. > > + > > +remove_list: > > + list_del(&n.list); > > +out: > > + mutex_unlock(&match->notify_lock); > > + syscall_set_return_value(current, task_pt_regs(current), > > + err, ret); > > +} > > +#else > > +static void seccomp_do_user_notification(int this_syscall, > > + u32 action, > > + struct seccomp_filter *match, > > + const struct seccomp_data *sd) > > +{ > > + WARN(1, "user notification received, but disabled"); > > Nit: "received unexpected user notification" might be clearer Yes, I wonder if we shouldn't just drop this too -- it's not a kernel bug, but a userspace bug that they're using features that aren't enabled. We could enhance the verifier with a static check for BPF_RET | BPF_K == SECCOMPO_RET_USER_NOTIF and reject such programs if user notification isn't enabled. Of course, it wouldn't handle the dynamic case, but it might be useful. Tycho ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/4] seccomp: add a return code to trap to userspace 2018-05-17 15:12 ` [PATCH v2 1/4] seccomp: add a return code to " Tycho Andersen 2018-05-17 15:33 ` Oleg Nesterov 2018-05-18 14:04 ` Christian Brauner @ 2018-05-19 0:14 ` kbuild test robot 2018-05-19 5:01 ` kbuild test robot 3 siblings, 0 replies; 20+ messages in thread From: kbuild test robot @ 2018-05-19 0:14 UTC (permalink / raw) To: Tycho Andersen Cc: kbuild-all, linux-kernel, containers, Kees Cook, Andy Lutomirski, Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda, Tobin C . Harding, Tycho Andersen [-- Attachment #1: Type: text/plain, Size: 15430 bytes --] Hi Tycho, I love your patch! Yet something to improve: [auto build test ERROR on linus/master] [also build test ERROR on v4.17-rc5] [cannot apply to next-20180517] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Tycho-Andersen/seccomp-trap-to-userspace/20180519-071527 config: x86_64-randconfig-x010-201819 (attached as .config) compiler: gcc-7 (Debian 7.3.0-16) 7.3.0 reproduce: # save the attached .config to linux build tree make ARCH=x86_64 All error/warnings (new ones prefixed by >>): kernel/seccomp.c: In function '__seccomp_filter': >> kernel/seccomp.c:891:46: warning: passing argument 2 of 'seccomp_do_user_notification' makes integer from pointer without a cast [-Wint-conversion] seccomp_do_user_notification(this_syscall, match, sd); ^~~~~ kernel/seccomp.c:802:13: note: expected 'u32 {aka unsigned int}' but argument is of type 'struct seccomp_filter *' static void seccomp_do_user_notification(int this_syscall, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ >> kernel/seccomp.c:891:53: error: passing argument 3 of 'seccomp_do_user_notification' from incompatible pointer type [-Werror=incompatible-pointer-types] seccomp_do_user_notification(this_syscall, match, sd); ^~ kernel/seccomp.c:802:13: note: expected 'struct seccomp_filter *' but argument is of type 'const struct seccomp_data *' static void seccomp_do_user_notification(int this_syscall, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ >> kernel/seccomp.c:891:3: error: too few arguments to function 'seccomp_do_user_notification' seccomp_do_user_notification(this_syscall, match, sd); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ kernel/seccomp.c:802:13: note: declared here static void seccomp_do_user_notification(int this_syscall, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ kernel/seccomp.c: In function 'seccomp_set_mode_filter': >> kernel/seccomp.c:1036:14: error: implicit declaration of function 'get_unused_fd_flags'; did you mean 'getname_flags'? [-Werror=implicit-function-declaration] listener = get_unused_fd_flags(O_RDWR); ^~~~~~~~~~~~~~~~~~~ getname_flags >> kernel/seccomp.c:1042:16: error: implicit declaration of function 'init_listener'; did you mean 'init_llist_head'? [-Werror=implicit-function-declaration] listener_f = init_listener(current, prepared); ^~~~~~~~~~~~~ init_llist_head >> kernel/seccomp.c:1042:14: warning: assignment makes pointer from integer without a cast [-Wint-conversion] listener_f = init_listener(current, prepared); ^ >> kernel/seccomp.c:1044:4: error: implicit declaration of function 'put_unused_fd'; did you mean 'put_user_ns'? [-Werror=implicit-function-declaration] put_unused_fd(listener); ^~~~~~~~~~~~~ put_user_ns >> kernel/seccomp.c:1077:4: error: implicit declaration of function 'fput'; did you mean 'iput'? [-Werror=implicit-function-declaration] fput(listener_f); ^~~~ iput >> kernel/seccomp.c:1080:4: error: implicit declaration of function 'fd_install'; did you mean 'fs_initcall'? [-Werror=implicit-function-declaration] fd_install(listener, listener_f); ^~~~~~~~~~ fs_initcall cc1: some warnings being treated as errors vim +/seccomp_do_user_notification +891 kernel/seccomp.c 738 739 static void seccomp_do_user_notification(int this_syscall, 740 struct seccomp_filter *match, 741 const struct seccomp_data *sd) 742 { 743 int err; 744 long ret = 0; 745 struct seccomp_knotif n = {}; 746 747 mutex_lock(&match->notify_lock); 748 if (!match->has_listener) { 749 err = -ENOSYS; 750 goto out; 751 } 752 753 n.pid = current->pid; 754 n.state = SECCOMP_NOTIFY_INIT; 755 n.data = sd; 756 n.id = seccomp_next_notify_id(match); 757 init_completion(&n.ready); 758 759 list_add(&n.list, &match->notifications); 760 761 mutex_unlock(&match->notify_lock); 762 up(&match->request); 763 764 err = wait_for_completion_interruptible(&n.ready); 765 mutex_lock(&match->notify_lock); 766 767 /* 768 * Here it's possible we got a signal and then had to wait on the mutex 769 * while the reply was sent, so let's be sure there wasn't a response 770 * in the meantime. 771 */ 772 if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) { 773 /* 774 * We got a signal. Let's tell userspace about it (potentially 775 * again, if we had already notified them about the first one). 776 */ 777 if (n.state == SECCOMP_NOTIFY_SENT) { 778 n.state = SECCOMP_NOTIFY_INIT; 779 up(&match->request); 780 } 781 mutex_unlock(&match->notify_lock); 782 err = wait_for_completion_killable(&n.ready); 783 mutex_lock(&match->notify_lock); 784 if (err < 0) 785 goto remove_list; 786 } 787 788 ret = n.val; 789 err = n.error; 790 791 WARN(n.state != SECCOMP_NOTIFY_REPLIED, 792 "notified about write complete when state is not write"); 793 794 remove_list: 795 list_del(&n.list); 796 out: 797 mutex_unlock(&match->notify_lock); 798 syscall_set_return_value(current, task_pt_regs(current), 799 err, ret); 800 } 801 #else > 802 static void seccomp_do_user_notification(int this_syscall, 803 u32 action, 804 struct seccomp_filter *match, 805 const struct seccomp_data *sd) 806 { 807 WARN(1, "user notification received, but disabled"); 808 seccomp_log(this_syscall, SIGSYS, action, true); 809 do_exit(SIGSYS); 810 } 811 #endif 812 813 #ifdef CONFIG_SECCOMP_FILTER 814 static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, 815 const bool recheck_after_trace) 816 { 817 u32 filter_ret, action; 818 struct seccomp_filter *match = NULL; 819 int data; 820 821 /* 822 * Make sure that any changes to mode from another thread have 823 * been seen after TIF_SECCOMP was seen. 824 */ 825 rmb(); 826 827 filter_ret = seccomp_run_filters(sd, &match); 828 data = filter_ret & SECCOMP_RET_DATA; 829 action = filter_ret & SECCOMP_RET_ACTION_FULL; 830 831 switch (action) { 832 case SECCOMP_RET_ERRNO: 833 /* Set low-order bits as an errno, capped at MAX_ERRNO. */ 834 if (data > MAX_ERRNO) 835 data = MAX_ERRNO; 836 syscall_set_return_value(current, task_pt_regs(current), 837 -data, 0); 838 goto skip; 839 840 case SECCOMP_RET_TRAP: 841 /* Show the handler the original registers. */ 842 syscall_rollback(current, task_pt_regs(current)); 843 /* Let the filter pass back 16 bits of data. */ 844 seccomp_send_sigsys(this_syscall, data); 845 goto skip; 846 847 case SECCOMP_RET_TRACE: 848 /* We've been put in this state by the ptracer already. */ 849 if (recheck_after_trace) 850 return 0; 851 852 /* ENOSYS these calls if there is no tracer attached. */ 853 if (!ptrace_event_enabled(current, PTRACE_EVENT_SECCOMP)) { 854 syscall_set_return_value(current, 855 task_pt_regs(current), 856 -ENOSYS, 0); 857 goto skip; 858 } 859 860 /* Allow the BPF to provide the event message */ 861 ptrace_event(PTRACE_EVENT_SECCOMP, data); 862 /* 863 * The delivery of a fatal signal during event 864 * notification may silently skip tracer notification, 865 * which could leave us with a potentially unmodified 866 * syscall that the tracer would have liked to have 867 * changed. Since the process is about to die, we just 868 * force the syscall to be skipped and let the signal 869 * kill the process and correctly handle any tracer exit 870 * notifications. 871 */ 872 if (fatal_signal_pending(current)) 873 goto skip; 874 /* Check if the tracer forced the syscall to be skipped. */ 875 this_syscall = syscall_get_nr(current, task_pt_regs(current)); 876 if (this_syscall < 0) 877 goto skip; 878 879 /* 880 * Recheck the syscall, since it may have changed. This 881 * intentionally uses a NULL struct seccomp_data to force 882 * a reload of all registers. This does not goto skip since 883 * a skip would have already been reported. 884 */ 885 if (__seccomp_filter(this_syscall, NULL, true)) 886 return -1; 887 888 return 0; 889 890 case SECCOMP_RET_USER_NOTIF: > 891 seccomp_do_user_notification(this_syscall, match, sd); 892 goto skip; 893 case SECCOMP_RET_LOG: 894 seccomp_log(this_syscall, 0, action, true); 895 return 0; 896 897 case SECCOMP_RET_ALLOW: 898 /* 899 * Note that the "match" filter will always be NULL for 900 * this action since SECCOMP_RET_ALLOW is the starting 901 * state in seccomp_run_filters(). 902 */ 903 return 0; 904 905 case SECCOMP_RET_KILL_THREAD: 906 case SECCOMP_RET_KILL_PROCESS: 907 default: 908 seccomp_log(this_syscall, SIGSYS, action, true); 909 /* Dump core only if this is the last remaining thread. */ 910 if (action == SECCOMP_RET_KILL_PROCESS || 911 get_nr_threads(current) == 1) { 912 siginfo_t info; 913 914 /* Show the original registers in the dump. */ 915 syscall_rollback(current, task_pt_regs(current)); 916 /* Trigger a manual coredump since do_exit skips it. */ 917 seccomp_init_siginfo(&info, this_syscall, data); 918 do_coredump(&info); 919 } 920 if (action == SECCOMP_RET_KILL_PROCESS) 921 do_group_exit(SIGSYS); 922 else 923 do_exit(SIGSYS); 924 } 925 926 unreachable(); 927 928 skip: 929 seccomp_log(this_syscall, 0, action, match ? match->log : false); 930 return -1; 931 } 932 #else 933 static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, 934 const bool recheck_after_trace) 935 { 936 BUG(); 937 } 938 #endif 939 940 int __secure_computing(const struct seccomp_data *sd) 941 { 942 int mode = current->seccomp.mode; 943 int this_syscall; 944 945 if (IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) && 946 unlikely(current->ptrace & PT_SUSPEND_SECCOMP)) 947 return 0; 948 949 this_syscall = sd ? sd->nr : 950 syscall_get_nr(current, task_pt_regs(current)); 951 952 switch (mode) { 953 case SECCOMP_MODE_STRICT: 954 __secure_computing_strict(this_syscall); /* may call do_exit */ 955 return 0; 956 case SECCOMP_MODE_FILTER: 957 return __seccomp_filter(this_syscall, sd, false); 958 default: 959 BUG(); 960 } 961 } 962 #endif /* CONFIG_HAVE_ARCH_SECCOMP_FILTER */ 963 964 long prctl_get_seccomp(void) 965 { 966 return current->seccomp.mode; 967 } 968 969 /** 970 * seccomp_set_mode_strict: internal function for setting strict seccomp 971 * 972 * Once current->seccomp.mode is non-zero, it may not be changed. 973 * 974 * Returns 0 on success or -EINVAL on failure. 975 */ 976 static long seccomp_set_mode_strict(void) 977 { 978 const unsigned long seccomp_mode = SECCOMP_MODE_STRICT; 979 long ret = -EINVAL; 980 981 spin_lock_irq(¤t->sighand->siglock); 982 983 if (!seccomp_may_assign_mode(seccomp_mode)) 984 goto out; 985 986 #ifdef TIF_NOTSC 987 disable_TSC(); 988 #endif 989 seccomp_assign_mode(current, seccomp_mode); 990 ret = 0; 991 992 out: 993 spin_unlock_irq(¤t->sighand->siglock); 994 995 return ret; 996 } 997 998 #ifdef CONFIG_SECCOMP_FILTER 999 #ifdef CONFIG_SECCOMP_USER_NOTIFICATION 1000 static struct file *init_listener(struct task_struct *, 1001 struct seccomp_filter *); 1002 #endif 1003 1004 /** 1005 * seccomp_set_mode_filter: internal function for setting seccomp filter 1006 * @flags: flags to change filter behavior 1007 * @filter: struct sock_fprog containing filter 1008 * 1009 * This function may be called repeatedly to install additional filters. 1010 * Every filter successfully installed will be evaluated (in reverse order) 1011 * for each system call the task makes. 1012 * 1013 * Once current->seccomp.mode is non-zero, it may not be changed. 1014 * 1015 * Returns 0 on success or -EINVAL on failure. 1016 */ 1017 static long seccomp_set_mode_filter(unsigned int flags, 1018 const char __user *filter) 1019 { 1020 const unsigned long seccomp_mode = SECCOMP_MODE_FILTER; 1021 struct seccomp_filter *prepared = NULL; 1022 long ret = -EINVAL; 1023 int listener = 0; 1024 struct file *listener_f = NULL; 1025 1026 /* Validate flags. */ 1027 if (flags & ~SECCOMP_FILTER_FLAG_MASK) 1028 return -EINVAL; 1029 1030 /* Prepare the new filter before holding any locks. */ 1031 prepared = seccomp_prepare_user_filter(filter); 1032 if (IS_ERR(prepared)) 1033 return PTR_ERR(prepared); 1034 1035 if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) { > 1036 listener = get_unused_fd_flags(O_RDWR); 1037 if (listener < 0) { 1038 ret = listener; 1039 goto out_free; 1040 } 1041 > 1042 listener_f = init_listener(current, prepared); 1043 if (IS_ERR(listener_f)) { > 1044 put_unused_fd(listener); 1045 ret = PTR_ERR(listener_f); 1046 goto out_free; 1047 } 1048 } 1049 1050 /* 1051 * Make sure we cannot change seccomp or nnp state via TSYNC 1052 * while another thread is in the middle of calling exec. 1053 */ 1054 if (flags & SECCOMP_FILTER_FLAG_TSYNC && 1055 mutex_lock_killable(¤t->signal->cred_guard_mutex)) 1056 goto out_put_fd; 1057 1058 spin_lock_irq(¤t->sighand->siglock); 1059 1060 if (!seccomp_may_assign_mode(seccomp_mode)) 1061 goto out; 1062 1063 ret = seccomp_attach_filter(flags, prepared); 1064 if (ret) 1065 goto out; 1066 /* Do not free the successfully attached filter. */ 1067 prepared = NULL; 1068 1069 seccomp_assign_mode(current, seccomp_mode); 1070 out: 1071 spin_unlock_irq(¤t->sighand->siglock); 1072 if (flags & SECCOMP_FILTER_FLAG_TSYNC) 1073 mutex_unlock(¤t->signal->cred_guard_mutex); 1074 out_put_fd: 1075 if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) { 1076 if (ret < 0) { > 1077 fput(listener_f); 1078 put_unused_fd(listener); 1079 } else { > 1080 fd_install(listener, listener_f); 1081 ret = listener; 1082 } 1083 } 1084 out_free: 1085 seccomp_filter_free(prepared); 1086 return ret; 1087 } 1088 #else 1089 static inline long seccomp_set_mode_filter(unsigned int flags, 1090 const char __user *filter) 1091 { 1092 return -EINVAL; 1093 } 1094 #endif 1095 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 26139 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/4] seccomp: add a return code to trap to userspace 2018-05-17 15:12 ` [PATCH v2 1/4] seccomp: add a return code to " Tycho Andersen ` (2 preceding siblings ...) 2018-05-19 0:14 ` kbuild test robot @ 2018-05-19 5:01 ` kbuild test robot 2018-05-21 22:55 ` Tycho Andersen 3 siblings, 1 reply; 20+ messages in thread From: kbuild test robot @ 2018-05-19 5:01 UTC (permalink / raw) To: Tycho Andersen Cc: kbuild-all, linux-kernel, containers, Kees Cook, Andy Lutomirski, Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda, Tobin C . Harding, Tycho Andersen [-- Attachment #1: Type: text/plain, Size: 12534 bytes --] Hi Tycho, I love your patch! Yet something to improve: [auto build test ERROR on linus/master] [also build test ERROR on v4.17-rc5] [cannot apply to next-20180517] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Tycho-Andersen/seccomp-trap-to-userspace/20180519-071527 config: i386-randconfig-a1-05181545 (attached as .config) compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4 reproduce: # save the attached .config to linux build tree make ARCH=i386 All error/warnings (new ones prefixed by >>): kernel/seccomp.c: In function '__seccomp_filter': kernel/seccomp.c:891:46: warning: passing argument 2 of 'seccomp_do_user_notification' makes integer from pointer without a cast seccomp_do_user_notification(this_syscall, match, sd); ^ kernel/seccomp.c:802:13: note: expected 'u32' but argument is of type 'struct seccomp_filter *' static void seccomp_do_user_notification(int this_syscall, ^ >> kernel/seccomp.c:891:53: warning: passing argument 3 of 'seccomp_do_user_notification' from incompatible pointer type seccomp_do_user_notification(this_syscall, match, sd); ^ kernel/seccomp.c:802:13: note: expected 'struct seccomp_filter *' but argument is of type 'const struct seccomp_data *' static void seccomp_do_user_notification(int this_syscall, ^ kernel/seccomp.c:891:3: error: too few arguments to function 'seccomp_do_user_notification' seccomp_do_user_notification(this_syscall, match, sd); ^ kernel/seccomp.c:802:13: note: declared here static void seccomp_do_user_notification(int this_syscall, ^ kernel/seccomp.c: In function 'seccomp_set_mode_filter': >> kernel/seccomp.c:1036:3: error: implicit declaration of function 'get_unused_fd_flags' [-Werror=implicit-function-declaration] listener = get_unused_fd_flags(O_RDWR); ^ >> kernel/seccomp.c:1042:3: error: implicit declaration of function 'init_listener' [-Werror=implicit-function-declaration] listener_f = init_listener(current, prepared); ^ kernel/seccomp.c:1042:14: warning: assignment makes pointer from integer without a cast listener_f = init_listener(current, prepared); ^ >> kernel/seccomp.c:1044:4: error: implicit declaration of function 'put_unused_fd' [-Werror=implicit-function-declaration] put_unused_fd(listener); ^ >> kernel/seccomp.c:1077:4: error: implicit declaration of function 'fput' [-Werror=implicit-function-declaration] fput(listener_f); ^ >> kernel/seccomp.c:1080:4: error: implicit declaration of function 'fd_install' [-Werror=implicit-function-declaration] fd_install(listener, listener_f); ^ cc1: some warnings being treated as errors vim +/get_unused_fd_flags +1036 kernel/seccomp.c 812 813 #ifdef CONFIG_SECCOMP_FILTER 814 static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, 815 const bool recheck_after_trace) 816 { 817 u32 filter_ret, action; 818 struct seccomp_filter *match = NULL; 819 int data; 820 821 /* 822 * Make sure that any changes to mode from another thread have 823 * been seen after TIF_SECCOMP was seen. 824 */ 825 rmb(); 826 827 filter_ret = seccomp_run_filters(sd, &match); 828 data = filter_ret & SECCOMP_RET_DATA; 829 action = filter_ret & SECCOMP_RET_ACTION_FULL; 830 831 switch (action) { 832 case SECCOMP_RET_ERRNO: 833 /* Set low-order bits as an errno, capped at MAX_ERRNO. */ 834 if (data > MAX_ERRNO) 835 data = MAX_ERRNO; 836 syscall_set_return_value(current, task_pt_regs(current), 837 -data, 0); 838 goto skip; 839 840 case SECCOMP_RET_TRAP: 841 /* Show the handler the original registers. */ 842 syscall_rollback(current, task_pt_regs(current)); 843 /* Let the filter pass back 16 bits of data. */ 844 seccomp_send_sigsys(this_syscall, data); 845 goto skip; 846 847 case SECCOMP_RET_TRACE: 848 /* We've been put in this state by the ptracer already. */ 849 if (recheck_after_trace) 850 return 0; 851 852 /* ENOSYS these calls if there is no tracer attached. */ 853 if (!ptrace_event_enabled(current, PTRACE_EVENT_SECCOMP)) { 854 syscall_set_return_value(current, 855 task_pt_regs(current), 856 -ENOSYS, 0); 857 goto skip; 858 } 859 860 /* Allow the BPF to provide the event message */ 861 ptrace_event(PTRACE_EVENT_SECCOMP, data); 862 /* 863 * The delivery of a fatal signal during event 864 * notification may silently skip tracer notification, 865 * which could leave us with a potentially unmodified 866 * syscall that the tracer would have liked to have 867 * changed. Since the process is about to die, we just 868 * force the syscall to be skipped and let the signal 869 * kill the process and correctly handle any tracer exit 870 * notifications. 871 */ 872 if (fatal_signal_pending(current)) 873 goto skip; 874 /* Check if the tracer forced the syscall to be skipped. */ 875 this_syscall = syscall_get_nr(current, task_pt_regs(current)); 876 if (this_syscall < 0) 877 goto skip; 878 879 /* 880 * Recheck the syscall, since it may have changed. This 881 * intentionally uses a NULL struct seccomp_data to force 882 * a reload of all registers. This does not goto skip since 883 * a skip would have already been reported. 884 */ 885 if (__seccomp_filter(this_syscall, NULL, true)) 886 return -1; 887 888 return 0; 889 890 case SECCOMP_RET_USER_NOTIF: > 891 seccomp_do_user_notification(this_syscall, match, sd); 892 goto skip; 893 case SECCOMP_RET_LOG: 894 seccomp_log(this_syscall, 0, action, true); 895 return 0; 896 897 case SECCOMP_RET_ALLOW: 898 /* 899 * Note that the "match" filter will always be NULL for 900 * this action since SECCOMP_RET_ALLOW is the starting 901 * state in seccomp_run_filters(). 902 */ 903 return 0; 904 905 case SECCOMP_RET_KILL_THREAD: 906 case SECCOMP_RET_KILL_PROCESS: 907 default: 908 seccomp_log(this_syscall, SIGSYS, action, true); 909 /* Dump core only if this is the last remaining thread. */ 910 if (action == SECCOMP_RET_KILL_PROCESS || 911 get_nr_threads(current) == 1) { 912 siginfo_t info; 913 914 /* Show the original registers in the dump. */ 915 syscall_rollback(current, task_pt_regs(current)); 916 /* Trigger a manual coredump since do_exit skips it. */ 917 seccomp_init_siginfo(&info, this_syscall, data); 918 do_coredump(&info); 919 } 920 if (action == SECCOMP_RET_KILL_PROCESS) 921 do_group_exit(SIGSYS); 922 else 923 do_exit(SIGSYS); 924 } 925 926 unreachable(); 927 928 skip: 929 seccomp_log(this_syscall, 0, action, match ? match->log : false); 930 return -1; 931 } 932 #else 933 static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd, 934 const bool recheck_after_trace) 935 { 936 BUG(); 937 } 938 #endif 939 940 int __secure_computing(const struct seccomp_data *sd) 941 { 942 int mode = current->seccomp.mode; 943 int this_syscall; 944 945 if (IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) && 946 unlikely(current->ptrace & PT_SUSPEND_SECCOMP)) 947 return 0; 948 949 this_syscall = sd ? sd->nr : 950 syscall_get_nr(current, task_pt_regs(current)); 951 952 switch (mode) { 953 case SECCOMP_MODE_STRICT: 954 __secure_computing_strict(this_syscall); /* may call do_exit */ 955 return 0; 956 case SECCOMP_MODE_FILTER: 957 return __seccomp_filter(this_syscall, sd, false); 958 default: 959 BUG(); 960 } 961 } 962 #endif /* CONFIG_HAVE_ARCH_SECCOMP_FILTER */ 963 964 long prctl_get_seccomp(void) 965 { 966 return current->seccomp.mode; 967 } 968 969 /** 970 * seccomp_set_mode_strict: internal function for setting strict seccomp 971 * 972 * Once current->seccomp.mode is non-zero, it may not be changed. 973 * 974 * Returns 0 on success or -EINVAL on failure. 975 */ 976 static long seccomp_set_mode_strict(void) 977 { 978 const unsigned long seccomp_mode = SECCOMP_MODE_STRICT; 979 long ret = -EINVAL; 980 981 spin_lock_irq(¤t->sighand->siglock); 982 983 if (!seccomp_may_assign_mode(seccomp_mode)) 984 goto out; 985 986 #ifdef TIF_NOTSC 987 disable_TSC(); 988 #endif 989 seccomp_assign_mode(current, seccomp_mode); 990 ret = 0; 991 992 out: 993 spin_unlock_irq(¤t->sighand->siglock); 994 995 return ret; 996 } 997 998 #ifdef CONFIG_SECCOMP_FILTER 999 #ifdef CONFIG_SECCOMP_USER_NOTIFICATION 1000 static struct file *init_listener(struct task_struct *, 1001 struct seccomp_filter *); 1002 #endif 1003 1004 /** 1005 * seccomp_set_mode_filter: internal function for setting seccomp filter 1006 * @flags: flags to change filter behavior 1007 * @filter: struct sock_fprog containing filter 1008 * 1009 * This function may be called repeatedly to install additional filters. 1010 * Every filter successfully installed will be evaluated (in reverse order) 1011 * for each system call the task makes. 1012 * 1013 * Once current->seccomp.mode is non-zero, it may not be changed. 1014 * 1015 * Returns 0 on success or -EINVAL on failure. 1016 */ 1017 static long seccomp_set_mode_filter(unsigned int flags, 1018 const char __user *filter) 1019 { 1020 const unsigned long seccomp_mode = SECCOMP_MODE_FILTER; 1021 struct seccomp_filter *prepared = NULL; 1022 long ret = -EINVAL; 1023 int listener = 0; 1024 struct file *listener_f = NULL; 1025 1026 /* Validate flags. */ 1027 if (flags & ~SECCOMP_FILTER_FLAG_MASK) 1028 return -EINVAL; 1029 1030 /* Prepare the new filter before holding any locks. */ 1031 prepared = seccomp_prepare_user_filter(filter); 1032 if (IS_ERR(prepared)) 1033 return PTR_ERR(prepared); 1034 1035 if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) { > 1036 listener = get_unused_fd_flags(O_RDWR); 1037 if (listener < 0) { 1038 ret = listener; 1039 goto out_free; 1040 } 1041 > 1042 listener_f = init_listener(current, prepared); 1043 if (IS_ERR(listener_f)) { > 1044 put_unused_fd(listener); 1045 ret = PTR_ERR(listener_f); 1046 goto out_free; 1047 } 1048 } 1049 1050 /* 1051 * Make sure we cannot change seccomp or nnp state via TSYNC 1052 * while another thread is in the middle of calling exec. 1053 */ 1054 if (flags & SECCOMP_FILTER_FLAG_TSYNC && 1055 mutex_lock_killable(¤t->signal->cred_guard_mutex)) 1056 goto out_put_fd; 1057 1058 spin_lock_irq(¤t->sighand->siglock); 1059 1060 if (!seccomp_may_assign_mode(seccomp_mode)) 1061 goto out; 1062 1063 ret = seccomp_attach_filter(flags, prepared); 1064 if (ret) 1065 goto out; 1066 /* Do not free the successfully attached filter. */ 1067 prepared = NULL; 1068 1069 seccomp_assign_mode(current, seccomp_mode); 1070 out: 1071 spin_unlock_irq(¤t->sighand->siglock); 1072 if (flags & SECCOMP_FILTER_FLAG_TSYNC) 1073 mutex_unlock(¤t->signal->cred_guard_mutex); 1074 out_put_fd: 1075 if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) { 1076 if (ret < 0) { > 1077 fput(listener_f); 1078 put_unused_fd(listener); 1079 } else { > 1080 fd_install(listener, listener_f); 1081 ret = listener; 1082 } 1083 } 1084 out_free: 1085 seccomp_filter_free(prepared); 1086 return ret; 1087 } 1088 #else 1089 static inline long seccomp_set_mode_filter(unsigned int flags, 1090 const char __user *filter) 1091 { 1092 return -EINVAL; 1093 } 1094 #endif 1095 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 28767 bytes --] ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 1/4] seccomp: add a return code to trap to userspace 2018-05-19 5:01 ` kbuild test robot @ 2018-05-21 22:55 ` Tycho Andersen 0 siblings, 0 replies; 20+ messages in thread From: Tycho Andersen @ 2018-05-21 22:55 UTC (permalink / raw) To: kbuild test robot Cc: kbuild-all, linux-kernel, containers, Kees Cook, Andy Lutomirski, Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda, Tobin C . Harding On Sat, May 19, 2018 at 01:01:15PM +0800, kbuild test robot wrote: > Hi Tycho, > > I love your patch! Yet something to improve: Whoops, seems I forgot to compile the !CONFIG_SECCOMP_USER_NOTIFICATION case. Anyways, I've fixed this for v3. Tycho ^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH v2 2/4] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE 2018-05-17 15:12 [PATCH v2 0/4] seccomp trap to userspace Tycho Andersen 2018-05-17 15:12 ` [PATCH v2 1/4] seccomp: add a return code to " Tycho Andersen @ 2018-05-17 15:12 ` Tycho Andersen 2018-05-17 15:12 ` [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace Tycho Andersen ` (2 subsequent siblings) 4 siblings, 0 replies; 20+ messages in thread From: Tycho Andersen @ 2018-05-17 15:12 UTC (permalink / raw) To: linux-kernel, containers Cc: Kees Cook, Andy Lutomirski, Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda, Tobin C . Harding, Tycho Andersen In the next commit we'll use this same mnemonic to get a listener for the nth filter, so we need it available outside of CHECKPOINT_RESTORE. This is slightly looser than necessary, because it really could be CHECKPOINT_RESTORE || USER_NOTIFICATION, but it's declared static and this complicates the code less, so hopefully it's ok. Signed-off-by: Tycho Andersen <tycho@tycho.ws> CC: Kees Cook <keescook@chromium.org> CC: Andy Lutomirski <luto@amacapital.net> CC: Oleg Nesterov <oleg@redhat.com> CC: Eric W. Biederman <ebiederm@xmission.com> CC: "Serge E. Hallyn" <serge@hallyn.com> CC: Christian Brauner <christian.brauner@ubuntu.com> CC: Tyler Hicks <tyhicks@canonical.com> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp> v2: new in v2 --- kernel/seccomp.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/kernel/seccomp.c b/kernel/seccomp.c index a169a62cb78a..f136eca93f2f 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -1180,7 +1180,7 @@ long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter) return do_seccomp(op, 0, uargs); } -#if defined(CONFIG_SECCOMP_FILTER) && defined(CONFIG_CHECKPOINT_RESTORE) +#if defined(CONFIG_SECCOMP_FILTER) static struct seccomp_filter *get_nth_filter(struct task_struct *task, unsigned long filter_off) { @@ -1227,6 +1227,7 @@ static struct seccomp_filter *get_nth_filter(struct task_struct *task, return filter; } +#if defined(CONFIG_CHECKPOINT_RESTORE) long seccomp_get_filter(struct task_struct *task, unsigned long filter_off, void __user *data) { @@ -1299,7 +1300,8 @@ long seccomp_get_metadata(struct task_struct *task, __put_seccomp_filter(filter); return ret; } -#endif +#endif /* CONFIG_CHECKPOINT_RESTORE */ +#endif /* CONFIG_SECCOMP_FILTER */ #ifdef CONFIG_SYSCTL -- 2.17.0 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace 2018-05-17 15:12 [PATCH v2 0/4] seccomp trap to userspace Tycho Andersen 2018-05-17 15:12 ` [PATCH v2 1/4] seccomp: add a return code to " Tycho Andersen 2018-05-17 15:12 ` [PATCH v2 2/4] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE Tycho Andersen @ 2018-05-17 15:12 ` Tycho Andersen 2018-05-17 15:41 ` Oleg Nesterov 2018-05-18 14:05 ` Christian Brauner 2018-05-17 15:12 ` [PATCH v2 4/4] seccomp: add support for passing fds via USER_NOTIF Tycho Andersen 2018-05-18 14:03 ` [PATCH v2 0/4] seccomp trap to userspace Christian Brauner 4 siblings, 2 replies; 20+ messages in thread From: Tycho Andersen @ 2018-05-17 15:12 UTC (permalink / raw) To: linux-kernel, containers Cc: Kees Cook, Andy Lutomirski, Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda, Tobin C . Harding, Tycho Andersen As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace() version which can acquire filters is useful. There are at least two reasons this is preferable, even though it uses ptrace: 1. You can control tasks that aren't cooperating with you 2. You can control tasks whose filters block sendmsg() and socket(); if the task installs a filter which blocks these calls, there's no way with SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task. v2: fix a bug where listener mode was not unset when an unused fd was not available Signed-off-by: Tycho Andersen <tycho@tycho.ws> CC: Kees Cook <keescook@chromium.org> CC: Andy Lutomirski <luto@amacapital.net> CC: Oleg Nesterov <oleg@redhat.com> CC: Eric W. Biederman <ebiederm@xmission.com> CC: "Serge E. Hallyn" <serge@hallyn.com> CC: Christian Brauner <christian.brauner@ubuntu.com> CC: Tyler Hicks <tyhicks@canonical.com> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp> --- include/linux/seccomp.h | 11 ++++ include/uapi/linux/ptrace.h | 2 + kernel/ptrace.c | 4 ++ kernel/seccomp.c | 27 ++++++++ tools/testing/selftests/seccomp/seccomp_bpf.c | 66 +++++++++++++++++++ 5 files changed, 110 insertions(+) diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 0fd3e0676a1c..10e684899b7b 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -111,4 +111,15 @@ static inline long seccomp_get_metadata(struct task_struct *task, return -EINVAL; } #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */ + +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION +extern long seccomp_get_listener(struct task_struct *task, + unsigned long filter_off); +#else +static inline long seccomp_get_listener(struct task_struct *task, + unsigned long filter_off) +{ + return -EINVAL; +} +#endif/* CONFIG_SECCOMP_USER_NOTIFICATION */ #endif /* _LINUX_SECCOMP_H */ diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h index d5a1b8a492b9..dc0abf81de3b 100644 --- a/include/uapi/linux/ptrace.h +++ b/include/uapi/linux/ptrace.h @@ -73,6 +73,8 @@ struct seccomp_metadata { __u64 flags; /* Output: filter's flags */ }; +#define PTRACE_SECCOMP_GET_LISTENER 0x420e + /* Read signals from a shared (process wide) queue */ #define PTRACE_PEEKSIGINFO_SHARED (1 << 0) diff --git a/kernel/ptrace.c b/kernel/ptrace.c index 21fec73d45d4..fcbdb6f4dc07 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -1096,6 +1096,10 @@ int ptrace_request(struct task_struct *child, long request, ret = seccomp_get_metadata(child, addr, datavp); break; + case PTRACE_SECCOMP_GET_LISTENER: + ret = seccomp_get_listener(child, addr); + break; + default: break; } diff --git a/kernel/seccomp.c b/kernel/seccomp.c index f136eca93f2f..7c23aee76bb4 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -1678,4 +1678,31 @@ static struct file *init_listener(struct task_struct *task, return ret; } + +long seccomp_get_listener(struct task_struct *task, + unsigned long filter_off) +{ + struct seccomp_filter *filter; + struct file *listener; + int fd; + + filter = get_nth_filter(task, filter_off); + if (IS_ERR(filter)) + return PTR_ERR(filter); + + fd = get_unused_fd_flags(O_RDWR); + if (fd < 0) { + __put_seccomp_filter(filter); + return fd; + } + + listener = init_listener(task, task->seccomp.filter); + if (IS_ERR(listener)) { + put_unused_fd(fd); + return PTR_ERR(listener); + } + + fd_install(fd, listener); + return fd; +} #endif diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c index bb96df66222f..473905f33e0b 100644 --- a/tools/testing/selftests/seccomp/seccomp_bpf.c +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c @@ -178,6 +178,10 @@ int seccomp(unsigned int op, unsigned int flags, void *args) } #endif +#ifndef PTRACE_SECCOMP_GET_LISTENER +#define PTRACE_SECCOMP_GET_LISTENER 0x420e +#endif + #if __BYTE_ORDER == __LITTLE_ENDIAN #define syscall_arg(_n) (offsetof(struct seccomp_data, args[_n])) #elif __BYTE_ORDER == __BIG_ENDIAN @@ -3096,6 +3100,68 @@ TEST(get_user_notification_syscall) close(listener); } +TEST(get_user_notification_ptrace) +{ + pid_t pid; + int status, listener; + int sk_pair[2]; + char c; + struct seccomp_notif req = {}; + struct seccomp_notif_resp resp = {}; + + ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0); + + pid = fork(); + ASSERT_GE(pid, 0); + + if (pid == 0) { + EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0); + + /* Test that we get ENOSYS while not attached */ + EXPECT_EQ(syscall(__NR_getpid), -1); + EXPECT_EQ(errno, ENOSYS); + + /* Signal we're ready and have installed the filter. */ + EXPECT_EQ(write(sk_pair[1], "J", 1), 1); + + EXPECT_EQ(read(sk_pair[1], &c, 1), 1); + EXPECT_EQ(c, 'H'); + + exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC); + } + + EXPECT_EQ(read(sk_pair[0], &c, 1), 1); + EXPECT_EQ(c, 'J'); + + EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0); + EXPECT_EQ(waitpid(pid, NULL, 0), pid); + listener = ptrace(PTRACE_SECCOMP_GET_LISTENER, pid, 0); + EXPECT_GE(listener, 0); + + /* EBUSY for second listener */ + EXPECT_EQ(ptrace(PTRACE_SECCOMP_GET_LISTENER, pid, 0), -1); + EXPECT_EQ(errno, EBUSY); + + EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0); + + /* Now signal we are done and respond with magic */ + EXPECT_EQ(write(sk_pair[0], "H", 1), 1); + + EXPECT_EQ(read(listener, &req, sizeof(req)), sizeof(req)); + + resp.id = req.id; + resp.error = 0; + resp.val = USER_NOTIF_MAGIC; + + EXPECT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp)); + + EXPECT_EQ(waitpid(pid, &status, 0), pid); + EXPECT_EQ(true, WIFEXITED(status)); + EXPECT_EQ(0, WEXITSTATUS(status)); + + close(listener); +} + /* * TODO: * - add microbenchmarks -- 2.17.0 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace 2018-05-17 15:12 ` [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace Tycho Andersen @ 2018-05-17 15:41 ` Oleg Nesterov 2018-05-17 15:57 ` Tycho Andersen 2018-05-18 14:05 ` Christian Brauner 1 sibling, 1 reply; 20+ messages in thread From: Oleg Nesterov @ 2018-05-17 15:41 UTC (permalink / raw) To: Tycho Andersen Cc: linux-kernel, containers, Kees Cook, Andy Lutomirski, Eric W . Biederman, Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda, Tobin C . Harding again, I don't understand this code yet, but On 05/17, Tycho Andersen wrote: > > +long seccomp_get_listener(struct task_struct *task, > + unsigned long filter_off) > +{ > + struct seccomp_filter *filter; > + struct file *listener; > + int fd; > + > + filter = get_nth_filter(task, filter_off); > + if (IS_ERR(filter)) > + return PTR_ERR(filter); > + > + fd = get_unused_fd_flags(O_RDWR); > + if (fd < 0) { > + __put_seccomp_filter(filter); > + return fd; > + } > + > + listener = init_listener(task, task->seccomp.filter); > + if (IS_ERR(listener)) { > + put_unused_fd(fd); > + return PTR_ERR(listener); __put_seccomp_filter() ? and since init_listener() does __get_seccomp_filter() on sucess, it is needed uncondtitionally? Oleg. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace 2018-05-17 15:41 ` Oleg Nesterov @ 2018-05-17 15:57 ` Tycho Andersen 2018-05-17 15:59 ` Tycho Andersen 0 siblings, 1 reply; 20+ messages in thread From: Tycho Andersen @ 2018-05-17 15:57 UTC (permalink / raw) To: Oleg Nesterov Cc: linux-kernel, containers, Kees Cook, Andy Lutomirski, Eric W . Biederman, Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda, Tobin C . Harding On Thu, May 17, 2018 at 05:41:39PM +0200, Oleg Nesterov wrote: > again, I don't understand this code yet, but > > On 05/17, Tycho Andersen wrote: > > > > +long seccomp_get_listener(struct task_struct *task, > > + unsigned long filter_off) > > +{ > > + struct seccomp_filter *filter; > > + struct file *listener; > > + int fd; > > + > > + filter = get_nth_filter(task, filter_off); > > + if (IS_ERR(filter)) > > + return PTR_ERR(filter); > > + > > + fd = get_unused_fd_flags(O_RDWR); > > + if (fd < 0) { > > + __put_seccomp_filter(filter); > > + return fd; > > + } > > + > > + listener = init_listener(task, task->seccomp.filter); > > + if (IS_ERR(listener)) { > > + put_unused_fd(fd); > > + return PTR_ERR(listener); > > __put_seccomp_filter() ? Yes, I think you're right here. > and since init_listener() does __get_seccomp_filter() on sucess, it is needed > uncondtitionally? I think there does need to be a __get_seccomp_filter() on success in init_listener(), because it's paired with the __put_seccomp_filter in seccomp_notify_release. The listener fd has a reference to the filter, and that shouldn't go away until after the fd is freed. Tycho ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace 2018-05-17 15:57 ` Tycho Andersen @ 2018-05-17 15:59 ` Tycho Andersen 0 siblings, 0 replies; 20+ messages in thread From: Tycho Andersen @ 2018-05-17 15:59 UTC (permalink / raw) To: Oleg Nesterov Cc: linux-kernel, containers, Kees Cook, Andy Lutomirski, Eric W . Biederman, Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda, Tobin C . Harding On Thu, May 17, 2018 at 09:57:33AM -0600, Tycho Andersen wrote: > On Thu, May 17, 2018 at 05:41:39PM +0200, Oleg Nesterov wrote: > > and since init_listener() does __get_seccomp_filter() on sucess, it is needed > > uncondtitionally? > > I think there does need to be a __get_seccomp_filter() on success in > init_listener(), because it's paired with the __put_seccomp_filter in > seccomp_notify_release. The listener fd has a reference to the filter, > and that shouldn't go away until after the fd is freed. Oh, sorry. I see what you meant here. Yes, it should be unconditional. Tycho ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace 2018-05-17 15:12 ` [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace Tycho Andersen 2018-05-17 15:41 ` Oleg Nesterov @ 2018-05-18 14:05 ` Christian Brauner 2018-05-18 15:10 ` Tycho Andersen 1 sibling, 1 reply; 20+ messages in thread From: Christian Brauner @ 2018-05-18 14:05 UTC (permalink / raw) To: Tycho Andersen Cc: linux-kernel, containers, Tobin C . Harding, Kees Cook, Akihiro Suda, Oleg Nesterov, Andy Lutomirski, Eric W . Biederman, Christian Brauner, Tyler Hicks On Thu, May 17, 2018 at 09:12:17AM -0600, Tycho Andersen wrote: > As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace() > version which can acquire filters is useful. There are at least two reasons > this is preferable, even though it uses ptrace: > > 1. You can control tasks that aren't cooperating with you > 2. You can control tasks whose filters block sendmsg() and socket(); if the > task installs a filter which blocks these calls, there's no way with > SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task. I get the problem I guess the question we need to answer is do we care enought to bring ptrace into this? Not really objecting, just asking. :) If blocking sendmsg() or socket() becomes an issue because people like to shoot themselves in the foot we can surely add this option later. Christian > > v2: fix a bug where listener mode was not unset when an unused fd was not > available > > Signed-off-by: Tycho Andersen <tycho@tycho.ws> > CC: Kees Cook <keescook@chromium.org> > CC: Andy Lutomirski <luto@amacapital.net> > CC: Oleg Nesterov <oleg@redhat.com> > CC: Eric W. Biederman <ebiederm@xmission.com> > CC: "Serge E. Hallyn" <serge@hallyn.com> > CC: Christian Brauner <christian.brauner@ubuntu.com> > CC: Tyler Hicks <tyhicks@canonical.com> > CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp> > --- > include/linux/seccomp.h | 11 ++++ > include/uapi/linux/ptrace.h | 2 + > kernel/ptrace.c | 4 ++ > kernel/seccomp.c | 27 ++++++++ > tools/testing/selftests/seccomp/seccomp_bpf.c | 66 +++++++++++++++++++ > 5 files changed, 110 insertions(+) > > diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h > index 0fd3e0676a1c..10e684899b7b 100644 > --- a/include/linux/seccomp.h > +++ b/include/linux/seccomp.h > @@ -111,4 +111,15 @@ static inline long seccomp_get_metadata(struct task_struct *task, > return -EINVAL; > } > #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */ > + > +#ifdef CONFIG_SECCOMP_USER_NOTIFICATION > +extern long seccomp_get_listener(struct task_struct *task, > + unsigned long filter_off); > +#else > +static inline long seccomp_get_listener(struct task_struct *task, > + unsigned long filter_off) > +{ > + return -EINVAL; > +} > +#endif/* CONFIG_SECCOMP_USER_NOTIFICATION */ > #endif /* _LINUX_SECCOMP_H */ > diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h > index d5a1b8a492b9..dc0abf81de3b 100644 > --- a/include/uapi/linux/ptrace.h > +++ b/include/uapi/linux/ptrace.h > @@ -73,6 +73,8 @@ struct seccomp_metadata { > __u64 flags; /* Output: filter's flags */ > }; > > +#define PTRACE_SECCOMP_GET_LISTENER 0x420e > + > /* Read signals from a shared (process wide) queue */ > #define PTRACE_PEEKSIGINFO_SHARED (1 << 0) > > diff --git a/kernel/ptrace.c b/kernel/ptrace.c > index 21fec73d45d4..fcbdb6f4dc07 100644 > --- a/kernel/ptrace.c > +++ b/kernel/ptrace.c > @@ -1096,6 +1096,10 @@ int ptrace_request(struct task_struct *child, long request, > ret = seccomp_get_metadata(child, addr, datavp); > break; > > + case PTRACE_SECCOMP_GET_LISTENER: > + ret = seccomp_get_listener(child, addr); > + break; > + > default: > break; > } > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index f136eca93f2f..7c23aee76bb4 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -1678,4 +1678,31 @@ static struct file *init_listener(struct task_struct *task, > > return ret; > } > + > +long seccomp_get_listener(struct task_struct *task, > + unsigned long filter_off) > +{ > + struct seccomp_filter *filter; > + struct file *listener; > + int fd; > + > + filter = get_nth_filter(task, filter_off); > + if (IS_ERR(filter)) > + return PTR_ERR(filter); > + > + fd = get_unused_fd_flags(O_RDWR); > + if (fd < 0) { > + __put_seccomp_filter(filter); > + return fd; > + } > + > + listener = init_listener(task, task->seccomp.filter); > + if (IS_ERR(listener)) { > + put_unused_fd(fd); > + return PTR_ERR(listener); > + } > + > + fd_install(fd, listener); > + return fd; > +} > #endif > diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c > index bb96df66222f..473905f33e0b 100644 > --- a/tools/testing/selftests/seccomp/seccomp_bpf.c > +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c > @@ -178,6 +178,10 @@ int seccomp(unsigned int op, unsigned int flags, void *args) > } > #endif > > +#ifndef PTRACE_SECCOMP_GET_LISTENER > +#define PTRACE_SECCOMP_GET_LISTENER 0x420e > +#endif > + > #if __BYTE_ORDER == __LITTLE_ENDIAN > #define syscall_arg(_n) (offsetof(struct seccomp_data, args[_n])) > #elif __BYTE_ORDER == __BIG_ENDIAN > @@ -3096,6 +3100,68 @@ TEST(get_user_notification_syscall) > close(listener); > } > > +TEST(get_user_notification_ptrace) > +{ > + pid_t pid; > + int status, listener; > + int sk_pair[2]; > + char c; > + struct seccomp_notif req = {}; > + struct seccomp_notif_resp resp = {}; > + > + ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0); > + > + pid = fork(); > + ASSERT_GE(pid, 0); > + > + if (pid == 0) { > + EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0); > + > + /* Test that we get ENOSYS while not attached */ > + EXPECT_EQ(syscall(__NR_getpid), -1); > + EXPECT_EQ(errno, ENOSYS); > + > + /* Signal we're ready and have installed the filter. */ > + EXPECT_EQ(write(sk_pair[1], "J", 1), 1); > + > + EXPECT_EQ(read(sk_pair[1], &c, 1), 1); > + EXPECT_EQ(c, 'H'); > + > + exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC); > + } > + > + EXPECT_EQ(read(sk_pair[0], &c, 1), 1); > + EXPECT_EQ(c, 'J'); > + > + EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0); > + EXPECT_EQ(waitpid(pid, NULL, 0), pid); > + listener = ptrace(PTRACE_SECCOMP_GET_LISTENER, pid, 0); > + EXPECT_GE(listener, 0); > + > + /* EBUSY for second listener */ > + EXPECT_EQ(ptrace(PTRACE_SECCOMP_GET_LISTENER, pid, 0), -1); > + EXPECT_EQ(errno, EBUSY); > + > + EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0); > + > + /* Now signal we are done and respond with magic */ > + EXPECT_EQ(write(sk_pair[0], "H", 1), 1); > + > + EXPECT_EQ(read(listener, &req, sizeof(req)), sizeof(req)); > + > + resp.id = req.id; > + resp.error = 0; > + resp.val = USER_NOTIF_MAGIC; > + > + EXPECT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp)); > + > + EXPECT_EQ(waitpid(pid, &status, 0), pid); > + EXPECT_EQ(true, WIFEXITED(status)); > + EXPECT_EQ(0, WEXITSTATUS(status)); > + > + close(listener); > +} > + > /* > * TODO: > * - add microbenchmarks > -- > 2.17.0 > > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace 2018-05-18 14:05 ` Christian Brauner @ 2018-05-18 15:10 ` Tycho Andersen 0 siblings, 0 replies; 20+ messages in thread From: Tycho Andersen @ 2018-05-18 15:10 UTC (permalink / raw) To: Christian Brauner Cc: linux-kernel, containers, Tobin C . Harding, Kees Cook, Akihiro Suda, Oleg Nesterov, Andy Lutomirski, Eric W . Biederman, Christian Brauner, Tyler Hicks On Fri, May 18, 2018 at 04:05:56PM +0200, Christian Brauner wrote: > On Thu, May 17, 2018 at 09:12:17AM -0600, Tycho Andersen wrote: > > As an alternative to SECCOMP_FILTER_FLAG_GET_LISTENER, perhaps a ptrace() > > version which can acquire filters is useful. There are at least two reasons > > this is preferable, even though it uses ptrace: > > > > 1. You can control tasks that aren't cooperating with you > > 2. You can control tasks whose filters block sendmsg() and socket(); if the > > task installs a filter which blocks these calls, there's no way with > > SECCOMP_FILTER_FLAG_GET_LISTENER to get the fd out to the privileged task. > > I get the problem I guess the question we need to answer is do we care > enought to bring ptrace into this? Not really objecting, just asking. :) > If blocking sendmsg() or socket() becomes an issue because people like > to shoot themselves in the foot we can surely add this option later. It doesn't seem that unreasonable to me to want to filter socket() or sendmsg() though, so designing an API that doesn't support that from the get-go seems like a bad idea. But that's why there are two alternatives, so we can argue about it :) Tycho ^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH v2 4/4] seccomp: add support for passing fds via USER_NOTIF 2018-05-17 15:12 [PATCH v2 0/4] seccomp trap to userspace Tycho Andersen ` (2 preceding siblings ...) 2018-05-17 15:12 ` [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace Tycho Andersen @ 2018-05-17 15:12 ` Tycho Andersen 2018-05-18 14:03 ` [PATCH v2 0/4] seccomp trap to userspace Christian Brauner 4 siblings, 0 replies; 20+ messages in thread From: Tycho Andersen @ 2018-05-17 15:12 UTC (permalink / raw) To: linux-kernel, containers Cc: Kees Cook, Andy Lutomirski, Oleg Nesterov, Eric W . Biederman, Serge E . Hallyn, Christian Brauner, Tyler Hicks, Akihiro Suda, Tobin C . Harding, Tycho Andersen The idea here is that the userspace handler should be able to pass an fd back to the trapped task, for example so it can be returned from socket(). I've proposed one API here, but I'm open to other options. In particular, this only lets you return an fd from a syscall, which may not be enough in all cases. For example, if an fd is written to an output parameter instead of returned, the current API can't handle this. Another case is that netlink takes as input fds sometimes (IFLA_NET_NS_FD, e.g.). If netlink ever decides to install an fd and output it, we wouldn't be able to handle this either. Still, the vast majority of interesting cases are covered by this API, so perhaps it is Enough. I've left it as a separate commit for two reasons: * It illustrates the way in which we would grow struct seccomp_notif and struct seccomp_notif_resp without using netlink * It shows just how little code is needed to accomplish this :) v2: new in v2 Signed-off-by: Tycho Andersen <tycho@tycho.ws> CC: Kees Cook <keescook@chromium.org> CC: Andy Lutomirski <luto@amacapital.net> CC: Oleg Nesterov <oleg@redhat.com> CC: Eric W. Biederman <ebiederm@xmission.com> CC: "Serge E. Hallyn" <serge@hallyn.com> CC: Christian Brauner <christian.brauner@ubuntu.com> CC: Tyler Hicks <tyhicks@canonical.com> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp> --- include/uapi/linux/seccomp.h | 2 + kernel/seccomp.c | 49 +++++++- tools/testing/selftests/seccomp/seccomp_bpf.c | 112 ++++++++++++++++++ 3 files changed, 161 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h index 8160e6cad528..3124427219cb 100644 --- a/include/uapi/linux/seccomp.h +++ b/include/uapi/linux/seccomp.h @@ -71,6 +71,8 @@ struct seccomp_notif_resp { __u64 id; __s32 error; __s64 val; + __u8 return_fd; + __u32 fd; }; #endif /* _UAPI_LINUX_SECCOMP_H */ diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 7c23aee76bb4..c783b8dcd001 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -77,6 +77,8 @@ struct seccomp_knotif { /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */ int error; long val; + struct file *file; + unsigned int flags; /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */ struct completion ready; @@ -785,13 +787,35 @@ static void seccomp_do_user_notification(int this_syscall, goto remove_list; } - ret = n.val; - err = n.error; + if (n.file) { + int fd; + + fd = get_unused_fd_flags(n.flags); + if (fd < 0) { + err = fd; + ret = -1; + goto remove_list; + } + + ret = fd; + err = 0; + + fd_install(fd, n.file); + /* Don't fput, since fd has a reference now */ + n.file = NULL; + } else { + ret = n.val; + err = n.error; + } + WARN(n.state != SECCOMP_NOTIFY_REPLIED, "notified about write complete when state is not write"); remove_list: + if (n.file) + fput(n.file); + list_del(&n.list); out: mutex_unlock(&match->notify_lock); @@ -1610,6 +1634,27 @@ static ssize_t seccomp_notify_write(struct file *file, const char __user *buf, knotif->state = SECCOMP_NOTIFY_REPLIED; knotif->error = resp.error; knotif->val = resp.val; + + if (resp.return_fd) { + struct fd fd; + + /* + * This is a little hokey: we need a real fget() (i.e. not + * __fget_light(), which is what fdget does), but we also need + * the flags from strcut fd. So, we get it, put it, and get it + * again for real. + */ + fd = fdget(resp.fd); + knotif->flags = fd.flags; + fdput(fd); + + knotif->file = fget(resp.fd); + if (!knotif->file) { + ret = -EBADF; + goto out; + } + } + complete(&knotif->ready); out: mutex_unlock(&filter->notify_lock); diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c index 473905f33e0b..b04d3ecc61f4 100644 --- a/tools/testing/selftests/seccomp/seccomp_bpf.c +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c @@ -167,6 +167,8 @@ struct seccomp_notif_resp { __u64 id; __s32 error; __s64 val; + __u8 return_fd; + __u32 fd; }; #endif @@ -3162,6 +3164,116 @@ TEST(get_user_notification_ptrace) close(listener); } +TEST(user_notification_pass_fd) +{ + pid_t pid; + int status, listener; + int sk_pair[2]; + char c; + struct seccomp_notif req = {}; + struct seccomp_notif_resp resp = {}; + long ret; + + ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0); + + pid = fork(); + ASSERT_GE(pid, 0); + + if (pid == 0) { + int fd; + char buf[16]; + + EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0); + + /* Signal we're ready and have installed the filter. */ + EXPECT_EQ(write(sk_pair[1], "J", 1), 1); + + EXPECT_EQ(read(sk_pair[1], &c, 1), 1); + EXPECT_EQ(c, 'H'); + close(sk_pair[1]); + + /* An fd from getpid(). Let the games begin. */ + fd = syscall(__NR_getpid); + EXPECT_GT(fd, 0); + EXPECT_EQ(read(fd, buf, sizeof(buf)), 12); + close(fd); + + exit(strcmp("hello world", buf)); + } + + EXPECT_EQ(read(sk_pair[0], &c, 1), 1); + EXPECT_EQ(c, 'J'); + + EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0); + EXPECT_EQ(waitpid(pid, NULL, 0), pid); + listener = ptrace(PTRACE_SECCOMP_GET_LISTENER, pid, 0); + EXPECT_GE(listener, 0); + EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0); + + /* Now signal we are done installing so it can do a getpid */ + EXPECT_EQ(write(sk_pair[0], "H", 1), 1); + close(sk_pair[0]); + + /* Make a new socket pair so we can send half across */ + EXPECT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0); + + ret = read_notif(listener, &req); + EXPECT_EQ(ret, sizeof(req)); + EXPECT_EQ(errno, 0); + + resp.id = req.id; + resp.return_fd = 1; + resp.fd = sk_pair[1]; + EXPECT_EQ(write(listener, &resp, sizeof(resp)), sizeof(resp)); + close(sk_pair[1]); + + EXPECT_EQ(write(sk_pair[0], "hello world\0", 12), 12); + close(sk_pair[0]); + + EXPECT_EQ(waitpid(pid, &status, 0), pid); + EXPECT_EQ(true, WIFEXITED(status)); + EXPECT_EQ(0, WEXITSTATUS(status)); + close(listener); +} + +TEST(user_notification_struct_size_mismatch) +{ + pid_t pid; + long ret; + int status, listener, len; + struct seccomp_notif req; + struct seccomp_notif_resp resp; + + listener = user_trap_syscall(__NR_getpid, + SECCOMP_FILTER_FLAG_GET_LISTENER); + EXPECT_GE(listener, 0); + + pid = fork(); + ASSERT_GE(pid, 0); + + if (pid == 0) { + ret = syscall(__NR_getpid); + exit(ret != USER_NOTIF_MAGIC); + } + + EXPECT_EQ(read(listener, &req, sizeof(req)), sizeof(req)); + + resp.id = req.id; + resp.error = 0; + resp.val = USER_NOTIF_MAGIC; + + /* + * Only write a partial structure: this is what was available before we + * had fd support. + */ + len = offsetof(struct seccomp_notif_resp, val) + sizeof(resp.val); + EXPECT_EQ(write(listener, &resp, len), len); + + EXPECT_EQ(waitpid(pid, &status, 0), pid); + EXPECT_EQ(true, WIFEXITED(status)); + EXPECT_EQ(0, WEXITSTATUS(status)); +} + /* * TODO: * - add microbenchmarks -- 2.17.0 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH v2 0/4] seccomp trap to userspace 2018-05-17 15:12 [PATCH v2 0/4] seccomp trap to userspace Tycho Andersen ` (3 preceding siblings ...) 2018-05-17 15:12 ` [PATCH v2 4/4] seccomp: add support for passing fds via USER_NOTIF Tycho Andersen @ 2018-05-18 14:03 ` Christian Brauner 4 siblings, 0 replies; 20+ messages in thread From: Christian Brauner @ 2018-05-18 14:03 UTC (permalink / raw) To: Tycho Andersen Cc: linux-kernel, containers, Tobin C . Harding, Kees Cook, Akihiro Suda, Oleg Nesterov, Andy Lutomirski, Eric W . Biederman, Christian Brauner, Tyler Hicks On Thu, May 17, 2018 at 09:12:14AM -0600, Tycho Andersen wrote: > Hi, > > After a while focusing on other things, I finally managed ot get a v2 of > this series prepared. I believe I've addressed all the feedback from v1, > except for one major point: switching the communication protocol over > the fd to nlattr. I looked into doing this, but the kernel stuff for > dealing with nlattr seems to require an skb (via nlmsg_{new,put} and > netlink_unicast), which means we need to deal with the netlink sequence > numbers, portids, and create a socket protocol. I can do this if we > still think nlattr is necessary, but based on looking at it, it seems > like a lot of extra code for no real benefit. Yes, we've had that discussion before and I agree. I fail to see the benefit here too. Christian > > I've also added support for passing fds. The code itself is simple, but > the API could/should probably be different, see patch 4 for discussion. > > Tycho > > Tycho Andersen (4): > seccomp: add a return code to trap to userspace > seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE > seccomp: add a way to get a listener fd from ptrace > seccomp: add support for passing fds via USER_NOTIF > > arch/Kconfig | 7 + > include/linux/seccomp.h | 14 +- > include/uapi/linux/ptrace.h | 2 + > include/uapi/linux/seccomp.h | 20 +- > kernel/ptrace.c | 4 + > kernel/seccomp.c | 480 +++++++++++++++++- > tools/testing/selftests/seccomp/seccomp_bpf.c | 359 ++++++++++++- > 7 files changed, 878 insertions(+), 8 deletions(-) > > -- > 2.17.0 > > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2018-05-24 15:28 UTC | newest] Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-05-17 15:12 [PATCH v2 0/4] seccomp trap to userspace Tycho Andersen 2018-05-17 15:12 ` [PATCH v2 1/4] seccomp: add a return code to " Tycho Andersen 2018-05-17 15:33 ` Oleg Nesterov 2018-05-17 15:39 ` Tycho Andersen 2018-05-17 15:46 ` Oleg Nesterov 2018-05-24 15:28 ` Tycho Andersen 2018-05-18 14:04 ` Christian Brauner 2018-05-18 15:21 ` Tycho Andersen 2018-05-19 0:14 ` kbuild test robot 2018-05-19 5:01 ` kbuild test robot 2018-05-21 22:55 ` Tycho Andersen 2018-05-17 15:12 ` [PATCH v2 2/4] seccomp: make get_nth_filter available outside of CHECKPOINT_RESTORE Tycho Andersen 2018-05-17 15:12 ` [PATCH v2 3/4] seccomp: add a way to get a listener fd from ptrace Tycho Andersen 2018-05-17 15:41 ` Oleg Nesterov 2018-05-17 15:57 ` Tycho Andersen 2018-05-17 15:59 ` Tycho Andersen 2018-05-18 14:05 ` Christian Brauner 2018-05-18 15:10 ` Tycho Andersen 2018-05-17 15:12 ` [PATCH v2 4/4] seccomp: add support for passing fds via USER_NOTIF Tycho Andersen 2018-05-18 14:03 ` [PATCH v2 0/4] seccomp trap to userspace Christian Brauner
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).