Earlier this week Kumar Kartikeya Dwivedi posted a patchset switching io_uring over to the anonymous inode variant that allows for LSM controls. While nice, the patchset left the actual LSM controls as an exercise for the reader. The posting can be found using the lore link below: https://lore.kernel.org/io-uring/CAHC9VhS=PDxx=MzZnGGNLwo-o5Og-HGZe84=+BBtBCZgaGSn4A@mail.gmail.com/T/#mde8c5120f3b8e34a5a3b18229b8c563a7855fd20 As fate would have it, I had been working on something very similar, in fact the two patches from Kumar mirrored two in my own patchset. This patchset, while still a bit crude, does include an attempt at adding the LSM and audit support necessary to properly implement LSM based access controls for io_uring. I've provided the SELinux implementation, Casey has been nice enough to provide a Smack patch, and John is working on an AppArmor patch as I write this. I've mentioned this work to the other LSM maintainers that I believe might be affected but I have not heard back from anyone else at this point. If any of the other LSMs would like to contribute a patch to this patchset I will happily accept it; I only ask that you post it to the LSM list and make sure I am on the To/CC line. I think it would be nice to try and wrap this up as soon as possible for the obvious reasons. The individual patches provide an explanation of the changes involved so I'm not going to repeat that here, but I will caution you that these patches are still rather crude, perhaps more than a RFC patchset should be, but it seemed prudent to move this along so I'm posting these now. Any review that you can provide would be helpful. Also, any pointers to easy-to-run io_uring tests would be helpful. I am particularly interested in tests which make use of the personality option, share urings across process boundaries, and make use of the sqpoll functionality. As a point of reference, this patchset is based on v5.13-rc2 and if you want to follow along via git I'll be making updates to the git tree/branch below (warning I will be force-pushing on this branch given the early/rough nature of these patches). git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git (checkout branch "working-io_uring") Thanks in advance, -Paul --- Casey Schaufler (1): Smack: Brutalist io_uring support with debug Paul Moore (8): audit: prepare audit_context for use in calling contexts beyond syscalls audit,io_uring,io-wq: add some basic audit support to io_uring audit: dev/test patch to force io_uring auditing audit: add filtering for io_uring records fs: add anon_inode_getfile_secure() similar to anon_inode_getfd_secure() io_uring: convert io_uring to the secure anon inode interface lsm,io_uring: add LSM hooks to io_uring selinux: add support for the io_uring access controls fs/anon_inodes.c | 29 ++ fs/io-wq.c | 4 + fs/io_uring.c | 25 +- include/linux/anon_inodes.h | 4 + include/linux/audit.h | 17 + include/linux/lsm_hook_defs.h | 5 + include/linux/lsm_hooks.h | 13 + include/linux/security.h | 16 + include/uapi/linux/audit.h | 4 +- kernel/audit.h | 7 +- kernel/auditfilter.c | 4 +- kernel/auditsc.c | 481 ++++++++++++++++++++++------ security/security.c | 12 + security/selinux/hooks.c | 67 ++++ security/selinux/include/classmap.h | 2 + security/smack/smack_lsm.c | 64 ++++ 16 files changed, 650 insertions(+), 104 deletions(-)
WARNING - This is a work in progress and should not be merged anywhere important. It is almost surely not complete, and while it probably compiles it likely hasn't been booted and will do terrible things. You have been warned. This patch cleans up some of our audit_context handling by abstracting out the reset and return code fixup handling to dedicated functions. Not only does this help make things easier to read and inspect, it allows for easier reuse be future patches. We also convert the simple audit_context->in_syscall flag into an enum which can be used to by future patches to indicate a calling context other than the syscall context. Thanks to Richard Guy Briggs for review and feedback. Acked-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com> --- kernel/audit.h | 5 + kernel/auditsc.c | 255 +++++++++++++++++++++++++++++++++++------------------- 2 files changed, 167 insertions(+), 93 deletions(-) diff --git a/kernel/audit.h b/kernel/audit.h index 1522e100fd17..fba180de5912 100644 --- a/kernel/audit.h +++ b/kernel/audit.h @@ -97,7 +97,10 @@ struct audit_proctitle { /* The per-task audit context. */ struct audit_context { int dummy; /* must be the first element */ - int in_syscall; /* 1 if task is in a syscall */ + enum { + AUDIT_CTX_UNUSED, /* audit_context is currently unused */ + AUDIT_CTX_SYSCALL, /* in use by syscall */ + } context; enum audit_state state, current_state; unsigned int serial; /* serial number for record */ int major; /* syscall number */ diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 175ef6f3ea4e..cc89e9f9a753 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -909,10 +909,80 @@ static inline void audit_free_aux(struct audit_context *context) context->aux = aux->next; kfree(aux); } + context->aux = NULL; while ((aux = context->aux_pids)) { context->aux_pids = aux->next; kfree(aux); } + context->aux_pids = NULL; +} + +/** + * audit_reset_context - reset a audit_context structure + * @ctx: the audit_context to reset + * + * All fields in the audit_context will be reset to an initial state, all + * references held by fields will be dropped, and private memory will be + * released. When this function returns the audit_context will be suitable + * for reuse, so long as the passed context is not NULL or a dummy context. + */ +static void audit_reset_context(struct audit_context *ctx) +{ + if (!ctx) + return; + + /* if ctx is non-null, reset the "ctx->state" regardless */ + ctx->context = AUDIT_CTX_UNUSED; + if (ctx->dummy) + return; + + /* + * NOTE: It shouldn't matter in what order we release the fields, so + * release them in the order in which they appear in the struct; + * this gives us some hope of quickly making sure we are + * resetting the audit_context properly. + * + * Other things worth mentioning: + * - we don't reset "dummy" + * - we don't reset "state", we do reset "current_state" + * - we preserver "filterkey" if "state" is AUDIT_RECORD_CONTEXT + * - much of this is likely overkill, but play it safe for now + * - we really need to work on improving the audit_context struct + */ + + ctx->current_state = ctx->state; + ctx->serial = 0; + ctx->major = 0; + ctx->ctime = (struct timespec64){ .tv_sec = 0, .tv_nsec = 0 }; + memset(ctx->argv, 0, sizeof(ctx->argv)); + ctx->return_code = 0; + ctx->prio = (ctx->state == AUDIT_RECORD_CONTEXT ? ~0ULL : 0); + ctx->return_valid = AUDITSC_INVALID; + audit_free_names(ctx); + if (ctx->state != AUDIT_RECORD_CONTEXT) { + kfree(ctx->filterkey); + ctx->filterkey = NULL; + } + audit_free_aux(ctx); + kfree(ctx->sockaddr); + ctx->sockaddr = NULL; + ctx->sockaddr_len = 0; + ctx->pid = ctx->ppid = 0; + ctx->uid = ctx->euid = ctx->suid = ctx->fsuid = KUIDT_INIT(0); + ctx->gid = ctx->egid = ctx->sgid = ctx->fsgid = KGIDT_INIT(0); + ctx->personality = 0; + ctx->arch = 0; + ctx->target_pid = 0; + ctx->target_auid = ctx->target_uid = KUIDT_INIT(0); + ctx->target_sessionid = 0; + ctx->target_sid = 0; + ctx->target_comm[0] = '\0'; + unroll_tree_refs(ctx, NULL, 0); + WARN_ON(!list_empty(&ctx->killed_trees)); + ctx->type = 0; + audit_free_module(ctx); + ctx->fds[0] = -1; + audit_proctitle_free(ctx); } static inline struct audit_context *audit_alloc_context(enum audit_state state) @@ -922,6 +992,7 @@ static inline struct audit_context *audit_alloc_context(enum audit_state state) context = kzalloc(sizeof(*context), GFP_KERNEL); if (!context) return NULL; + context->context = AUDIT_CTX_UNUSED; context->state = state; context->prio = state == AUDIT_RECORD_CONTEXT ? ~0ULL : 0; INIT_LIST_HEAD(&context->killed_trees); @@ -947,7 +1018,7 @@ int audit_alloc(struct task_struct *tsk) char *key = NULL; if (likely(!audit_ever_enabled)) - return 0; /* Return if not auditing. */ + return 0; state = audit_filter_task(tsk, &key); if (state == AUDIT_DISABLED) { @@ -969,14 +1040,10 @@ int audit_alloc(struct task_struct *tsk) static inline void audit_free_context(struct audit_context *context) { - audit_free_module(context); - audit_free_names(context); - unroll_tree_refs(context, NULL, 0); + /* resetting is extra work, but it is likely just noise */ + audit_reset_context(context); free_tree_refs(context); - audit_free_aux(context); kfree(context->filterkey); - kfree(context->sockaddr); - audit_proctitle_free(context); kfree(context); } @@ -1479,29 +1546,35 @@ static void audit_log_exit(void) context->personality = current->personality; - ab = audit_log_start(context, GFP_KERNEL, AUDIT_SYSCALL); - if (!ab) - return; /* audit_panic has been called */ - audit_log_format(ab, "arch=%x syscall=%d", - context->arch, context->major); - if (context->personality != PER_LINUX) - audit_log_format(ab, " per=%lx", context->personality); - if (context->return_valid != AUDITSC_INVALID) - audit_log_format(ab, " success=%s exit=%ld", - (context->return_valid==AUDITSC_SUCCESS)?"yes":"no", - context->return_code); - - audit_log_format(ab, - " a0=%lx a1=%lx a2=%lx a3=%lx items=%d", - context->argv[0], - context->argv[1], - context->argv[2], - context->argv[3], - context->name_count); - - audit_log_task_info(ab); - audit_log_key(ab, context->filterkey); - audit_log_end(ab); + switch (context->context) { + case AUDIT_CTX_SYSCALL: + ab = audit_log_start(context, GFP_KERNEL, AUDIT_SYSCALL); + if (!ab) + return; + audit_log_format(ab, "arch=%x syscall=%d", + context->arch, context->major); + if (context->personality != PER_LINUX) + audit_log_format(ab, " per=%lx", context->personality); + if (context->return_valid != AUDITSC_INVALID) + audit_log_format(ab, " success=%s exit=%ld", + (context->return_valid == AUDITSC_SUCCESS ? + "yes" : "no"), + context->return_code); + audit_log_format(ab, + " a0=%lx a1=%lx a2=%lx a3=%lx items=%d", + context->argv[0], + context->argv[1], + context->argv[2], + context->argv[3], + context->name_count); + audit_log_task_info(ab); + audit_log_key(ab, context->filterkey); + audit_log_end(ab); + break; + default: + BUG(); + break; + } for (aux = context->aux; aux; aux = aux->next) { @@ -1591,14 +1664,15 @@ static void audit_log_exit(void) audit_log_name(context, n, NULL, i++, &call_panic); } - audit_log_proctitle(); + if (context->context == AUDIT_CTX_SYSCALL) + audit_log_proctitle(); /* Send end of event record to help user space know we are finished */ ab = audit_log_start(context, GFP_KERNEL, AUDIT_EOE); if (ab) audit_log_end(ab); if (call_panic) - audit_panic("error converting sid to string"); + audit_panic("error in audit_log_exit()"); } /** @@ -1614,6 +1688,7 @@ void __audit_free(struct task_struct *tsk) if (!context) return; + /* this may generate CONFIG_CHANGE records */ if (!list_empty(&context->killed_trees)) audit_kill_trees(context); @@ -1622,7 +1697,8 @@ void __audit_free(struct task_struct *tsk) * random task_struct that doesn't doesn't have any meaningful data we * need to log via audit_log_exit(). */ - if (tsk == current && !context->dummy && context->in_syscall) { + if (tsk == current && !context->dummy && + context->context == AUDIT_CTX_SYSCALL) { context->return_valid = AUDITSC_INVALID; context->return_code = 0; @@ -1636,6 +1712,34 @@ void __audit_free(struct task_struct *tsk) audit_free_context(context); } +/** + * audit_return_fixup - fixup the return codes in the audit_context + * @ctx: the audit_context + * @success: true/false value to indicate if the operation succeeded or not + * @code: operation return code + * + * We need to fixup the return code in the audit logs if the actual return + * codes are later going to be fixed by the arch specific signal handlers. + */ +static void audit_return_fixup(struct audit_context *ctx, + int success, long code) +{ + /* + * This is actually a test for: + * (rc == ERESTARTSYS ) || (rc == ERESTARTNOINTR) || + * (rc == ERESTARTNOHAND) || (rc == ERESTART_RESTARTBLOCK) + * + * but is faster than a bunch of || + */ + if (unlikely(code <= -ERESTARTSYS) && + (code >= -ERESTART_RESTARTBLOCK) && + (code != -ENOIOCTLCMD)) + ctx->return_code = -EINTR; + else + ctx->return_code = code; + ctx->return_valid = (success ? AUDITSC_SUCCESS : AUDITSC_FAILURE); +} + /** * __audit_syscall_entry - fill in an audit record at syscall entry * @major: major syscall type (function) @@ -1661,7 +1765,12 @@ void __audit_syscall_entry(int major, unsigned long a1, unsigned long a2, if (!audit_enabled || !context) return; - BUG_ON(context->in_syscall || context->name_count); + WARN_ON(context->context != AUDIT_CTX_UNUSED); + WARN_ON(context->name_count); + if (context->context != AUDIT_CTX_UNUSED || context->name_count) { + audit_panic("unrecoverable error in audit_syscall_entry()"); + return; + } state = context->state; if (state == AUDIT_DISABLED) @@ -1680,10 +1789,8 @@ void __audit_syscall_entry(int major, unsigned long a1, unsigned long a2, context->argv[1] = a2; context->argv[2] = a3; context->argv[3] = a4; - context->serial = 0; - context->in_syscall = 1; + context->context = AUDIT_CTX_SYSCALL; context->current_state = state; - context->ppid = 0; ktime_get_coarse_real_ts64(&context->ctime); } @@ -1700,63 +1807,27 @@ void __audit_syscall_entry(int major, unsigned long a1, unsigned long a2, */ void __audit_syscall_exit(int success, long return_code) { - struct audit_context *context; + struct audit_context *context = audit_context(); - context = audit_context(); - if (!context) - return; + if (!context || context->dummy || + context->context != AUDIT_CTX_SYSCALL) + goto out; + /* this may generate CONFIG_CHANGE records */ if (!list_empty(&context->killed_trees)) audit_kill_trees(context); - if (!context->dummy && context->in_syscall) { - if (success) - context->return_valid = AUDITSC_SUCCESS; - else - context->return_valid = AUDITSC_FAILURE; - - /* - * we need to fix up the return code in the audit logs if the - * actual return codes are later going to be fixed up by the - * arch specific signal handlers - * - * This is actually a test for: - * (rc == ERESTARTSYS ) || (rc == ERESTARTNOINTR) || - * (rc == ERESTARTNOHAND) || (rc == ERESTART_RESTARTBLOCK) - * - * but is faster than a bunch of || - */ - if (unlikely(return_code <= -ERESTARTSYS) && - (return_code >= -ERESTART_RESTARTBLOCK) && - (return_code != -ENOIOCTLCMD)) - context->return_code = -EINTR; - else - context->return_code = return_code; - - audit_filter_syscall(current, context); - audit_filter_inodes(current, context); - if (context->current_state == AUDIT_RECORD_CONTEXT) - audit_log_exit(); - } + /* run through both filters to ensure we set the filterkey properly */ + audit_filter_syscall(current, context); + audit_filter_inodes(current, context); + if (context->current_state < AUDIT_RECORD_CONTEXT) + goto out; - context->in_syscall = 0; - context->prio = context->state == AUDIT_RECORD_CONTEXT ? ~0ULL : 0; + audit_return_fixup(context, success, return_code); + audit_log_exit(); - audit_free_module(context); - audit_free_names(context); - unroll_tree_refs(context, NULL, 0); - audit_free_aux(context); - context->aux = NULL; - context->aux_pids = NULL; - context->target_pid = 0; - context->target_sid = 0; - context->sockaddr_len = 0; - context->type = 0; - context->fds[0] = -1; - if (context->state != AUDIT_RECORD_CONTEXT) { - kfree(context->filterkey); - context->filterkey = NULL; - } +out: + audit_reset_context(context); } static inline void handle_one(const struct inode *inode) @@ -1905,7 +1976,7 @@ void __audit_getname(struct filename *name) struct audit_context *context = audit_context(); struct audit_names *n; - if (!context->in_syscall) + if (context->context == AUDIT_CTX_UNUSED) return; n = audit_alloc_name(context, AUDIT_TYPE_UNKNOWN); @@ -1977,7 +2048,7 @@ void __audit_inode(struct filename *name, const struct dentry *dentry, struct list_head *list = &audit_filter_list[AUDIT_FILTER_FS]; int i; - if (!context->in_syscall) + if (context->context == AUDIT_CTX_UNUSED) return; rcu_read_lock(); @@ -2095,7 +2166,7 @@ void __audit_inode_child(struct inode *parent, struct list_head *list = &audit_filter_list[AUDIT_FILTER_FS]; int i; - if (!context->in_syscall) + if (context->context == AUDIT_CTX_UNUSED) return; rcu_read_lock(); @@ -2194,7 +2265,7 @@ EXPORT_SYMBOL_GPL(__audit_inode_child); int auditsc_get_stamp(struct audit_context *ctx, struct timespec64 *t, unsigned int *serial) { - if (!ctx->in_syscall) + if (ctx->context == AUDIT_CTX_UNUSED) return 0; if (!ctx->serial) ctx->serial = audit_serial(); @@ -2686,7 +2757,7 @@ void audit_seccomp_actions_logged(const char *names, const char *old_names, struct list_head *audit_killed_trees(void) { struct audit_context *ctx = audit_context(); - if (likely(!ctx || !ctx->in_syscall)) + if (likely(!ctx || ctx->context == AUDIT_CTX_UNUSED)) return NULL; return &ctx->killed_trees; }
WARNING - This is a work in progress and should not be merged anywhere important. It is almost surely not complete, and while it probably compiles it likely hasn't been booted and will do terrible things. You have been warned. This patch adds basic auditing to io_uring operations, regardless of their context. This is accomplished by allocating audit_context structures for the io-wq worker and io_uring SQPOLL kernel threads as well as explicitly auditing the io_uring operations in io_issue_sqe(). The io_uring operations are audited using a new AUDIT_URINGOP record, an example is shown below: % <TODO - insert AUDIT_URINGOP record example> Thanks to Richard Guy Briggs for review and feedback. Signed-off-by: Paul Moore <paul@paul-moore.com> --- fs/io-wq.c | 4 + fs/io_uring.c | 11 +++ include/linux/audit.h | 17 ++++ include/uapi/linux/audit.h | 1 kernel/audit.h | 2 + kernel/auditsc.c | 173 ++++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 208 insertions(+) diff --git a/fs/io-wq.c b/fs/io-wq.c index 5361a9b4b47b..8af09a3336e0 100644 --- a/fs/io-wq.c +++ b/fs/io-wq.c @@ -16,6 +16,7 @@ #include <linux/rculist_nulls.h> #include <linux/cpu.h> #include <linux/tracehook.h> +#include <linux/audit.h> #include "io-wq.h" @@ -535,6 +536,8 @@ static int io_wqe_worker(void *data) snprintf(buf, sizeof(buf), "iou-wrk-%d", wq->task->pid); set_task_comm(current, buf); + audit_alloc_kernel(current); + while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) { long ret; @@ -573,6 +576,7 @@ static int io_wqe_worker(void *data) raw_spin_unlock_irq(&wqe->lock); } + audit_free(current); io_worker_exit(worker); return 0; } diff --git a/fs/io_uring.c b/fs/io_uring.c index e481ac8a757a..e9941d1ad8fd 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -78,6 +78,7 @@ #include <linux/task_work.h> #include <linux/pagemap.h> #include <linux/io_uring.h> +#include <linux/audit.h> #define CREATE_TRACE_POINTS #include <trace/events/io_uring.h> @@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) if (req->work.creds && req->work.creds != current_cred()) creds = override_creds(req->work.creds); + if (req->opcode < IORING_OP_LAST) + audit_uring_entry(req->opcode); + switch (req->opcode) { case IORING_OP_NOP: ret = io_nop(req, issue_flags); @@ -6211,6 +6215,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) break; } + if (req->opcode < IORING_OP_LAST) + audit_uring_exit(!ret, ret); + if (creds) revert_creds(creds); @@ -6827,6 +6834,8 @@ static int io_sq_thread(void *data) set_cpus_allowed_ptr(current, cpu_online_mask); current->flags |= PF_NO_SETAFFINITY; + audit_alloc_kernel(current); + mutex_lock(&sqd->lock); /* a user may had exited before the thread started */ io_run_task_work_head(&sqd->park_task_work); @@ -6916,6 +6925,8 @@ static int io_sq_thread(void *data) io_run_task_work_head(&sqd->park_task_work); mutex_unlock(&sqd->lock); + audit_free(current); + complete(&sqd->exited); do_exit(0); } diff --git a/include/linux/audit.h b/include/linux/audit.h index 82b7c1116a85..6a0c013bc7de 100644 --- a/include/linux/audit.h +++ b/include/linux/audit.h @@ -286,7 +286,10 @@ static inline int audit_signal_info(int sig, struct task_struct *t) /* These are defined in auditsc.c */ /* Public API */ extern int audit_alloc(struct task_struct *task); +extern int audit_alloc_kernel(struct task_struct *task); extern void __audit_free(struct task_struct *task); +extern void __audit_uring_entry(u8 op); +extern void __audit_uring_exit(int success, long code); extern void __audit_syscall_entry(int major, unsigned long a0, unsigned long a1, unsigned long a2, unsigned long a3); extern void __audit_syscall_exit(int ret_success, long ret_value); @@ -323,6 +326,16 @@ static inline void audit_free(struct task_struct *task) if (unlikely(task->audit_context)) __audit_free(task); } +static inline void audit_uring_entry(u8 op) +{ + if (unlikely(audit_context())) + __audit_uring_entry(op); +} +static inline void audit_uring_exit(int success, long code) +{ + if (unlikely(audit_context())) + __audit_uring_exit(success, code); +} static inline void audit_syscall_entry(int major, unsigned long a0, unsigned long a1, unsigned long a2, unsigned long a3) @@ -554,6 +567,10 @@ static inline int audit_alloc(struct task_struct *task) { return 0; } +static inline int audit_alloc_kernel(struct task_struct *task) +{ + return 0; +} static inline void audit_free(struct task_struct *task) { } static inline void audit_syscall_entry(int major, unsigned long a0, diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h index cd2d8279a5e4..b26e0c435e8b 100644 --- a/include/uapi/linux/audit.h +++ b/include/uapi/linux/audit.h @@ -118,6 +118,7 @@ #define AUDIT_TIME_ADJNTPVAL 1333 /* NTP value adjustment */ #define AUDIT_BPF 1334 /* BPF subsystem */ #define AUDIT_EVENT_LISTENER 1335 /* Task joined multicast read socket */ +#define AUDIT_URINGOP 1336 /* io_uring operation */ #define AUDIT_AVC 1400 /* SE Linux avc denial or grant */ #define AUDIT_SELINUX_ERR 1401 /* Internal SE Linux Errors */ diff --git a/kernel/audit.h b/kernel/audit.h index fba180de5912..50de827497ca 100644 --- a/kernel/audit.h +++ b/kernel/audit.h @@ -100,10 +100,12 @@ struct audit_context { enum { AUDIT_CTX_UNUSED, /* audit_context is currently unused */ AUDIT_CTX_SYSCALL, /* in use by syscall */ + AUDIT_CTX_URING, /* in use by io_uring */ } context; enum audit_state state, current_state; unsigned int serial; /* serial number for record */ int major; /* syscall number */ + int uring_op; /* uring operation */ struct timespec64 ctime; /* time of syscall entry */ unsigned long argv[4]; /* syscall arguments */ long return_code;/* syscall return code */ diff --git a/kernel/auditsc.c b/kernel/auditsc.c index cc89e9f9a753..729849d41631 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -953,6 +953,7 @@ static void audit_reset_context(struct audit_context *ctx) ctx->current_state = ctx->state; ctx->serial = 0; ctx->major = 0; + ctx->uring_op = 0; ctx->ctime = (struct timespec64){ .tv_sec = 0, .tv_nsec = 0 }; memset(ctx->argv, 0, sizeof(ctx->argv)); ctx->return_code = 0; @@ -1038,6 +1039,31 @@ int audit_alloc(struct task_struct *tsk) return 0; } +/** + * audit_alloc_kernel - allocate an audit_context for a kernel task + * @tsk: the kernel task + * + * Similar to the audit_alloc() function, but intended for kernel private + * threads. Returns zero on success, negative values on failure. + */ +int audit_alloc_kernel(struct task_struct *tsk) +{ + /* + * At the moment we are just going to call into audit_alloc() to + * simplify the code, but there two things to keep in mind with this + * approach: + * + * 1. Filtering internal kernel tasks is a bit laughable in almost all + * cases, but there is at least one case where there is a benefit: + * the '-a task,never' case allows the admin to effectively disable + * task auditing at runtime. + * + * 2. The {set,clear}_task_syscall_work() ops likely have zero effect + * on these internal kernel tasks, but they probably don't hurt either. + */ + return audit_alloc(tsk); +} + static inline void audit_free_context(struct audit_context *context) { /* resetting is extra work, but it is likely just noise */ @@ -1536,6 +1562,52 @@ static void audit_log_proctitle(void) audit_log_end(ab); } +/** + * audit_log_uring - generate a AUDIT_URINGOP record + * @ctx: the audit context + */ +static void audit_log_uring(struct audit_context *ctx) +{ + struct audit_buffer *ab; + const struct cred *cred; + + /* + * TODO: What do we log here? I'm tossing in a few things to start the + * conversation, but additional thought needs to go into this. + */ + + ab = audit_log_start(ctx, GFP_KERNEL, AUDIT_URINGOP); + if (!ab) + return; + cred = current_cred(); + audit_log_format(ab, "uring_op=%d", ctx->uring_op); + if (ctx->return_valid != AUDITSC_INVALID) + audit_log_format(ab, " success=%s exit=%ld", + (ctx->return_valid == AUDITSC_SUCCESS ? + "yes" : "no"), + ctx->return_code); + audit_log_format(ab, + " items=%d" + " ppid=%d pid=%d auid=%u uid=%u gid=%u" + " euid=%u suid=%u fsuid=%u" + " egid=%u sgid=%u fsgid=%u", + ctx->name_count, + task_ppid_nr(current), + task_tgid_nr(current), + from_kuid(&init_user_ns, audit_get_loginuid(current)), + from_kuid(&init_user_ns, cred->uid), + from_kgid(&init_user_ns, cred->gid), + from_kuid(&init_user_ns, cred->euid), + from_kuid(&init_user_ns, cred->suid), + from_kuid(&init_user_ns, cred->fsuid), + from_kgid(&init_user_ns, cred->egid), + from_kgid(&init_user_ns, cred->sgid), + from_kgid(&init_user_ns, cred->fsgid)); + audit_log_task_context(ab); + audit_log_key(ab, ctx->filterkey); + audit_log_end(ab); +} + static void audit_log_exit(void) { int i, call_panic = 0; @@ -1571,6 +1643,9 @@ static void audit_log_exit(void) audit_log_key(ab, context->filterkey); audit_log_end(ab); break; + case AUDIT_CTX_URING: + audit_log_uring(context); + break; default: BUG(); break; @@ -1740,6 +1815,104 @@ static void audit_return_fixup(struct audit_context *ctx, ctx->return_valid = (success ? AUDITSC_SUCCESS : AUDITSC_FAILURE); } +/** + * __audit_uring_entry - prepare the kernel task's audit context for io_uring + * @op: the io_uring opcode + * + * This is similar to audit_syscall_entry() but is intended for use by io_uring + * operations. + */ +void __audit_uring_entry(u8 op) +{ + struct audit_context *ctx = audit_context(); + + if (!audit_enabled || !ctx || ctx->state == AUDIT_DISABLED) + return; + + /* + * NOTE: It's possible that we can be called from the process' context + * before it returns to userspace, and before audit_syscall_exit() + * is called. In this case there is not much to do, just record + * the io_uring details and return. + */ + ctx->uring_op = op; + if (ctx->context == AUDIT_CTX_SYSCALL) + return; + + ctx->dummy = !audit_n_rules; + if (!ctx->dummy && ctx->state == AUDIT_BUILD_CONTEXT) + ctx->prio = 0; + + ctx->arch = syscall_get_arch(current); + ctx->context = AUDIT_CTX_URING; + ctx->current_state = ctx->state; + ktime_get_coarse_real_ts64(&ctx->ctime); +} + +/** + * __audit_uring_exit - wrap up the kernel task's audit context after io_uring + * @success: true/false value to indicate if the operation succeeded or not + * @code: operation return code + * + * This is similar to audit_syscall_exit() but is intended for use by io_uring + * operations. + */ +void __audit_uring_exit(int success, long code) +{ + struct audit_context *ctx = audit_context(); + + /* + * TODO: At some point we will likely want to filter on io_uring ops + * and other things similar to what we do for syscalls, but that + * is something for another day; just record what we can here. + */ + + if (!ctx || ctx->dummy) + goto out; + if (ctx->context == AUDIT_CTX_SYSCALL) { + /* + * NOTE: See the note in __audit_uring_entry() about the case + * where we may be called from process context before we + * return to userspace via audit_syscall_exit(). In this + * case we simply emit a URINGOP record and bail, the + * normal syscall exit handling will take care of + * everything else. + * It is also worth mentioning that when we are called, + * the current process creds may differ from the creds + * used during the normal syscall processing; keep that + * in mind if/when we move the record generation code. + */ + + /* + * We need to filter on the syscall info here to decide if we + * should emit a URINGOP record. I know it seems odd but this + * solves the problem where users have a filter to block *all* + * syscall records in the "exit" filter; we want to preserve + * the behavior here. + */ + audit_filter_syscall(current, ctx); + audit_filter_inodes(current, ctx); + if (ctx->current_state != AUDIT_RECORD_CONTEXT) + return; + + audit_log_uring(ctx); + return; + } + + /* this may generate CONFIG_CHANGE records */ + if (!list_empty(&ctx->killed_trees)) + audit_kill_trees(ctx); + + audit_filter_inodes(current, ctx); + if (ctx->current_state != AUDIT_RECORD_CONTEXT) + goto out; + audit_return_fixup(ctx, success, code); + audit_log_exit(); + +out: + audit_reset_context(ctx); +} + /** * __audit_syscall_entry - fill in an audit record at syscall entry * @major: major syscall type (function)
WARNING - This patch is intended only to aid in the initial dev/test of the audit/io_uring support, it is not intended to be merged. With this patch, you can emit io_uring operation audit records with the following commands (the first clears any blocking rules): % auditctl -D % auditctl -a exit,always -S io_uring_enter Signed-off-by: DO NOT COMMIT --- kernel/auditsc.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 729849d41631..d8aa2c690bf9 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -1898,6 +1898,10 @@ void __audit_uring_exit(int success, long code) audit_log_uring(ctx); return; } +#if 1 + /* XXX - temporary hack to force record generation */ + ctx->current_state = AUDIT_RECORD_CONTEXT; +#endif /* this may generate CONFIG_CHANGE records */ if (!list_empty(&ctx->killed_trees))
WARNING - This is a work in progress and should not be merged anywhere important. It is almost surely not complete, and while it probably compiles it likely hasn't been booted and will do terrible things. You have been warned. This patch adds basic audit io_uring filtering, using as much of the existing audit filtering infrastructure as possible. In order to do this we reuse the audit filter rule's syscall mask for the io_uring operation and we create a new filter for io_uring operations as AUDIT_FILTER_URING_EXIT/audit_filter_list[7]. <TODO - provide some additional guidance for the userspace tools> Signed-off-by: Paul Moore <paul@paul-moore.com> --- include/uapi/linux/audit.h | 3 +- kernel/auditfilter.c | 4 ++- kernel/auditsc.c | 65 ++++++++++++++++++++++++++++++++++---------- 3 files changed, 55 insertions(+), 17 deletions(-) diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h index b26e0c435e8b..621eb3c1076e 100644 --- a/include/uapi/linux/audit.h +++ b/include/uapi/linux/audit.h @@ -167,8 +167,9 @@ #define AUDIT_FILTER_EXCLUDE 0x05 /* Apply rule before record creation */ #define AUDIT_FILTER_TYPE AUDIT_FILTER_EXCLUDE /* obsolete misleading naming */ #define AUDIT_FILTER_FS 0x06 /* Apply rule at __audit_inode_child */ +#define AUDIT_FILTER_URING_EXIT 0x07 /* Apply rule at io_uring op exit */ -#define AUDIT_NR_FILTERS 7 +#define AUDIT_NR_FILTERS 8 #define AUDIT_FILTER_PREPEND 0x10 /* Prepend to front of list */ diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c index db2c6b59dfc3..c21119c00504 100644 --- a/kernel/auditfilter.c +++ b/kernel/auditfilter.c @@ -44,7 +44,8 @@ struct list_head audit_filter_list[AUDIT_NR_FILTERS] = { LIST_HEAD_INIT(audit_filter_list[4]), LIST_HEAD_INIT(audit_filter_list[5]), LIST_HEAD_INIT(audit_filter_list[6]), -#if AUDIT_NR_FILTERS != 7 + LIST_HEAD_INIT(audit_filter_list[7]), +#if AUDIT_NR_FILTERS != 8 #error Fix audit_filter_list initialiser #endif }; @@ -56,6 +57,7 @@ static struct list_head audit_rules_list[AUDIT_NR_FILTERS] = { LIST_HEAD_INIT(audit_rules_list[4]), LIST_HEAD_INIT(audit_rules_list[5]), LIST_HEAD_INIT(audit_rules_list[6]), + LIST_HEAD_INIT(audit_rules_list[7]), }; DEFINE_MUTEX(audit_filter_mutex); diff --git a/kernel/auditsc.c b/kernel/auditsc.c index d8aa2c690bf9..4f6ab34020fb 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -799,6 +799,35 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val) return rule->mask[word] & bit; } +/** + * audit_filter_uring - apply filters to an io_uring operation + * @tsk: associated task + * @ctx: audit context + */ +static void audit_filter_uring(struct task_struct *tsk, + struct audit_context *ctx) +{ + struct audit_entry *e; + enum audit_state state; + + if (auditd_test_task(tsk)) + return; + + rcu_read_lock(); + list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_URING_EXIT], + list) { + if (audit_in_mask(&e->rule, ctx->uring_op) && + audit_filter_rules(tsk, &e->rule, ctx, NULL, &state, + false)) { + rcu_read_unlock(); + ctx->current_state = state; + return; + } + } + rcu_read_unlock(); + return; +} + /* At syscall exit time, this filter is called if the audit_state is * not low enough that auditing cannot take place, but is also not * high enough that we already know we have to write an audit record @@ -1754,7 +1783,7 @@ static void audit_log_exit(void) * __audit_free - free a per-task audit context * @tsk: task whose audit context block to free * - * Called from copy_process and do_exit + * Called from copy_process, do_exit, and the io_uring code */ void __audit_free(struct task_struct *tsk) { @@ -1772,15 +1801,21 @@ void __audit_free(struct task_struct *tsk) * random task_struct that doesn't doesn't have any meaningful data we * need to log via audit_log_exit(). */ - if (tsk == current && !context->dummy && - context->context == AUDIT_CTX_SYSCALL) { + if (tsk == current && !context->dummy) { context->return_valid = AUDITSC_INVALID; context->return_code = 0; - - audit_filter_syscall(tsk, context); - audit_filter_inodes(tsk, context); - if (context->current_state == AUDIT_RECORD_CONTEXT) - audit_log_exit(); + if (context->context == AUDIT_CTX_SYSCALL) { + audit_filter_syscall(tsk, context); + audit_filter_inodes(tsk, context); + if (context->current_state == AUDIT_RECORD_CONTEXT) + audit_log_exit(); + } else if (context->context == AUDIT_CTX_URING) { + /* TODO: verify this case is real and valid */ + audit_filter_uring(tsk, context); + audit_filter_inodes(tsk, context); + if (context->current_state == AUDIT_RECORD_CONTEXT) + audit_log_uring(context); + } } audit_set_context(tsk, NULL); @@ -1861,12 +1896,6 @@ void __audit_uring_exit(int success, long code) { struct audit_context *ctx = audit_context(); - /* - * TODO: At some point we will likely want to filter on io_uring ops - * and other things similar to what we do for syscalls, but that - * is something for another day; just record what we can here. - */ - if (!ctx || ctx->dummy) goto out; if (ctx->context == AUDIT_CTX_SYSCALL) { @@ -1891,6 +1920,8 @@ void __audit_uring_exit(int success, long code) * the behavior here. */ audit_filter_syscall(current, ctx); + if (ctx->current_state != AUDIT_RECORD_CONTEXT) + audit_filter_uring(current, ctx); audit_filter_inodes(current, ctx); if (ctx->current_state != AUDIT_RECORD_CONTEXT) return; @@ -1899,7 +1930,9 @@ void __audit_uring_exit(int success, long code) return; } #if 1 - /* XXX - temporary hack to force record generation */ + /* XXX - temporary hack to force record generation, we are leaving this + * enabled, but if you want to actually test the filtering you + * need to disable this #if/#endif block */ ctx->current_state = AUDIT_RECORD_CONTEXT; #endif @@ -1907,6 +1940,8 @@ void __audit_uring_exit(int success, long code) if (!list_empty(&ctx->killed_trees)) audit_kill_trees(ctx); + /* run through both filters to ensure we set the filterkey properly */ + audit_filter_uring(current, ctx); audit_filter_inodes(current, ctx); if (ctx->current_state != AUDIT_RECORD_CONTEXT) goto out;
Extending the secure anonymous inode support to other subsystems requires that we have a secure anon_inode_getfile() variant in addition to the existing secure anon_inode_getfd() variant. Thankfully we can reuse the existing __anon_inode_getfile() function and just wrap it with the proper arguments. Signed-off-by: Paul Moore <paul@paul-moore.com> --- fs/anon_inodes.c | 29 +++++++++++++++++++++++++++++ include/linux/anon_inodes.h | 4 ++++ 2 files changed, 33 insertions(+) diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c index a280156138ed..e0c3e33c4177 100644 --- a/fs/anon_inodes.c +++ b/fs/anon_inodes.c @@ -148,6 +148,35 @@ struct file *anon_inode_getfile(const char *name, } EXPORT_SYMBOL_GPL(anon_inode_getfile); +/** + * anon_inode_getfile_secure - Like anon_inode_getfile(), but creates a new + * !S_PRIVATE anon inode rather than reuse the + * singleton anon inode and calls the + * inode_init_security_anon() LSM hook. This + * allows for both the inode to have its own + * security context and for the LSM to enforce + * policy on the inode's creation. + * + * @name: [in] name of the "class" of the new file + * @fops: [in] file operations for the new file + * @priv: [in] private data for the new file (will be file's private_data) + * @flags: [in] flags + * @context_inode: + * [in] the logical relationship with the new inode (optional) + * + * The LSM may use @context_inode in inode_init_security_anon(), but a + * reference to it is not held. Returns the newly created file* or an error + * pointer. See the anon_inode_getfile() documentation for more information. + */ +struct file *anon_inode_getfile_secure(const char *name, + const struct file_operations *fops, + void *priv, int flags, + const struct inode *context_inode) +{ + return __anon_inode_getfile(name, fops, priv, flags, + context_inode, true); +} + static int __anon_inode_getfd(const char *name, const struct file_operations *fops, void *priv, int flags, diff --git a/include/linux/anon_inodes.h b/include/linux/anon_inodes.h index 71881a2b6f78..5deaddbd7927 100644 --- a/include/linux/anon_inodes.h +++ b/include/linux/anon_inodes.h @@ -15,6 +15,10 @@ struct inode; struct file *anon_inode_getfile(const char *name, const struct file_operations *fops, void *priv, int flags); +struct file *anon_inode_getfile_secure(const char *name, + const struct file_operations *fops, + void *priv, int flags, + const struct inode *context_inode); int anon_inode_getfd(const char *name, const struct file_operations *fops, void *priv, int flags); int anon_inode_getfd_secure(const char *name,
Converting io_uring's anonymous inode to the secure anon inode API enables LSMs to enforce policy on the io_uring anonymous inodes if they chose to do so. This is an important first step towards providing the necessary mechanisms so that LSMs can apply security policy to io_uring operations. Signed-off-by: Paul Moore <paul@paul-moore.com> --- fs/io_uring.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index e9941d1ad8fd..6ff769c9b7d3 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -9562,8 +9562,8 @@ static struct file *io_uring_get_file(struct io_ring_ctx *ctx) return ERR_PTR(ret); #endif - file = anon_inode_getfile("[io_uring]", &io_uring_fops, ctx, - O_RDWR | O_CLOEXEC); + file = anon_inode_getfile_secure("[io_uring]", &io_uring_fops, ctx, + O_RDWR | O_CLOEXEC, NULL); #if defined(CONFIG_UNIX) if (IS_ERR(file)) { sock_release(ctx->ring_sock);
WARNING - This is a work in progress, this patch, including the description, may be incomplete or even incorrect. You have been warned. A full expalantion of io_uring is beyond the scope of this commit description, but in summary it is an asynchronous I/O mechanism which allows for I/O requests and the resulting data to be queued in memory mapped "rings" which are shared between the kernel and userspace. Optionally, io_uring offers the ability for applications to spawn kernel threads to dequeue I/O requests from the ring and submit the requests in the kernel, helping to minimize the syscall overhead. Rings are accessed in userspace by memory mapping a file descriptor provided by the io_uring_setup(2), and can be shared between applications as one might do with any open file descriptor. Finally, process credentials can be registered with a given ring and any process with access to that ring can submit I/O requests using any of the registered credentials. While the io_uring functionality is widely recognized as offering a vastly improved, and high performing asynchronous I/O mechanism, its ability to allow processes to submit I/O requests with credentials other than its own presents a challenge to LSMs. When a process creates a new io_uring ring the ring's credentials are inhertied from the calling process; if this ring is shared with another process operating with different credentials there is the potential to bypass the LSMs security policy. Similarly, registering credentials with a given ring allows any process with access to that ring to submit I/O requests with those credentials. In an effort to allow LSMs to apply security policy to io_uring I/O operations, this patch adds two new LSM hooks. These hooks, in conjunction with the LSM anonymous inode support previously submitted, allow an LSM to apply access control policy to the sharing of io_uring rings as well as any io_uring credential changes requested by a process. The new LSM hooks are described below: * int security_uring_override_creds(cred) Controls if the current task, executing an io_uring operation, is allowed to override it's credentials with @cred. In cases where the current task is a user application, the current credentials will be those of the user application. In cases where the current task is a kernel thread servicing io_uring requests the current credentials will be those of the io_uring ring (inherited from the process that created the ring). * int security_uring_sqpoll(void) Controls if the current task is allowed to create an io_uring polling thread (IORING_SETUP_SQPOLL). Without a SQPOLL thread in the kernel processes must submit I/O requests via io_uring_enter(2) which allows us to compare any requested credential changes against the application making the request. With a SQPOLL thread, we can no longer compare requested credential changes against the application making the request, the comparison is made against the ring's credentials. Signed-off-by: Paul Moore <paul@paul-moore.com> --- fs/io_uring.c | 10 ++++++++++ include/linux/lsm_hook_defs.h | 5 +++++ include/linux/lsm_hooks.h | 13 +++++++++++++ include/linux/security.h | 16 ++++++++++++++++ security/security.c | 12 ++++++++++++ 5 files changed, 56 insertions(+) diff --git a/fs/io_uring.c b/fs/io_uring.c index 6ff769c9b7d3..d18a594c4c6e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -79,6 +79,7 @@ #include <linux/pagemap.h> #include <linux/io_uring.h> #include <linux/audit.h> +#include <linux/security.h> #define CREATE_TRACE_POINTS #include <trace/events/io_uring.h> @@ -6537,6 +6538,11 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req, if (!req->work.creds) return -EINVAL; get_cred(req->work.creds); + ret = security_uring_override_creds(req->work.creds); + if (ret) { + put_cred(req->work.creds); + return ret; + } } state = &ctx->submit_state; @@ -7963,6 +7969,10 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx, struct io_sq_data *sqd; bool attached; + ret = security_uring_sqpoll(); + if (ret) + return ret; + sqd = io_get_sq_data(p, &attached); if (IS_ERR(sqd)) { ret = PTR_ERR(sqd); diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h index 04c01794de83..88971b3da3c0 100644 --- a/include/linux/lsm_hook_defs.h +++ b/include/linux/lsm_hook_defs.h @@ -403,3 +403,8 @@ LSM_HOOK(void, LSM_RET_VOID, perf_event_free, struct perf_event *event) LSM_HOOK(int, 0, perf_event_read, struct perf_event *event) LSM_HOOK(int, 0, perf_event_write, struct perf_event *event) #endif /* CONFIG_PERF_EVENTS */ + +#ifdef CONFIG_IO_URING +LSM_HOOK(int, 0, uring_override_creds, const struct cred *new) +LSM_HOOK(int, 0, uring_sqpoll, void) +#endif /* CONFIG_IO_URING */ diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h index 5c4c5c0602cb..0eb0ae95c4c4 100644 --- a/include/linux/lsm_hooks.h +++ b/include/linux/lsm_hooks.h @@ -1557,6 +1557,19 @@ * Read perf_event security info if allowed. * @perf_event_write: * Write perf_event security info if allowed. + * + * Security hooks for io_uring + * + * @uring_override_creds: + * Check if the current task, executing an io_uring operation, is allowed + * to override it's credentials with @new. + * + * @new: the new creds to use + * + * @uring_sqpoll: + * Check whether the current task is allowed to spawn a io_uring polling + * thread (IORING_SETUP_SQPOLL). + * */ union security_list_options { #define LSM_HOOK(RET, DEFAULT, NAME, ...) RET (*NAME)(__VA_ARGS__); diff --git a/include/linux/security.h b/include/linux/security.h index 06f7c50ce77f..263a744c839f 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -2037,4 +2037,20 @@ static inline int security_perf_event_write(struct perf_event *event) #endif /* CONFIG_SECURITY */ #endif /* CONFIG_PERF_EVENTS */ +#ifdef CONFIG_IO_URING +#ifdef CONFIG_SECURITY +extern int security_uring_override_creds(const struct cred *new); +extern int security_uring_sqpoll(void); +#else +static inline int security_uring_override_creds(const struct cred *new) +{ + return 0; +} +static inline int security_uring_sqpoll(void) +{ + return 0; +} +#endif /* CONFIG_SECURITY */ +#endif /* CONFIG_IO_URING */ + #endif /* ! __LINUX_SECURITY_H */ diff --git a/security/security.c b/security/security.c index b38155b2de83..3d6b3a2cacf5 100644 --- a/security/security.c +++ b/security/security.c @@ -2624,3 +2624,15 @@ int security_perf_event_write(struct perf_event *event) return call_int_hook(perf_event_write, 0, event); } #endif /* CONFIG_PERF_EVENTS */ + +#ifdef CONFIG_IO_URING +int security_uring_override_creds(const struct cred *new) +{ + return call_int_hook(uring_override_creds, 0, new); +} + +int security_uring_sqpoll(void) +{ + return call_int_hook(uring_sqpoll, 0); +} +#endif /* CONFIG_IO_URING */
WARNING - This is a work in progress, this patch, including the description, may be incomplete or even incorrect. You have been warned. This patch implements two new io_uring access controls, specifically support for controlling the io_uring "personalities" and IORING_SETUP_SQPOLL. Controlling the sharing of io_urings themselves is handled via the normal file/inode labeling and sharing mechanisms. The io_uring { override_creds } permission restricts which domains the subject domain can use to override it's own credentials. Granting a domain the io_uring { override_creds } permission allows it to impersonate another domain in io_uring operations. The io_uring { sqpoll } permission restricts which domains can create asynchronous io_uring polling threads. This is important from a security perspective as operations queued by this asynchronous thread inherit the credentials of the thread creator by default; if an io_uring is shared across process/domain boundaries this could result in one domain impersonating another. Controlling the creation of sqpoll threads, and the sharing of io_urings across processes, allow policy authors to restrict the ability of one domain to impersonate another via io_uring. As a quick summary, this patch adds a new object class with two permissions: io_uring { override_creds sqpoll } These permissions can be seen in the two simple policy statements below: allow domA_t domB_t : io_uring { override_creds }; allow domA_t self : io_uring { sqpoll }; Signed-off-by: Paul Moore <paul@paul-moore.com> --- security/selinux/hooks.c | 67 +++++++++++++++++++++++++++++++++++ security/selinux/include/classmap.h | 2 + 2 files changed, 69 insertions(+) diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index eaea837d89d1..696130972e4d 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -7115,6 +7115,68 @@ static int selinux_perf_event_write(struct perf_event *event) } #endif +#ifdef CONFIG_IO_URING + +#if 1 +/* XXX - dump_creds() is for debugging only, remove before committing */ +static void dump_creds(const struct cred *cred) +{ + u32 secid; + char *ctx = NULL; + unsigned int ctx_len = 0; + + security_cred_getsecid(cred, &secid); + if (secid) + security_secid_to_secctx(secid, &ctx, &ctx_len); + + printk(KERN_ERR "CRED: ->security is %s\n", (ctx ? : "(error)")); + + if (ctx) + security_release_secctx(ctx, ctx_len); +} +#endif + +/** + * selinux_uring_override_creds - check the requested cred override + * @new: the target creds + * + * Check to see if the current task is allowed to override it's credentials + * to service an io_uring operation. + */ +int selinux_uring_override_creds(const struct cred *new) +{ +#if 1 + /* XXX - debug only code, remove before committing */ + pr_err("PMD: selinux_uring_override_creds()\n"); + pr_err("PMD: >>> CURRENT\n"); + dump_creds(current_cred()); + pr_err("PMD: >>> NEW\n"); + dump_creds(new); +#endif + return avc_has_perm(&selinux_state, current_sid(), cred_sid(new), + SECCLASS_IO_URING, IO_URING__OVERRIDE_CREDS, NULL); +} + +/** + * selinux_uring_sqpoll - check if a io_uring polling thread can be created + * + * Check to see if the current task is allowed to create a new io_uring + * kernel polling thread. + */ +int selinux_uring_sqpoll(void) +{ + int sid = current_sid(); +#if 1 + /* XXX - debug only code, remove before committing */ + pr_err("PMD: selinux_uring_sqpoll()\n"); + pr_err("PMD: >>> CURRENT\n"); + dump_creds(current_cred()); +#endif + return avc_has_perm(&selinux_state, sid, sid, + SECCLASS_IO_URING, IO_URING__SQPOLL, NULL); +} +#endif /* CONFIG_IO_URING */ + /* * IMPORTANT NOTE: When adding new hooks, please be careful to keep this order: * 1. any hooks that don't belong to (2.) or (3.) below, @@ -7353,6 +7415,11 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = { LSM_HOOK_INIT(perf_event_write, selinux_perf_event_write), #endif +#ifdef CONFIG_IO_URING + LSM_HOOK_INIT(uring_override_creds, selinux_uring_override_creds), + LSM_HOOK_INIT(uring_sqpoll, selinux_uring_sqpoll), +#endif + LSM_HOOK_INIT(locked_down, selinux_lockdown), /* diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h index 62d19bccf3de..3314ad72279d 100644 --- a/security/selinux/include/classmap.h +++ b/security/selinux/include/classmap.h @@ -252,6 +252,8 @@ struct security_class_mapping secclass_map[] = { { "integrity", "confidentiality", NULL } }, { "anon_inode", { COMMON_FILE_PERMS, NULL } }, + { "io_uring", + { "override_creds", "sqpoll", NULL } }, { NULL } };
From: Casey Schaufler <casey@schaufler-ca.com> Add Smack privilege checks for io_uring. Use CAP_MAC_OVERRIDE for the override_creds case and CAP_MAC_ADMIN for creating a polling thread. These choices are based on conjecture regarding the intent of the surrounding code. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Paul Moore <paul@paul-moore.com> --- security/smack/smack_lsm.c | 64 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 64 insertions(+) diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c index 223a6da0e6dc..f6423c0096e9 100644 --- a/security/smack/smack_lsm.c +++ b/security/smack/smack_lsm.c @@ -4691,6 +4691,66 @@ static int smack_dentry_create_files_as(struct dentry *dentry, int mode, return 0; } +#ifdef CONFIG_IO_URING +/** + * smack_uring_override_creds - Is io_uring cred override allowed? + * @new: the target creds + * + * Check to see if the current task is allowed to override it's credentials + * to service an io_uring operation. + */ +int smack_uring_override_creds(const struct cred *new) +{ + struct task_smack *tsp = smack_cred(current_cred()); + struct task_smack *nsp = smack_cred(new); + +#if 1 + if (tsp->smk_task == nsp->smk_task) + pr_info("%s: Smack matches %s\n", __func__, + tsp->smk_task->smk_known); + else + pr_info("%s: Smack override check %s to %s\n", __func__, + tsp->smk_task->smk_known, nsp->smk_task->smk_known); +#endif + /* + * Allow the degenerate case where the new Smack value is + * the same as the current Smack value. + */ + if (tsp->smk_task == nsp->smk_task) + return 0; + +#if 1 + pr_info("%s: Smack sqpoll %s\n", __func__, + smack_privileged_cred(CAP_MAC_OVERRIDE, current_cred()) ? + "ok by Smack" : "disallowed (No CAP_MAC_OVERRIDE)"); +#endif + if (smack_privileged_cred(CAP_MAC_OVERRIDE, current_cred())) + return 0; + + return -EPERM; +} + +/** + * smack_uring_sqpoll - check if a io_uring polling thread can be created + * + * Check to see if the current task is allowed to create a new io_uring + * kernel polling thread. + */ +int smack_uring_sqpoll(void) +{ +#if 1 + pr_info("%s: Smack new ring %s\n", __func__, + smack_privileged_cred(CAP_MAC_ADMIN, current_cred()) ? + "ok by Smack" : "disallowed (No CAP_MAC_ADMIN)"); +#endif + if (smack_privileged_cred(CAP_MAC_ADMIN, current_cred())) + return 0; + + return -EPERM; +} + +#endif /* CONFIG_IO_URING */ + struct lsm_blob_sizes smack_blob_sizes __lsm_ro_after_init = { .lbs_cred = sizeof(struct task_smack), .lbs_file = sizeof(struct smack_known *), @@ -4843,6 +4903,10 @@ static struct security_hook_list smack_hooks[] __lsm_ro_after_init = { LSM_HOOK_INIT(inode_copy_up, smack_inode_copy_up), LSM_HOOK_INIT(inode_copy_up_xattr, smack_inode_copy_up_xattr), LSM_HOOK_INIT(dentry_create_files_as, smack_dentry_create_files_as), +#ifdef CONFIG_IO_URING + LSM_HOOK_INIT(uring_override_creds, smack_uring_override_creds), + LSM_HOOK_INIT(uring_sqpoll, smack_uring_sqpoll), +#endif };
On 5/21/21 10:49 PM, Paul Moore wrote: > WARNING - This is a work in progress and should not be merged > anywhere important. It is almost surely not complete, and while it > probably compiles it likely hasn't been booted and will do terrible > things. You have been warned. > > This patch adds basic auditing to io_uring operations, regardless of > their context. This is accomplished by allocating audit_context > structures for the io-wq worker and io_uring SQPOLL kernel threads > as well as explicitly auditing the io_uring operations in > io_issue_sqe(). The io_uring operations are audited using a new > AUDIT_URINGOP record, an example is shown below: > > % <TODO - insert AUDIT_URINGOP record example> > > Thanks to Richard Guy Briggs for review and feedback. > > Signed-off-by: Paul Moore <paul@paul-moore.com> > --- [...] > diff --git a/fs/io_uring.c b/fs/io_uring.c > index e481ac8a757a..e9941d1ad8fd 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -78,6 +78,7 @@ > #include <linux/task_work.h> > #include <linux/pagemap.h> > #include <linux/io_uring.h> > +#include <linux/audit.h> > > #define CREATE_TRACE_POINTS > #include <trace/events/io_uring.h> > @@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) > if (req->work.creds && req->work.creds != current_cred()) > creds = override_creds(req->work.creds); > > + if (req->opcode < IORING_OP_LAST) always true at this point > + audit_uring_entry(req->opcode); So, it adds two if's with memory loads (i.e. current->audit_context) per request in one of the hottest functions here... No way, nack Maybe, if it's dynamically compiled into like kprobes if it's _really_ used. > + > switch (req->opcode) { > case IORING_OP_NOP: > ret = io_nop(req, issue_flags); > @@ -6211,6 +6215,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) > break; > } > > + if (req->opcode < IORING_OP_LAST) > + audit_uring_exit(!ret, ret); > + > if (creds) > revert_creds(creds); -- Pavel Begunkov
On 2021/05/22 6:49, Paul Moore wrote:
> I've provided the SELinux
> implementation, Casey has been nice enough to provide a Smack patch,
> and John is working on an AppArmor patch as I write this. I've
> mentioned this work to the other LSM maintainers that I believe might
> be affected but I have not heard back from anyone else at this point.
I don't think any change is required for TOMOYO, for TOMOYO does not
use "struct cred"->security where [RFC PATCH 8/9] and [RFC PATCH 9/9]
are addressing, and TOMOYO does not call kernel/audit*.c functions.
On Fri, May 21, 2021 at 8:53 PM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
> On 2021/05/22 6:49, Paul Moore wrote:
> > I've provided the SELinux
> > implementation, Casey has been nice enough to provide a Smack patch,
> > and John is working on an AppArmor patch as I write this. I've
> > mentioned this work to the other LSM maintainers that I believe might
> > be affected but I have not heard back from anyone else at this point.
>
> I don't think any change is required for TOMOYO, for TOMOYO does not
> use "struct cred"->security where [RFC PATCH 8/9] and [RFC PATCH 9/9]
> are addressing, and TOMOYO does not call kernel/audit*.c functions.
Good to know, thank you for checking.
--
paul moore
www.paul-moore.com
On Fri, May 21, 2021 at 8:22 PM Pavel Begunkov <asml.silence@gmail.com> wrote: > On 5/21/21 10:49 PM, Paul Moore wrote: > > WARNING - This is a work in progress and should not be merged > > anywhere important. It is almost surely not complete, and while it > > probably compiles it likely hasn't been booted and will do terrible > > things. You have been warned. > > > > This patch adds basic auditing to io_uring operations, regardless of > > their context. This is accomplished by allocating audit_context > > structures for the io-wq worker and io_uring SQPOLL kernel threads > > as well as explicitly auditing the io_uring operations in > > io_issue_sqe(). The io_uring operations are audited using a new > > AUDIT_URINGOP record, an example is shown below: > > > > % <TODO - insert AUDIT_URINGOP record example> > > > > Thanks to Richard Guy Briggs for review and feedback. > > > > Signed-off-by: Paul Moore <paul@paul-moore.com> > > --- > [...] > > diff --git a/fs/io_uring.c b/fs/io_uring.c > > index e481ac8a757a..e9941d1ad8fd 100644 > > --- a/fs/io_uring.c > > +++ b/fs/io_uring.c > > @@ -78,6 +78,7 @@ > > #include <linux/task_work.h> > > #include <linux/pagemap.h> > > #include <linux/io_uring.h> > > +#include <linux/audit.h> > > > > #define CREATE_TRACE_POINTS > > #include <trace/events/io_uring.h> > > @@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) > > if (req->work.creds && req->work.creds != current_cred()) > > creds = override_creds(req->work.creds); > > > > + if (req->opcode < IORING_OP_LAST) > > always true at this point I placed the opcode check before the audit call because the switch statement below which handles the operation dispatching has a 'ret = -EINVAL' for the default case, implying that there are some paths where an invalid opcode could be passed into the function. Obviously if that is not the case and you can guarantee that req->opcode will always be valid we can easily drop the check prior to the audit call. > > + audit_uring_entry(req->opcode); > > So, it adds two if's with memory loads (i.e. current->audit_context) > per request in one of the hottest functions here... No way, nack > > Maybe, if it's dynamically compiled into like kprobes if it's > _really_ used. I'm open to suggestions on how to tweak the io_uring/audit integration, if you don't like what I've proposed in this patchset, lets try to come up with a solution that is more palatable. If you were going to add audit support for these io_uring operations, how would you propose we do it? Not being able to properly audit io_uring operations is going to be a significant issue for a chunk of users, if it isn't already, we need to work to find a solution to this problem. Unfortunately I don't think dynamically inserting audit calls is something that would meet the needs of the audit community (I fear it would run afoul of the various security certifications), and it definitely isn't something that we support at present. -- paul moore www.paul-moore.com
On 5/22/21 3:36 AM, Paul Moore wrote: > On Fri, May 21, 2021 at 8:22 PM Pavel Begunkov <asml.silence@gmail.com> wrote: >> On 5/21/21 10:49 PM, Paul Moore wrote: [...] >>> >>> + if (req->opcode < IORING_OP_LAST) >> >> always true at this point > > I placed the opcode check before the audit call because the switch > statement below which handles the operation dispatching has a 'ret = > -EINVAL' for the default case, implying that there are some paths > where an invalid opcode could be passed into the function. Obviously > if that is not the case and you can guarantee that req->opcode will > always be valid we can easily drop the check prior to the audit call. It is always true at this point, would be completely broken otherwise >>> + audit_uring_entry(req->opcode); >> >> So, it adds two if's with memory loads (i.e. current->audit_context) >> per request in one of the hottest functions here... No way, nack >> >> Maybe, if it's dynamically compiled into like kprobes if it's >> _really_ used. > > I'm open to suggestions on how to tweak the io_uring/audit > integration, if you don't like what I've proposed in this patchset, > lets try to come up with a solution that is more palatable. If you > were going to add audit support for these io_uring operations, how > would you propose we do it? Not being able to properly audit io_uring > operations is going to be a significant issue for a chunk of users, if > it isn't already, we need to work to find a solution to this problem. Who knows. First of all, seems CONFIG_AUDIT is enabled by default for many popular distributions, so I assume that is not compiled out. What are use cases for audit? Always running I guess? Putting aside compatibility problems, it sounds that with the amount of overhead it adds there is no much profit in using io_uring in the first place. Is that so? __audit_uring_exit() -> audit_filter_syscall() -> for (audit_list) if (...) audit_filter_rules() -> ... -> audit_filter_inodes() -> ... > Unfortunately I don't think dynamically inserting audit calls is > something that would meet the needs of the audit community (I fear it > would run afoul of the various security certifications), and it > definitely isn't something that we support at present. I see -- Pavel Begunkov
On Sun, May 23, 2021 at 4:26 PM Pavel Begunkov <asml.silence@gmail.com> wrote: > On 5/22/21 3:36 AM, Paul Moore wrote: > > On Fri, May 21, 2021 at 8:22 PM Pavel Begunkov <asml.silence@gmail.com> wrote: > >> On 5/21/21 10:49 PM, Paul Moore wrote: > [...] > >>> > >>> + if (req->opcode < IORING_OP_LAST) > >> > >> always true at this point > > > > I placed the opcode check before the audit call because the switch > > statement below which handles the operation dispatching has a 'ret = > > -EINVAL' for the default case, implying that there are some paths > > where an invalid opcode could be passed into the function. Obviously > > if that is not the case and you can guarantee that req->opcode will > > always be valid we can easily drop the check prior to the audit call. > > It is always true at this point, would be completely broken > otherwise Understood, I was just pointing out an oddity in the code. I just dropped the checks from my local tree, you'll see it in the next draft of the patchset. > >> So, it adds two if's with memory loads (i.e. current->audit_context) > >> per request in one of the hottest functions here... No way, nack > >> > >> Maybe, if it's dynamically compiled into like kprobes if it's > >> _really_ used. > > > > I'm open to suggestions on how to tweak the io_uring/audit > > integration, if you don't like what I've proposed in this patchset, > > lets try to come up with a solution that is more palatable. If you > > were going to add audit support for these io_uring operations, how > > would you propose we do it? Not being able to properly audit io_uring > > operations is going to be a significant issue for a chunk of users, if > > it isn't already, we need to work to find a solution to this problem. > > Who knows. First of all, seems CONFIG_AUDIT is enabled by default > for many popular distributions, so I assume that is not compiled out. > > What are use cases for audit? Always running I guess? Audit has been around for quite some time now, and it's goal is to provide a mechanism for logging "security relevant" information in such a way that it meets the needs of various security certification efforts. Traditional Linux event logging, e.g. syslog and the like, does not meet these requirements and changing them would likely affect the usability for those who are not required to be compliant with these security certifications. The Linux audit subsystem allows Linux to be used in places it couldn't be used otherwise (or rather makes it a *lot* easier). That said, audit is not for everyone, and we have build time and runtime options to help make life easier. Beyond simply disabling audit at compile time a number of Linux distributions effectively shortcut audit at runtime by adding a "never" rule to the audit filter, for example: % auditctl -a task,never > Putting aside compatibility problems, it sounds that with the amount of overhead > it adds there is no much profit in using io_uring in the first place. > Is that so? Well, if audit alone erased all of the io_uring advantages we should just rip io_uring out of the kernel and people can just disable audit instead ;) I believe there are people who would like to use io_uring and are also required to use a kernel with audit, either due to the need to run a distribution kernel or the need to capture security information in the audit stream. I'm hoping that we can find a solution for these users; if we don't we are asking this group to choose either io_uring or audit, and that is something I would like to avoid. -- paul moore www.paul-moore.com
On 5/24/21 8:59 PM, Paul Moore wrote: > On Sun, May 23, 2021 at 4:26 PM Pavel Begunkov <asml.silence@gmail.com> wrote: >> On 5/22/21 3:36 AM, Paul Moore wrote: >>> On Fri, May 21, 2021 at 8:22 PM Pavel Begunkov <asml.silence@gmail.com> wrote: >>>> On 5/21/21 10:49 PM, Paul Moore wrote: >> [...] >>>>> >>>>> + if (req->opcode < IORING_OP_LAST) >>>> >>>> always true at this point >>> >>> I placed the opcode check before the audit call because the switch >>> statement below which handles the operation dispatching has a 'ret = >>> -EINVAL' for the default case, implying that there are some paths >>> where an invalid opcode could be passed into the function. Obviously >>> if that is not the case and you can guarantee that req->opcode will >>> always be valid we can easily drop the check prior to the audit call. >> >> It is always true at this point, would be completely broken >> otherwise > > Understood, I was just pointing out an oddity in the code. I just > dropped the checks from my local tree, you'll see it in the next draft > of the patchset. > >>>> So, it adds two if's with memory loads (i.e. current->audit_context) >>>> per request in one of the hottest functions here... No way, nack >>>> >>>> Maybe, if it's dynamically compiled into like kprobes if it's >>>> _really_ used. >>> >>> I'm open to suggestions on how to tweak the io_uring/audit >>> integration, if you don't like what I've proposed in this patchset, >>> lets try to come up with a solution that is more palatable. If you >>> were going to add audit support for these io_uring operations, how >>> would you propose we do it? Not being able to properly audit io_uring >>> operations is going to be a significant issue for a chunk of users, if >>> it isn't already, we need to work to find a solution to this problem. >> >> Who knows. First of all, seems CONFIG_AUDIT is enabled by default >> for many popular distributions, so I assume that is not compiled out. >> >> What are use cases for audit? Always running I guess? > > Audit has been around for quite some time now, and it's goal is to > provide a mechanism for logging "security relevant" information in > such a way that it meets the needs of various security certification > efforts. Traditional Linux event logging, e.g. syslog and the like, > does not meet these requirements and changing them would likely affect > the usability for those who are not required to be compliant with > these security certifications. The Linux audit subsystem allows Linux > to be used in places it couldn't be used otherwise (or rather makes it > a *lot* easier). > > That said, audit is not for everyone, and we have build time and > runtime options to help make life easier. Beyond simply disabling > audit at compile time a number of Linux distributions effectively > shortcut audit at runtime by adding a "never" rule to the audit > filter, for example: > > % auditctl -a task,never > >> Putting aside compatibility problems, it sounds that with the amount of overhead >> it adds there is no much profit in using io_uring in the first place. >> Is that so? > > Well, if audit alone erased all of the io_uring advantages we should > just rip io_uring out of the kernel and people can just disable audit > instead ;) Hah, if we add a simple idle "feature" like for (i=0;i<1000000;i+) {;} and it would destroy all the performance, let's throw useless Linux kernel then! If seriously, it's more of an open question to me, how much overhead it adds if enabled (with typical filters/options/etc). Btw, do you really need two hooks -- before and right after execution? > I believe there are people who would like to use io_uring and are also > required to use a kernel with audit, either due to the need to run a > distribution kernel or the need to capture security information in the > audit stream. I'm hoping that we can find a solution for these users; > if we don't we are asking this group to choose either io_uring or > audit, and that is something I would like to avoid. -- Pavel Begunkov
On Tue, May 25, 2021 at 4:27 AM Pavel Begunkov <asml.silence@gmail.com> wrote: > On 5/24/21 8:59 PM, Paul Moore wrote: > > On Sun, May 23, 2021 at 4:26 PM Pavel Begunkov <asml.silence@gmail.com> wrote: > >> On 5/22/21 3:36 AM, Paul Moore wrote: > >>> On Fri, May 21, 2021 at 8:22 PM Pavel Begunkov <asml.silence@gmail.com> wrote: > >>>> On 5/21/21 10:49 PM, Paul Moore wrote: > >> [...] > >>>>> > >>>>> + if (req->opcode < IORING_OP_LAST) > >>>> > >>>> always true at this point > >>> > >>> I placed the opcode check before the audit call because the switch > >>> statement below which handles the operation dispatching has a 'ret = > >>> -EINVAL' for the default case, implying that there are some paths > >>> where an invalid opcode could be passed into the function. Obviously > >>> if that is not the case and you can guarantee that req->opcode will > >>> always be valid we can easily drop the check prior to the audit call. > >> > >> It is always true at this point, would be completely broken > >> otherwise > > > > Understood, I was just pointing out an oddity in the code. I just > > dropped the checks from my local tree, you'll see it in the next draft > > of the patchset. > > > >>>> So, it adds two if's with memory loads (i.e. current->audit_context) > >>>> per request in one of the hottest functions here... No way, nack > >>>> > >>>> Maybe, if it's dynamically compiled into like kprobes if it's > >>>> _really_ used. > >>> > >>> I'm open to suggestions on how to tweak the io_uring/audit > >>> integration, if you don't like what I've proposed in this patchset, > >>> lets try to come up with a solution that is more palatable. If you > >>> were going to add audit support for these io_uring operations, how > >>> would you propose we do it? Not being able to properly audit io_uring > >>> operations is going to be a significant issue for a chunk of users, if > >>> it isn't already, we need to work to find a solution to this problem. > >> > >> Who knows. First of all, seems CONFIG_AUDIT is enabled by default > >> for many popular distributions, so I assume that is not compiled out. > >> > >> What are use cases for audit? Always running I guess? > > > > Audit has been around for quite some time now, and it's goal is to > > provide a mechanism for logging "security relevant" information in > > such a way that it meets the needs of various security certification > > efforts. Traditional Linux event logging, e.g. syslog and the like, > > does not meet these requirements and changing them would likely affect > > the usability for those who are not required to be compliant with > > these security certifications. The Linux audit subsystem allows Linux > > to be used in places it couldn't be used otherwise (or rather makes it > > a *lot* easier). > > > > That said, audit is not for everyone, and we have build time and > > runtime options to help make life easier. Beyond simply disabling > > audit at compile time a number of Linux distributions effectively > > shortcut audit at runtime by adding a "never" rule to the audit > > filter, for example: > > > > % auditctl -a task,never > > > >> Putting aside compatibility problems, it sounds that with the amount of overhead > >> it adds there is no much profit in using io_uring in the first place. > >> Is that so? > > > > Well, if audit alone erased all of the io_uring advantages we should > > just rip io_uring out of the kernel and people can just disable audit > > instead ;) > > Hah, if we add a simple idle "feature" like > > for (i=0;i<1000000;i+) {;} > > and it would destroy all the performance, let's throw useless > Linux kernel then! > > If seriously, it's more of an open question to me, how much overhead > it adds if enabled (with typical filters/options/etc). I am very hesitant to define a "typical" audit configuration as it really is dependent on the user's security requirements as well as the particular solution/environment. Any configuration I pick will be "wrong" for a lot of audit users :) As I see it, users will likely find themselves in one of three performance buckets: * io_uring enabled, CONFIG_AUDIT=n For those who are already trying to get that last 1% of performance from their kernel and are willing to give up most everything else to get it this is the obvious choice. Needless to say there should not be any audit related impact in this case (especially since we've removed that req->opcode checks prior to the audit calls). * io_uring enabled, CONFIG_AUDIT=y, audit "task,never" runtime config [side note: I made some tweaks to audit_uring_entry() to move the audit_enabled check there so we no longer need to call into __audit_uring_entry() in this fastpath case. Similarly audit_uring_exit() now does an audit_dummy_context() check instead of simply checking audit_context(), this should avoid calling into __audit_uring_exit() when the io_uring op is not being audited.] I'm guessing that most distro users will fall into this bucket. Here the task's audit_context should always be NULL, both in the syscall context and sqpoll context, so io_uring would call into the inline audit_uring_entry() function and return without calling into __audit_uring_entry() (see the "side note" above). The audit_uring_exit() call would behave in a similar manner. * io_uring enabled, CONFIG_AUDIT=y, some sort of runtime audit config For obvious reasons this case has the most performance impact and as mentioned above it can vary quite a bit. In my opinion this is the least important bucket from a performance perspective as the user clearly has a need for the security information that audit provides and enabling that functionality in io_uring is critical to allowing the user to take advantage of io_uring. The performance of io_uring is impacted, but it should still be more performant than synchronous I/O and it allows the user to run their existing io_uring applications/code-paths. > Btw, do you really need two hooks -- before and right after > execution? Yes, the entry/before hook does the setup for the io_uring op (very similar to a syscall entry point) and the exit/after hook is what does the audit record generation based on what took place during the io_uring operation. > > I believe there are people who would like to use io_uring and are also > > required to use a kernel with audit, either due to the need to run a > > distribution kernel or the need to capture security information in the > > audit stream. I'm hoping that we can find a solution for these users; > > if we don't we are asking this group to choose either io_uring or > > audit, and that is something I would like to avoid. I'm leaving this old paragraph here because I really think this is important to the discussion. If we can't find a solution here we would need to make io_uring and audit mutually exclusive and I don't think that is in the best interests of the users, and would surely create a headache for the distros. -- paul moore www.paul-moore.com
On 5/24/21 1:59 PM, Paul Moore wrote:
> That said, audit is not for everyone, and we have build time and
> runtime options to help make life easier. Beyond simply disabling
> audit at compile time a number of Linux distributions effectively
> shortcut audit at runtime by adding a "never" rule to the audit
> filter, for example:
>
> % auditctl -a task,never
As has been brought up, the issue we're facing is that distros have
CONFIG_AUDIT=y and hence the above is the best real world case outside
of people doing custom kernels. My question would then be how much
overhead the above will add, considering it's an entry/exit call per op.
If auditctl is turned off, what is the expectation in turns of overhead?
My gut feeling tells me it's likely going to be too much. Keep in mind
that we're sometimes doing millions of operations per second, per core.
aio never had any audit logging as far as I can tell. I think it'd make
a lot more sense to selectively enable audit logging only for opcodes
that we care about. File open/create/unlink/mkdir etc, that kind of
thing. File level operations that people would care about logging. Would
they care about logging a buffer registration or a polled read from a
device/file? I highly doubt it, and we don't do that for alternative
methods either. Doesn't really make sense for a lot of the other
operations, imho.
--
Jens Axboe
On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote: > On 5/24/21 1:59 PM, Paul Moore wrote: > > That said, audit is not for everyone, and we have build time and > > runtime options to help make life easier. Beyond simply disabling > > audit at compile time a number of Linux distributions effectively > > shortcut audit at runtime by adding a "never" rule to the audit > > filter, for example: > > > > % auditctl -a task,never > > As has been brought up, the issue we're facing is that distros have > CONFIG_AUDIT=y and hence the above is the best real world case outside > of people doing custom kernels. My question would then be how much > overhead the above will add, considering it's an entry/exit call per op. > If auditctl is turned off, what is the expectation in turns of overhead? I commented on that case in my last email to Pavel, but I'll try to go over it again in a little more detail. As we discussed earlier in this thread, we can skip the req->opcode check before both the _entry and _exit calls, so we are left with just the bare audit calls in the io_uring code. As the _entry and _exit functions are small, I've copied them and their supporting functions below and I'll try to explain what would happen in CONFIG_AUDIT=y, "task,never" case. + static inline struct audit_context *audit_context(void) + { + return current->audit_context; + } + static inline bool audit_dummy_context(void) + { + void *p = audit_context(); + return !p || *(int *)p; + } + static inline void audit_uring_entry(u8 op) + { + if (unlikely(audit_enabled && audit_context())) + __audit_uring_entry(op); + } We have one if statement where the conditional checks on two individual conditions. The first (audit_enabled) is simply a check to see if anyone has "turned on" auditing at runtime; historically this worked rather well, and still does in a number of places, but ever since systemd has taken to forcing audit on regardless of the admin's audit configuration it is less useful. The second (audit_context()) is a check to see if an audit_context has been allocated for the current task. In the case of "task,never" current->audit_context will be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath will never be called. Worst case here is checking the value of audit_enabled and current->audit_context. Depending on which you think is more likely we can change the order of the check so that the current->audit_context check is first if you feel that is more likely to be NULL than audit_enabled is to be false (it may be that way now). + static inline void audit_uring_exit(int success, long code) + { + if (unlikely(!audit_dummy_context())) + __audit_uring_exit(success, code); + } The exit call is very similar to the entry call, but in the "task,never" case it is very simple as the first check to be performed is the current->audit_context check which we know to be NULL. The __audit_uring_exit() slowpath will never be called. > aio never had any audit logging as far as I can tell. I think it'd make > a lot more sense to selectively enable audit logging only for opcodes > that we care about. File open/create/unlink/mkdir etc, that kind of > thing. File level operations that people would care about logging. Would > they care about logging a buffer registration or a polled read from a > device/file? I highly doubt it, and we don't do that for alternative > methods either. Doesn't really make sense for a lot of the other > operations, imho. We would need to check with the current security requirements (there are distro people on the linux-audit list that keep track of that stuff), but looking at the opcodes right now my gut feeling is that most of the opcodes would be considered "security relevant" so selective auditing might not be that useful in practice. It would definitely clutter the code and increase the chances that new opcodes would not be properly audited when they are merged. -- paul moore www.paul-moore.com
On 5/26/21 3:04 AM, Paul Moore wrote: > On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote: >> On 5/24/21 1:59 PM, Paul Moore wrote: >>> That said, audit is not for everyone, and we have build time and >>> runtime options to help make life easier. Beyond simply disabling >>> audit at compile time a number of Linux distributions effectively >>> shortcut audit at runtime by adding a "never" rule to the audit >>> filter, for example: >>> >>> % auditctl -a task,never >> >> As has been brought up, the issue we're facing is that distros have >> CONFIG_AUDIT=y and hence the above is the best real world case outside >> of people doing custom kernels. My question would then be how much >> overhead the above will add, considering it's an entry/exit call per op. >> If auditctl is turned off, what is the expectation in turns of overhead? > > I commented on that case in my last email to Pavel, but I'll try to go > over it again in a little more detail. > > As we discussed earlier in this thread, we can skip the req->opcode > check before both the _entry and _exit calls, so we are left with just > the bare audit calls in the io_uring code. As the _entry and _exit > functions are small, I've copied them and their supporting functions > below and I'll try to explain what would happen in CONFIG_AUDIT=y, > "task,never" case. > > + static inline struct audit_context *audit_context(void) > + { > + return current->audit_context; > + } > > + static inline bool audit_dummy_context(void) > + { > + void *p = audit_context(); > + return !p || *(int *)p; > + } > > + static inline void audit_uring_entry(u8 op) > + { > + if (unlikely(audit_enabled && audit_context())) > + __audit_uring_entry(op); > + } I'd rather agree that it's my cycle-picking. The case I care about is CONFIG_AUDIT=y (because everybody enable it), and io_uring tracing _not_ enabled at runtime. If enabled let them suffer the overhead, it will probably dip down the performance So, for the case I care about it's two of if (unlikely(audit_enabled && current->audit_context)) in the hot path. load-test-jump + current, so it will be around 7x2 instructions. We can throw away audit_enabled as you say systemd already enables it, that will give 4x2 instructions including 2 conditional jumps. That's not great at all. And that's why I brought up the question about need of pre and post hooks and whether can be combined. Would be just 4 instructions and that is ok (ish). > We would need to check with the current security requirements (there > are distro people on the linux-audit list that keep track of that > stuff), but looking at the opcodes right now my gut feeling is that > most of the opcodes would be considered "security relevant" so > selective auditing might not be that useful in practice. It would > definitely clutter the code and increase the chances that new opcodes > would not be properly audited when they are merged. I'm curious, why it's enabled by many distros by default? Are there use cases they use? Tempting to add AUDIT_IOURING=default N, but won't work I guess -- Pavel Begunkov
On Wed, May 26, 2021 at 6:19 AM Pavel Begunkov <asml.silence@gmail.com> wrote: > On 5/26/21 3:04 AM, Paul Moore wrote: > > On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote: > >> On 5/24/21 1:59 PM, Paul Moore wrote: > >>> That said, audit is not for everyone, and we have build time and > >>> runtime options to help make life easier. Beyond simply disabling > >>> audit at compile time a number of Linux distributions effectively > >>> shortcut audit at runtime by adding a "never" rule to the audit > >>> filter, for example: > >>> > >>> % auditctl -a task,never > >> > >> As has been brought up, the issue we're facing is that distros have > >> CONFIG_AUDIT=y and hence the above is the best real world case outside > >> of people doing custom kernels. My question would then be how much > >> overhead the above will add, considering it's an entry/exit call per op. > >> If auditctl is turned off, what is the expectation in turns of overhead? > > > > I commented on that case in my last email to Pavel, but I'll try to go > > over it again in a little more detail. > > > > As we discussed earlier in this thread, we can skip the req->opcode > > check before both the _entry and _exit calls, so we are left with just > > the bare audit calls in the io_uring code. As the _entry and _exit > > functions are small, I've copied them and their supporting functions > > below and I'll try to explain what would happen in CONFIG_AUDIT=y, > > "task,never" case. > > > > + static inline struct audit_context *audit_context(void) > > + { > > + return current->audit_context; > > + } > > > > + static inline bool audit_dummy_context(void) > > + { > > + void *p = audit_context(); > > + return !p || *(int *)p; > > + } > > > > + static inline void audit_uring_entry(u8 op) > > + { > > + if (unlikely(audit_enabled && audit_context())) > > + __audit_uring_entry(op); > > + } > > I'd rather agree that it's my cycle-picking. The case I care about > is CONFIG_AUDIT=y (because everybody enable it), and io_uring > tracing _not_ enabled at runtime. If enabled let them suffer > the overhead, it will probably dip down the performance > > So, for the case I care about it's two of > > if (unlikely(audit_enabled && current->audit_context)) > > in the hot path. load-test-jump + current, so it will > be around 7x2 instructions. We can throw away audit_enabled > as you say systemd already enables it, that will give > 4x2 instructions including 2 conditional jumps. We've basically got it down to the equivalent of two "current->audit_context != NULL" checks in the case where audit is built into the kernel but disabled at runtime, e.g. CONFIG_AUDIT=y and "task,never". I'm at a loss for how we can lower the overhead any further, but I'm open to suggestions. > That's not great at all. And that's why I brought up > the question about need of pre and post hooks and whether > can be combined. Would be just 4 instructions and that is > ok (ish). As discussed previously in this thread that isn't really an option from an audit perspective. > > We would need to check with the current security requirements (there > > are distro people on the linux-audit list that keep track of that > > stuff), but looking at the opcodes right now my gut feeling is that > > most of the opcodes would be considered "security relevant" so > > selective auditing might not be that useful in practice. It would > > definitely clutter the code and increase the chances that new opcodes > > would not be properly audited when they are merged. > > I'm curious, why it's enabled by many distros by default? Are there > use cases they use? We've already talked about certain users and environments where audit is an important requirement, e.g. public sector, health care, financial institutions, etc.; without audit Linux wouldn't be an option for these users, at least not without heavy modification, out-of-tree/ISV patches, etc. I currently don't have any direct ties to any distros, "Enterprise" or otherwise, but in the past it has been my experience that distros much prefer to have a single kernel build to address the needs of all their users. In the few cases I have seen where a second kernel build is supported it is usually for hardware enablement. I'm sure there are other cases too, I just haven't seen them personally; the big distros definitely seem to have a strong desire to limit the number of supported kernel configs/builds. > Tempting to add AUDIT_IOURING=default N, but won't work I guess One of the nice things about audit is that it can give you a history of what a user did on a system, which is very important for a number of use cases. If we selectively disable audit for certain subsystems we create a blind spot in the audit log, and in the case of io_uring this can be a very serious blind spot. I fear that if we can't come to some agreement here we will need to make io_uring and audit mutually exclusive at build time which would be awful; forcing many distros to either make a hard choice or carry out-of-tree patches. -- paul moore www.paul-moore.com
Hi Paul,
> #define CREATE_TRACE_POINTS
> #include <trace/events/io_uring.h>
> @@ -6537,6 +6538,11 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
> if (!req->work.creds)
> return -EINVAL;
> get_cred(req->work.creds);
> + ret = security_uring_override_creds(req->work.creds);
> + if (ret) {
> + put_cred(req->work.creds);
> + return ret;
> + }
Why are you calling this per requests, shouldn't this be done in
io_register_personality()?
I'm also not sure if this really gains anything as io_register_personality()
only captures the value of get_current_cred(), so the process already has changed to
the credentials (at least once for the io_uring_register(IORING_REGISTER_PERSONALITY)
call).
metze
Paul Moore <paul@paul-moore.com> writes: > Also, any pointers to easy-to-run io_uring tests would be helpful. I > am particularly interested in tests which make use of the personality > option, share urings across process boundaries, and make use of the > sqpoll functionality. liburing contains a test suite: https://git.kernel.dk/cgit/liburing/ You can run it via 'make runtests'. Cheers, Jeff
On Wednesday, May 26, 2021 10:38:38 AM EDT Paul Moore wrote: > > > We would need to check with the current security requirements (there > > > are distro people on the linux-audit list that keep track of that > > > stuff), The requirements generally care about resource access. File open, connect, accept, etc. We don't care about read/write itself as that would flood the analysis. > > > but looking at the opcodes right now my gut feeling is that > > > most of the opcodes would be considered "security relevant" so > > > selective auditing might not be that useful in practice. I'd say maybe a quarter to a third look interesting. > > > It would > > > definitely clutter the code and increase the chances that new opcodes > > > would not be properly audited when they are merged. There is that... > > I'm curious, why it's enabled by many distros by default? Are there > > use cases they use? > > We've already talked about certain users and environments where audit > is an important requirement, e.g. public sector, health care, > financial institutions, etc.; without audit Linux wouldn't be an > option for these users, People that care about auditing are under regulatory mandates. They care more about the audit event than the performance. Imagine you have a system with some brand new medical discovery. You want to know anyone who accesses the information in case it gets leaked out. You don't care how slow the system gets - you simply *have* to know everyone who's looked at the documents. -Steve
Am 26.05.21 um 16:38 schrieb Paul Moore:
> On Wed, May 26, 2021 at 6:19 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>> On 5/26/21 3:04 AM, Paul Moore wrote:
>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>>>> That said, audit is not for everyone, and we have build time and
>>>>> runtime options to help make life easier. Beyond simply disabling
>>>>> audit at compile time a number of Linux distributions effectively
>>>>> shortcut audit at runtime by adding a "never" rule to the audit
>>>>> filter, for example:
>>>>>
>>>>> % auditctl -a task,never
>>>>
>>>> As has been brought up, the issue we're facing is that distros have
>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
>>>> of people doing custom kernels. My question would then be how much
>>>> overhead the above will add, considering it's an entry/exit call per op.
>>>> If auditctl is turned off, what is the expectation in turns of overhead?
>>>
>>> I commented on that case in my last email to Pavel, but I'll try to go
>>> over it again in a little more detail.
>>>
>>> As we discussed earlier in this thread, we can skip the req->opcode
>>> check before both the _entry and _exit calls, so we are left with just
>>> the bare audit calls in the io_uring code. As the _entry and _exit
>>> functions are small, I've copied them and their supporting functions
>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
>>> "task,never" case.
>>>
>>> + static inline struct audit_context *audit_context(void)
>>> + {
>>> + return current->audit_context;
>>> + }
>>>
>>> + static inline bool audit_dummy_context(void)
>>> + {
>>> + void *p = audit_context();
>>> + return !p || *(int *)p;
>>> + }
>>>
>>> + static inline void audit_uring_entry(u8 op)
>>> + {
>>> + if (unlikely(audit_enabled && audit_context()))
>>> + __audit_uring_entry(op);
>>> + }
>>
>> I'd rather agree that it's my cycle-picking. The case I care about
>> is CONFIG_AUDIT=y (because everybody enable it), and io_uring
>> tracing _not_ enabled at runtime. If enabled let them suffer
>> the overhead, it will probably dip down the performance
>>
>> So, for the case I care about it's two of
>>
>> if (unlikely(audit_enabled && current->audit_context))
>>
>> in the hot path. load-test-jump + current, so it will
>> be around 7x2 instructions. We can throw away audit_enabled
>> as you say systemd already enables it, that will give
>> 4x2 instructions including 2 conditional jumps.
>
> We've basically got it down to the equivalent of two
> "current->audit_context != NULL" checks in the case where audit is
> built into the kernel but disabled at runtime, e.g. CONFIG_AUDIT=y and
> "task,never". I'm at a loss for how we can lower the overhead any
> further, but I'm open to suggestions.
>
>> That's not great at all. And that's why I brought up
>> the question about need of pre and post hooks and whether
>> can be combined. Would be just 4 instructions and that is
>> ok (ish).
>
> As discussed previously in this thread that isn't really an option
> from an audit perspective.
>
>>> We would need to check with the current security requirements (there
>>> are distro people on the linux-audit list that keep track of that
>>> stuff), but looking at the opcodes right now my gut feeling is that
>>> most of the opcodes would be considered "security relevant" so
>>> selective auditing might not be that useful in practice. It would
>>> definitely clutter the code and increase the chances that new opcodes
>>> would not be properly audited when they are merged.
>>
>> I'm curious, why it's enabled by many distros by default? Are there
>> use cases they use?
>
> We've already talked about certain users and environments where audit
> is an important requirement, e.g. public sector, health care,
> financial institutions, etc.; without audit Linux wouldn't be an
> option for these users, at least not without heavy modification,
> out-of-tree/ISV patches, etc. I currently don't have any direct ties
> to any distros, "Enterprise" or otherwise, but in the past it has been
> my experience that distros much prefer to have a single kernel build
> to address the needs of all their users. In the few cases I have seen
> where a second kernel build is supported it is usually for hardware
> enablement. I'm sure there are other cases too, I just haven't seen
> them personally; the big distros definitely seem to have a strong
> desire to limit the number of supported kernel configs/builds.
>
>> Tempting to add AUDIT_IOURING=default N, but won't work I guess
>
> One of the nice things about audit is that it can give you a history
> of what a user did on a system, which is very important for a number
> of use cases. If we selectively disable audit for certain subsystems
> we create a blind spot in the audit log, and in the case of io_uring
> this can be a very serious blind spot. I fear that if we can't come
> to some agreement here we will need to make io_uring and audit
> mutually exclusive at build time which would be awful; forcing many
> distros to either make a hard choice or carry out-of-tree patches.
I'm wondering why it's not enough to have the native auditing just to happen.
E.g. all (I have checked RECVMSG,SENDMSG,SEND and CONNECT) socket related io_uring opcodes
already go via security_socket_{recvmsg,sendmsg,connect}()
IORING_OP_OPENAT* goes via do_filp_open() which is in common with the open[at[2]]() syscalls
and should also trigger audit_inode() and security_file_open().
So why is there anything special needed for io_uring (now that the native worker threads are used)?
Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall
would do? If so, I think that should just be fixed...
Additional LSM based restrictions could be hooked into the io_check_restriction() path
and setup at io_uring_setup() or early io_uring_register() time.
What do you think?
metze
On 2021-05-26 17:17, Stefan Metzmacher wrote: > > Am 26.05.21 um 16:38 schrieb Paul Moore: > > On Wed, May 26, 2021 at 6:19 AM Pavel Begunkov <asml.silence@gmail.com> wrote: > >> On 5/26/21 3:04 AM, Paul Moore wrote: > >>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote: > >>>> On 5/24/21 1:59 PM, Paul Moore wrote: > >>>>> That said, audit is not for everyone, and we have build time and > >>>>> runtime options to help make life easier. Beyond simply disabling > >>>>> audit at compile time a number of Linux distributions effectively > >>>>> shortcut audit at runtime by adding a "never" rule to the audit > >>>>> filter, for example: > >>>>> > >>>>> % auditctl -a task,never > >>>> > >>>> As has been brought up, the issue we're facing is that distros have > >>>> CONFIG_AUDIT=y and hence the above is the best real world case outside > >>>> of people doing custom kernels. My question would then be how much > >>>> overhead the above will add, considering it's an entry/exit call per op. > >>>> If auditctl is turned off, what is the expectation in turns of overhead? > >>> > >>> I commented on that case in my last email to Pavel, but I'll try to go > >>> over it again in a little more detail. > >>> > >>> As we discussed earlier in this thread, we can skip the req->opcode > >>> check before both the _entry and _exit calls, so we are left with just > >>> the bare audit calls in the io_uring code. As the _entry and _exit > >>> functions are small, I've copied them and their supporting functions > >>> below and I'll try to explain what would happen in CONFIG_AUDIT=y, > >>> "task,never" case. > >>> > >>> + static inline struct audit_context *audit_context(void) > >>> + { > >>> + return current->audit_context; > >>> + } > >>> > >>> + static inline bool audit_dummy_context(void) > >>> + { > >>> + void *p = audit_context(); > >>> + return !p || *(int *)p; > >>> + } > >>> > >>> + static inline void audit_uring_entry(u8 op) > >>> + { > >>> + if (unlikely(audit_enabled && audit_context())) > >>> + __audit_uring_entry(op); > >>> + } > >> > >> I'd rather agree that it's my cycle-picking. The case I care about > >> is CONFIG_AUDIT=y (because everybody enable it), and io_uring > >> tracing _not_ enabled at runtime. If enabled let them suffer > >> the overhead, it will probably dip down the performance > >> > >> So, for the case I care about it's two of > >> > >> if (unlikely(audit_enabled && current->audit_context)) > >> > >> in the hot path. load-test-jump + current, so it will > >> be around 7x2 instructions. We can throw away audit_enabled > >> as you say systemd already enables it, that will give > >> 4x2 instructions including 2 conditional jumps. > > > > We've basically got it down to the equivalent of two > > "current->audit_context != NULL" checks in the case where audit is > > built into the kernel but disabled at runtime, e.g. CONFIG_AUDIT=y and > > "task,never". I'm at a loss for how we can lower the overhead any > > further, but I'm open to suggestions. > > > >> That's not great at all. And that's why I brought up > >> the question about need of pre and post hooks and whether > >> can be combined. Would be just 4 instructions and that is > >> ok (ish). > > > > As discussed previously in this thread that isn't really an option > > from an audit perspective. > > > >>> We would need to check with the current security requirements (there > >>> are distro people on the linux-audit list that keep track of that > >>> stuff), but looking at the opcodes right now my gut feeling is that > >>> most of the opcodes would be considered "security relevant" so > >>> selective auditing might not be that useful in practice. It would > >>> definitely clutter the code and increase the chances that new opcodes > >>> would not be properly audited when they are merged. > >> > >> I'm curious, why it's enabled by many distros by default? Are there > >> use cases they use? > > > > We've already talked about certain users and environments where audit > > is an important requirement, e.g. public sector, health care, > > financial institutions, etc.; without audit Linux wouldn't be an > > option for these users, at least not without heavy modification, > > out-of-tree/ISV patches, etc. I currently don't have any direct ties > > to any distros, "Enterprise" or otherwise, but in the past it has been > > my experience that distros much prefer to have a single kernel build > > to address the needs of all their users. In the few cases I have seen > > where a second kernel build is supported it is usually for hardware > > enablement. I'm sure there are other cases too, I just haven't seen > > them personally; the big distros definitely seem to have a strong > > desire to limit the number of supported kernel configs/builds. > > > >> Tempting to add AUDIT_IOURING=default N, but won't work I guess > > > > One of the nice things about audit is that it can give you a history > > of what a user did on a system, which is very important for a number > > of use cases. If we selectively disable audit for certain subsystems > > we create a blind spot in the audit log, and in the case of io_uring > > this can be a very serious blind spot. I fear that if we can't come > > to some agreement here we will need to make io_uring and audit > > mutually exclusive at build time which would be awful; forcing many > > distros to either make a hard choice or carry out-of-tree patches. > > I'm wondering why it's not enough to have the native auditing just to happen. The audit context needs to be set up for each event. This happens in audit_syslog_entry and audit_syslog_exit. > E.g. all (I have checked RECVMSG,SENDMSG,SEND and CONNECT) socket related io_uring opcodes > already go via security_socket_{recvmsg,sendmsg,connect}() > > IORING_OP_OPENAT* goes via do_filp_open() which is in common with the open[at[2]]() syscalls > and should also trigger audit_inode() and security_file_open(). These are extra hooks to grab operation-specific (syscall) parameters. > So why is there anything special needed for io_uring (now that the native worker threads are used)? Because syscall has been bypassed by a memory-mapped work queue. > Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall > would do? If so, I think that should just be fixed... This is by design to speed it up. This is what Paul's iouring entry and exit hooks do. > Additional LSM based restrictions could be hooked into the io_check_restriction() path > and setup at io_uring_setup() or early io_uring_register() time. > > What do you think? > > metze - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
> I'm wondering why it's not enough to have the native auditing just to happen.
>
> E.g. all (I have checked RECVMSG,SENDMSG,SEND and CONNECT) socket related io_uring opcodes
> already go via security_socket_{recvmsg,sendmsg,connect}()
>
> IORING_OP_OPENAT* goes via do_filp_open() which is in common with the open[at[2]]() syscalls
> and should also trigger audit_inode() and security_file_open().
>
> So why is there anything special needed for io_uring (now that the native worker threads are used)?
>
> Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall
> would do? If so, I think that should just be fixed...
stefan's points crossed my mind as well.
but assuming iouring buy-in is required, from a design perspective,
rather than inserting these audit conditionals in the hotpath,
wouldn't a layering model work better?
aka enabling auditing changes the function entry point into io_uring
and passes operations through an auditing layer, then back to the main
entry point. then there is no
cost to audit disabled code, and you just force audit to pay whatever
double processing cost that entails.
V
On 5/26/2021 8:49 AM, Victor Stewart wrote: >> I'm wondering why it's not enough to have the native auditing just to happen. >> >> E.g. all (I have checked RECVMSG,SENDMSG,SEND and CONNECT) socket related io_uring opcodes >> already go via security_socket_{recvmsg,sendmsg,connect}() >> >> IORING_OP_OPENAT* goes via do_filp_open() which is in common with the open[at[2]]() syscalls >> and should also trigger audit_inode() and security_file_open(). >> >> So why is there anything special needed for io_uring (now that the native worker threads are used)? >> >> Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall >> would do? If so, I think that should just be fixed... > stefan's points crossed my mind as well. > > but assuming iouring buy-in is required, from a design perspective, > rather than inserting these audit conditionals in the hotpath, > wouldn't a layering model work better? > aka enabling auditing changes the function entry point into io_uring > and passes operations through an auditing layer, then back to the main > entry point. then there is no > cost to audit disabled code, and you just force audit to pay whatever > double processing cost that entails. There seems to be an assumption that the audit code isn't concerned about processing cost. This is complete nonsense. While an audit system has to be accurate and complete (hence encompassing io_uring) it also has to be sufficiently performant for the system to be useful when it is enabled. It would have been real handy had the audit and LSM requirements been designed into io_uring. The performance implications could have been addressed up front. Instead it all has to be retrofit. io_uring, Audit and LSM need to perform well and the best way to ensure that the combination works is cooperation between the developers. Any response that includes the word "just" is unlikely to be helpful. > > V > > -- > Linux-audit mailing list > Linux-audit@redhat.com > https://listman.redhat.com/mailman/listinfo/linux-audit >
On 5/25/21 8:04 PM, Paul Moore wrote: > On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote: >> On 5/24/21 1:59 PM, Paul Moore wrote: >>> That said, audit is not for everyone, and we have build time and >>> runtime options to help make life easier. Beyond simply disabling >>> audit at compile time a number of Linux distributions effectively >>> shortcut audit at runtime by adding a "never" rule to the audit >>> filter, for example: >>> >>> % auditctl -a task,never >> >> As has been brought up, the issue we're facing is that distros have >> CONFIG_AUDIT=y and hence the above is the best real world case outside >> of people doing custom kernels. My question would then be how much >> overhead the above will add, considering it's an entry/exit call per op. >> If auditctl is turned off, what is the expectation in turns of overhead? > > I commented on that case in my last email to Pavel, but I'll try to go > over it again in a little more detail. > > As we discussed earlier in this thread, we can skip the req->opcode > check before both the _entry and _exit calls, so we are left with just > the bare audit calls in the io_uring code. As the _entry and _exit > functions are small, I've copied them and their supporting functions > below and I'll try to explain what would happen in CONFIG_AUDIT=y, > "task,never" case. > > + static inline struct audit_context *audit_context(void) > + { > + return current->audit_context; > + } > > + static inline bool audit_dummy_context(void) > + { > + void *p = audit_context(); > + return !p || *(int *)p; > + } > > + static inline void audit_uring_entry(u8 op) > + { > + if (unlikely(audit_enabled && audit_context())) > + __audit_uring_entry(op); > + } > > We have one if statement where the conditional checks on two > individual conditions. The first (audit_enabled) is simply a check to > see if anyone has "turned on" auditing at runtime; historically this > worked rather well, and still does in a number of places, but ever > since systemd has taken to forcing audit on regardless of the admin's > audit configuration it is less useful. The second (audit_context()) > is a check to see if an audit_context has been allocated for the > current task. In the case of "task,never" current->audit_context will > be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath > will never be called. > > Worst case here is checking the value of audit_enabled and > current->audit_context. Depending on which you think is more likely > we can change the order of the check so that the > current->audit_context check is first if you feel that is more likely > to be NULL than audit_enabled is to be false (it may be that way now). > > + static inline void audit_uring_exit(int success, long code) > + { > + if (unlikely(!audit_dummy_context())) > + __audit_uring_exit(success, code); > + } > > The exit call is very similar to the entry call, but in the > "task,never" case it is very simple as the first check to be performed > is the current->audit_context check which we know to be NULL. The > __audit_uring_exit() slowpath will never be called. I actually ran some numbers this morning. The test base is 5.13+, and CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline test and the test with this series applied. I used your git branch as of this morning. The test case is my usual peak perf test, which is random reads at QD=128 and using polled IO. It's a single core test, not threaded. I ran two different tests - one was having a thread just do the IO, the other is using SQPOLL to do the IO for us. The device is capable than more IOPS than a single core can deliver, so we're CPU limited in this test. Hence it's a good test case as it does actual work, and shows software overhead quite nicely. Runs are very stable (less than 0.5% difference between runs on the same base), yet I did average 4 runs. Kernel SQPOLL IOPS Perf diff --------------------------------------------------------- 5.13 0 3029872 0.0% 5.13 1 3031056 0.0% 5.13 + audit 0 2894160 -4.5% 5.13 + audit 1 2886168 -4.8% That's an immediate drop in perf of almost 5%. Looking at a quick profile of it (nothing fancy, just checking for 'audit' in the profile) shows this: + 2.17% io_uring [kernel.vmlinux] [k] __audit_uring_entry + 0.71% io_uring [kernel.vmlinux] [k] __audit_uring_exit 0.07% io_uring [kernel.vmlinux] [k] __audit_syscall_entry 0.02% io_uring [kernel.vmlinux] [k] __audit_syscall_exit Note that this is with _no_ rules! >> aio never had any audit logging as far as I can tell. I think it'd make >> a lot more sense to selectively enable audit logging only for opcodes >> that we care about. File open/create/unlink/mkdir etc, that kind of >> thing. File level operations that people would care about logging. Would >> they care about logging a buffer registration or a polled read from a >> device/file? I highly doubt it, and we don't do that for alternative >> methods either. Doesn't really make sense for a lot of the other >> operations, imho. > > We would need to check with the current security requirements (there > are distro people on the linux-audit list that keep track of that > stuff), but looking at the opcodes right now my gut feeling is that > most of the opcodes would be considered "security relevant" so > selective auditing might not be that useful in practice. It would > definitely clutter the code and increase the chances that new opcodes > would not be properly audited when they are merged. We don't audit read/write from aio, as mentioned. In the past two decades, I take it that hasn't been a concern? I agree that some opcodes should _definitely_ be audited. Things like opening a file, closing a file, removing/creating a file, mount, etc. But normal read/write, I think that's just utter noise and not useful at all. Auditing on a per-opcode basis is trivial, io_uring already has provisions for flagging opcode requirements and this would just be another one. -- Jens Axboe
On 5/26/21 9:49 AM, Richard Guy Briggs wrote: >> So why is there anything special needed for io_uring (now that the >> native worker threads are used)? > > Because syscall has been bypassed by a memory-mapped work queue. I don't follow this one at all, that's just the delivery mechanism if you choose to use SQPOLL. If you do, then a thread sibling of the original task does the actual system call. There's no magic involved there, and the tasks are related. So care to expand on that? >> Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall >> would do? If so, I think that should just be fixed... > > This is by design to speed it up. This is what Paul's iouring entry and > exit hooks do. As far as I can tell, we're doing double logging at that point, if the syscall helper does the audit as well. We'll get something logging the io_uring opcode (eg IORING_OP_OPENAT2), then log again when we call the fs helper. That's _assuming_ that we always hit the logging part when we call into the vfs, but that's something that can be updated to be true and kept an eye on for future additions. Why is the double logging useful? It only tells you that the invocation was via io_uring as the delivery mechanism rather than the usual system call, but the effect is the same - the file is opened, for example. I feel like I'm missing something here, or the other side is. Or both! -- Jens Axboe
On 5/26/21 11:15 AM, Jens Axboe wrote:
> On 5/25/21 8:04 PM, Paul Moore wrote:
>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
>>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>>> That said, audit is not for everyone, and we have build time and
>>>> runtime options to help make life easier. Beyond simply disabling
>>>> audit at compile time a number of Linux distributions effectively
>>>> shortcut audit at runtime by adding a "never" rule to the audit
>>>> filter, for example:
>>>>
>>>> % auditctl -a task,never
>>>
>>> As has been brought up, the issue we're facing is that distros have
>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
>>> of people doing custom kernels. My question would then be how much
>>> overhead the above will add, considering it's an entry/exit call per op.
>>> If auditctl is turned off, what is the expectation in turns of overhead?
>>
>> I commented on that case in my last email to Pavel, but I'll try to go
>> over it again in a little more detail.
>>
>> As we discussed earlier in this thread, we can skip the req->opcode
>> check before both the _entry and _exit calls, so we are left with just
>> the bare audit calls in the io_uring code. As the _entry and _exit
>> functions are small, I've copied them and their supporting functions
>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
>> "task,never" case.
>>
>> + static inline struct audit_context *audit_context(void)
>> + {
>> + return current->audit_context;
>> + }
>>
>> + static inline bool audit_dummy_context(void)
>> + {
>> + void *p = audit_context();
>> + return !p || *(int *)p;
>> + }
>>
>> + static inline void audit_uring_entry(u8 op)
>> + {
>> + if (unlikely(audit_enabled && audit_context()))
>> + __audit_uring_entry(op);
>> + }
>>
>> We have one if statement where the conditional checks on two
>> individual conditions. The first (audit_enabled) is simply a check to
>> see if anyone has "turned on" auditing at runtime; historically this
>> worked rather well, and still does in a number of places, but ever
>> since systemd has taken to forcing audit on regardless of the admin's
>> audit configuration it is less useful. The second (audit_context())
>> is a check to see if an audit_context has been allocated for the
>> current task. In the case of "task,never" current->audit_context will
>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
>> will never be called.
>>
>> Worst case here is checking the value of audit_enabled and
>> current->audit_context. Depending on which you think is more likely
>> we can change the order of the check so that the
>> current->audit_context check is first if you feel that is more likely
>> to be NULL than audit_enabled is to be false (it may be that way now).
>>
>> + static inline void audit_uring_exit(int success, long code)
>> + {
>> + if (unlikely(!audit_dummy_context()))
>> + __audit_uring_exit(success, code);
>> + }
>>
>> The exit call is very similar to the entry call, but in the
>> "task,never" case it is very simple as the first check to be performed
>> is the current->audit_context check which we know to be NULL. The
>> __audit_uring_exit() slowpath will never be called.
>
> I actually ran some numbers this morning. The test base is 5.13+, and
> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
> test and the test with this series applied. I used your git branch as of
> this morning.
>
> The test case is my usual peak perf test, which is random reads at
> QD=128 and using polled IO. It's a single core test, not threaded. I ran
> two different tests - one was having a thread just do the IO, the other
> is using SQPOLL to do the IO for us. The device is capable than more
> IOPS than a single core can deliver, so we're CPU limited in this test.
> Hence it's a good test case as it does actual work, and shows software
> overhead quite nicely. Runs are very stable (less than 0.5% difference
> between runs on the same base), yet I did average 4 runs.
>
> Kernel SQPOLL IOPS Perf diff
> ---------------------------------------------------------
> 5.13 0 3029872 0.0%
> 5.13 1 3031056 0.0%
> 5.13 + audit 0 2894160 -4.5%
> 5.13 + audit 1 2886168 -4.8%
>
> That's an immediate drop in perf of almost 5%. Looking at a quick
> profile of it (nothing fancy, just checking for 'audit' in the profile)
> shows this:
>
> + 2.17% io_uring [kernel.vmlinux] [k] __audit_uring_entry
> + 0.71% io_uring [kernel.vmlinux] [k] __audit_uring_exit
> 0.07% io_uring [kernel.vmlinux] [k] __audit_syscall_entry
> 0.02% io_uring [kernel.vmlinux] [k] __audit_syscall_exit
>
> Note that this is with _no_ rules!
io_uring also supports a NOP command, which basically just measures
reqs/sec through the interface. Ran that as well:
Kernel SQPOLL IOPS Perf diff
---------------------------------------------------------
5.13 0 31.05M 0.0%
5.13 + audit 0 25.31M -18.5%
and profile for the latter includes:
+ 5.19% io_uring [kernel.vmlinux] [k] __audit_uring_entry
+ 4.31% io_uring [kernel.vmlinux] [k] __audit_uring_exit
0.26% io_uring [kernel.vmlinux] [k] __audit_syscall_entry
0.08% io_uring [kernel.vmlinux] [k] __audit_syscall_exit
--
Jens Axboe
On 5/26/21 11:31 AM, Jens Axboe wrote:
> On 5/26/21 11:15 AM, Jens Axboe wrote:
>> On 5/25/21 8:04 PM, Paul Moore wrote:
>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>>>> That said, audit is not for everyone, and we have build time and
>>>>> runtime options to help make life easier. Beyond simply disabling
>>>>> audit at compile time a number of Linux distributions effectively
>>>>> shortcut audit at runtime by adding a "never" rule to the audit
>>>>> filter, for example:
>>>>>
>>>>> % auditctl -a task,never
>>>>
>>>> As has been brought up, the issue we're facing is that distros have
>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
>>>> of people doing custom kernels. My question would then be how much
>>>> overhead the above will add, considering it's an entry/exit call per op.
>>>> If auditctl is turned off, what is the expectation in turns of overhead?
>>>
>>> I commented on that case in my last email to Pavel, but I'll try to go
>>> over it again in a little more detail.
>>>
>>> As we discussed earlier in this thread, we can skip the req->opcode
>>> check before both the _entry and _exit calls, so we are left with just
>>> the bare audit calls in the io_uring code. As the _entry and _exit
>>> functions are small, I've copied them and their supporting functions
>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
>>> "task,never" case.
>>>
>>> + static inline struct audit_context *audit_context(void)
>>> + {
>>> + return current->audit_context;
>>> + }
>>>
>>> + static inline bool audit_dummy_context(void)
>>> + {
>>> + void *p = audit_context();
>>> + return !p || *(int *)p;
>>> + }
>>>
>>> + static inline void audit_uring_entry(u8 op)
>>> + {
>>> + if (unlikely(audit_enabled && audit_context()))
>>> + __audit_uring_entry(op);
>>> + }
>>>
>>> We have one if statement where the conditional checks on two
>>> individual conditions. The first (audit_enabled) is simply a check to
>>> see if anyone has "turned on" auditing at runtime; historically this
>>> worked rather well, and still does in a number of places, but ever
>>> since systemd has taken to forcing audit on regardless of the admin's
>>> audit configuration it is less useful. The second (audit_context())
>>> is a check to see if an audit_context has been allocated for the
>>> current task. In the case of "task,never" current->audit_context will
>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
>>> will never be called.
>>>
>>> Worst case here is checking the value of audit_enabled and
>>> current->audit_context. Depending on which you think is more likely
>>> we can change the order of the check so that the
>>> current->audit_context check is first if you feel that is more likely
>>> to be NULL than audit_enabled is to be false (it may be that way now).
>>>
>>> + static inline void audit_uring_exit(int success, long code)
>>> + {
>>> + if (unlikely(!audit_dummy_context()))
>>> + __audit_uring_exit(success, code);
>>> + }
>>>
>>> The exit call is very similar to the entry call, but in the
>>> "task,never" case it is very simple as the first check to be performed
>>> is the current->audit_context check which we know to be NULL. The
>>> __audit_uring_exit() slowpath will never be called.
>>
>> I actually ran some numbers this morning. The test base is 5.13+, and
>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
>> test and the test with this series applied. I used your git branch as of
>> this morning.
>>
>> The test case is my usual peak perf test, which is random reads at
>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
>> two different tests - one was having a thread just do the IO, the other
>> is using SQPOLL to do the IO for us. The device is capable than more
>> IOPS than a single core can deliver, so we're CPU limited in this test.
>> Hence it's a good test case as it does actual work, and shows software
>> overhead quite nicely. Runs are very stable (less than 0.5% difference
>> between runs on the same base), yet I did average 4 runs.
>>
>> Kernel SQPOLL IOPS Perf diff
>> ---------------------------------------------------------
>> 5.13 0 3029872 0.0%
>> 5.13 1 3031056 0.0%
>> 5.13 + audit 0 2894160 -4.5%
>> 5.13 + audit 1 2886168 -4.8%
>>
>> That's an immediate drop in perf of almost 5%. Looking at a quick
>> profile of it (nothing fancy, just checking for 'audit' in the profile)
>> shows this:
>>
>> + 2.17% io_uring [kernel.vmlinux] [k] __audit_uring_entry
>> + 0.71% io_uring [kernel.vmlinux] [k] __audit_uring_exit
>> 0.07% io_uring [kernel.vmlinux] [k] __audit_syscall_entry
>> 0.02% io_uring [kernel.vmlinux] [k] __audit_syscall_exit
>>
>> Note that this is with _no_ rules!
>
> io_uring also supports a NOP command, which basically just measures
> reqs/sec through the interface. Ran that as well:
>
> Kernel SQPOLL IOPS Perf diff
> ---------------------------------------------------------
> 5.13 0 31.05M 0.0%
> 5.13 + audit 0 25.31M -18.5%
>
> and profile for the latter includes:
>
> + 5.19% io_uring [kernel.vmlinux] [k] __audit_uring_entry
> + 4.31% io_uring [kernel.vmlinux] [k] __audit_uring_exit
> 0.26% io_uring [kernel.vmlinux] [k] __audit_syscall_entry
> 0.08% io_uring [kernel.vmlinux] [k] __audit_syscall_exit
As Pavel correctly pointed it, looks like auditing is enabled. And
indeed it was! Hence the above numbers is without having turned off
auditing. Running the NOPs after having turned off audit, we get 30.6M
IOPS, which is down about 1.5% from the baseline. The results for the
polled random read test above did _not_ change from this, they are still
down the same amount.
Note, and I should have included this in the first email, this is not
any kind of argument for or against audit logging. It's purely meant to
be a set of numbers that show how the current series impacts
performance.
--
Jens Axboe
On 5/26/21 11:54 AM, Jens Axboe wrote: > On 5/26/21 11:31 AM, Jens Axboe wrote: >> On 5/26/21 11:15 AM, Jens Axboe wrote: >>> On 5/25/21 8:04 PM, Paul Moore wrote: >>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote: >>>>> On 5/24/21 1:59 PM, Paul Moore wrote: >>>>>> That said, audit is not for everyone, and we have build time and >>>>>> runtime options to help make life easier. Beyond simply disabling >>>>>> audit at compile time a number of Linux distributions effectively >>>>>> shortcut audit at runtime by adding a "never" rule to the audit >>>>>> filter, for example: >>>>>> >>>>>> % auditctl -a task,never >>>>> >>>>> As has been brought up, the issue we're facing is that distros have >>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside >>>>> of people doing custom kernels. My question would then be how much >>>>> overhead the above will add, considering it's an entry/exit call per op. >>>>> If auditctl is turned off, what is the expectation in turns of overhead? >>>> >>>> I commented on that case in my last email to Pavel, but I'll try to go >>>> over it again in a little more detail. >>>> >>>> As we discussed earlier in this thread, we can skip the req->opcode >>>> check before both the _entry and _exit calls, so we are left with just >>>> the bare audit calls in the io_uring code. As the _entry and _exit >>>> functions are small, I've copied them and their supporting functions >>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y, >>>> "task,never" case. >>>> >>>> + static inline struct audit_context *audit_context(void) >>>> + { >>>> + return current->audit_context; >>>> + } >>>> >>>> + static inline bool audit_dummy_context(void) >>>> + { >>>> + void *p = audit_context(); >>>> + return !p || *(int *)p; >>>> + } >>>> >>>> + static inline void audit_uring_entry(u8 op) >>>> + { >>>> + if (unlikely(audit_enabled && audit_context())) >>>> + __audit_uring_entry(op); >>>> + } >>>> >>>> We have one if statement where the conditional checks on two >>>> individual conditions. The first (audit_enabled) is simply a check to >>>> see if anyone has "turned on" auditing at runtime; historically this >>>> worked rather well, and still does in a number of places, but ever >>>> since systemd has taken to forcing audit on regardless of the admin's >>>> audit configuration it is less useful. The second (audit_context()) >>>> is a check to see if an audit_context has been allocated for the >>>> current task. In the case of "task,never" current->audit_context will >>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath >>>> will never be called. >>>> >>>> Worst case here is checking the value of audit_enabled and >>>> current->audit_context. Depending on which you think is more likely >>>> we can change the order of the check so that the >>>> current->audit_context check is first if you feel that is more likely >>>> to be NULL than audit_enabled is to be false (it may be that way now). >>>> >>>> + static inline void audit_uring_exit(int success, long code) >>>> + { >>>> + if (unlikely(!audit_dummy_context())) >>>> + __audit_uring_exit(success, code); >>>> + } >>>> >>>> The exit call is very similar to the entry call, but in the >>>> "task,never" case it is very simple as the first check to be performed >>>> is the current->audit_context check which we know to be NULL. The >>>> __audit_uring_exit() slowpath will never be called. >>> >>> I actually ran some numbers this morning. The test base is 5.13+, and >>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline >>> test and the test with this series applied. I used your git branch as of >>> this morning. >>> >>> The test case is my usual peak perf test, which is random reads at >>> QD=128 and using polled IO. It's a single core test, not threaded. I ran >>> two different tests - one was having a thread just do the IO, the other >>> is using SQPOLL to do the IO for us. The device is capable than more >>> IOPS than a single core can deliver, so we're CPU limited in this test. >>> Hence it's a good test case as it does actual work, and shows software >>> overhead quite nicely. Runs are very stable (less than 0.5% difference >>> between runs on the same base), yet I did average 4 runs. >>> >>> Kernel SQPOLL IOPS Perf diff >>> --------------------------------------------------------- >>> 5.13 0 3029872 0.0% >>> 5.13 1 3031056 0.0% >>> 5.13 + audit 0 2894160 -4.5% >>> 5.13 + audit 1 2886168 -4.8% >>> >>> That's an immediate drop in perf of almost 5%. Looking at a quick >>> profile of it (nothing fancy, just checking for 'audit' in the profile) >>> shows this: >>> >>> + 2.17% io_uring [kernel.vmlinux] [k] __audit_uring_entry >>> + 0.71% io_uring [kernel.vmlinux] [k] __audit_uring_exit >>> 0.07% io_uring [kernel.vmlinux] [k] __audit_syscall_entry >>> 0.02% io_uring [kernel.vmlinux] [k] __audit_syscall_exit >>> >>> Note that this is with _no_ rules! >> >> io_uring also supports a NOP command, which basically just measures >> reqs/sec through the interface. Ran that as well: >> >> Kernel SQPOLL IOPS Perf diff >> --------------------------------------------------------- >> 5.13 0 31.05M 0.0% >> 5.13 + audit 0 25.31M -18.5% >> >> and profile for the latter includes: >> >> + 5.19% io_uring [kernel.vmlinux] [k] __audit_uring_entry >> + 4.31% io_uring [kernel.vmlinux] [k] __audit_uring_exit >> 0.26% io_uring [kernel.vmlinux] [k] __audit_syscall_entry >> 0.08% io_uring [kernel.vmlinux] [k] __audit_syscall_exit > > As Pavel correctly pointed it, looks like auditing is enabled. And > indeed it was! Hence the above numbers is without having turned off > auditing. Running the NOPs after having turned off audit, we get 30.6M > IOPS, which is down about 1.5% from the baseline. The results for the > polled random read test above did _not_ change from this, they are still > down the same amount. > > Note, and I should have included this in the first email, this is not > any kind of argument for or against audit logging. It's purely meant to > be a set of numbers that show how the current series impacts > performance. And finally, just checking if we make it optional per opcode if we see any real impact, and the answer is no. Using the below patch which effectively bypasses audit calls unless the opcode has flagged the need to do so, I cannot measure any difference in perf (as expected). To turn this into something useful, my suggestion as a viable path forward would be: 1) Use something like the below patch and flag request types that we want to do audit logging for. 2) As Pavel suggested, eliminate the need for having both and entry/exit hook, turning it into just one. That effectively cuts the number of checks and calls in half. diff --git a/fs/io_uring.c b/fs/io_uring.c index aa065808ddcf..2c7c913b786b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -885,6 +885,8 @@ struct io_op_def { unsigned needs_async_setup : 1; /* should block plug */ unsigned plug : 1; + /* should audit */ + unsigned audit : 1; /* size of async data needed, if any */ unsigned short async_size; }; @@ -6122,7 +6124,7 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) if (req->work.creds && req->work.creds != current_cred()) creds = override_creds(req->work.creds); - if (req->opcode < IORING_OP_LAST) + if (io_op_defs[req->opcode].audit) audit_uring_entry(req->opcode); switch (req->opcode) { @@ -6231,7 +6233,7 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) break; } - if (req->opcode < IORING_OP_LAST) + if (io_op_defs[req->opcode].audit) audit_uring_exit(!ret, ret); if (creds) -- Jens Axboe
On Wed, May 26, 2021 at 1:54 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 5/26/21 11:31 AM, Jens Axboe wrote:
> > On 5/26/21 11:15 AM, Jens Axboe wrote:
> >> On 5/25/21 8:04 PM, Paul Moore wrote:
> >>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>> On 5/24/21 1:59 PM, Paul Moore wrote:
> >>>>> That said, audit is not for everyone, and we have build time and
> >>>>> runtime options to help make life easier. Beyond simply disabling
> >>>>> audit at compile time a number of Linux distributions effectively
> >>>>> shortcut audit at runtime by adding a "never" rule to the audit
> >>>>> filter, for example:
> >>>>>
> >>>>> % auditctl -a task,never
> >>>>
> >>>> As has been brought up, the issue we're facing is that distros have
> >>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
> >>>> of people doing custom kernels. My question would then be how much
> >>>> overhead the above will add, considering it's an entry/exit call per op.
> >>>> If auditctl is turned off, what is the expectation in turns of overhead?
> >>>
> >>> I commented on that case in my last email to Pavel, but I'll try to go
> >>> over it again in a little more detail.
> >>>
> >>> As we discussed earlier in this thread, we can skip the req->opcode
> >>> check before both the _entry and _exit calls, so we are left with just
> >>> the bare audit calls in the io_uring code. As the _entry and _exit
> >>> functions are small, I've copied them and their supporting functions
> >>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
> >>> "task,never" case.
> >>>
> >>> + static inline struct audit_context *audit_context(void)
> >>> + {
> >>> + return current->audit_context;
> >>> + }
> >>>
> >>> + static inline bool audit_dummy_context(void)
> >>> + {
> >>> + void *p = audit_context();
> >>> + return !p || *(int *)p;
> >>> + }
> >>>
> >>> + static inline void audit_uring_entry(u8 op)
> >>> + {
> >>> + if (unlikely(audit_enabled && audit_context()))
> >>> + __audit_uring_entry(op);
> >>> + }
> >>>
> >>> We have one if statement where the conditional checks on two
> >>> individual conditions. The first (audit_enabled) is simply a check to
> >>> see if anyone has "turned on" auditing at runtime; historically this
> >>> worked rather well, and still does in a number of places, but ever
> >>> since systemd has taken to forcing audit on regardless of the admin's
> >>> audit configuration it is less useful. The second (audit_context())
> >>> is a check to see if an audit_context has been allocated for the
> >>> current task. In the case of "task,never" current->audit_context will
> >>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
> >>> will never be called.
> >>>
> >>> Worst case here is checking the value of audit_enabled and
> >>> current->audit_context. Depending on which you think is more likely
> >>> we can change the order of the check so that the
> >>> current->audit_context check is first if you feel that is more likely
> >>> to be NULL than audit_enabled is to be false (it may be that way now).
> >>>
> >>> + static inline void audit_uring_exit(int success, long code)
> >>> + {
> >>> + if (unlikely(!audit_dummy_context()))
> >>> + __audit_uring_exit(success, code);
> >>> + }
> >>>
> >>> The exit call is very similar to the entry call, but in the
> >>> "task,never" case it is very simple as the first check to be performed
> >>> is the current->audit_context check which we know to be NULL. The
> >>> __audit_uring_exit() slowpath will never be called.
> >>
> >> I actually ran some numbers this morning. The test base is 5.13+, and
> >> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
> >> test and the test with this series applied. I used your git branch as of
> >> this morning.
> >>
> >> The test case is my usual peak perf test, which is random reads at
> >> QD=128 and using polled IO. It's a single core test, not threaded. I ran
> >> two different tests - one was having a thread just do the IO, the other
> >> is using SQPOLL to do the IO for us. The device is capable than more
> >> IOPS than a single core can deliver, so we're CPU limited in this test.
> >> Hence it's a good test case as it does actual work, and shows software
> >> overhead quite nicely. Runs are very stable (less than 0.5% difference
> >> between runs on the same base), yet I did average 4 runs.
> >>
> >> Kernel SQPOLL IOPS Perf diff
> >> ---------------------------------------------------------
> >> 5.13 0 3029872 0.0%
> >> 5.13 1 3031056 0.0%
> >> 5.13 + audit 0 2894160 -4.5%
> >> 5.13 + audit 1 2886168 -4.8%
> >>
> >> That's an immediate drop in perf of almost 5%. Looking at a quick
> >> profile of it (nothing fancy, just checking for 'audit' in the profile)
> >> shows this:
> >>
> >> + 2.17% io_uring [kernel.vmlinux] [k] __audit_uring_entry
> >> + 0.71% io_uring [kernel.vmlinux] [k] __audit_uring_exit
> >> 0.07% io_uring [kernel.vmlinux] [k] __audit_syscall_entry
> >> 0.02% io_uring [kernel.vmlinux] [k] __audit_syscall_exit
> >>
> >> Note that this is with _no_ rules!
> >
> > io_uring also supports a NOP command, which basically just measures
> > reqs/sec through the interface. Ran that as well:
> >
> > Kernel SQPOLL IOPS Perf diff
> > ---------------------------------------------------------
> > 5.13 0 31.05M 0.0%
> > 5.13 + audit 0 25.31M -18.5%
> >
> > and profile for the latter includes:
> >
> > + 5.19% io_uring [kernel.vmlinux] [k] __audit_uring_entry
> > + 4.31% io_uring [kernel.vmlinux] [k] __audit_uring_exit
> > 0.26% io_uring [kernel.vmlinux] [k] __audit_syscall_entry
> > 0.08% io_uring [kernel.vmlinux] [k] __audit_syscall_exit
>
> As Pavel correctly pointed it, looks like auditing is enabled. And
> indeed it was! Hence the above numbers is without having turned off
> auditing. Running the NOPs after having turned off audit, we get 30.6M
> IOPS, which is down about 1.5% from the baseline. The results for the
> polled random read test above did _not_ change from this, they are still
> down the same amount.
>
> Note, and I should have included this in the first email, this is not
> any kind of argument for or against audit logging. It's purely meant to
> be a set of numbers that show how the current series impacts
> performance.
Thanks for running some tests Jens, unfortunately the git tree hadn't
been updated to reflect the improved audit_uring_entry() and
audit_uring_exit() functions as we were still discussing things and I
thought there still might be some changes. I just now updated the
branch with the latest code so if you have the cycles (ha!) to run
another perf test I think those numbers would be more interesting.
For example, I don't believe you should see __audit_uring_entry() or
__audit_uring_exit() at all if you have the audit "task,never" rule
loaded; you will see audit_uring_entry() and audit_uring_exit() but as
we discussed previously those should just be a single
"current->audit_context != NULL" check.
Also, I want to pull back a bit as I think there is confusion about
how audit works and why these changes are necessary. As everyone
likely knows, there are audit calls sprinkled throughout the kernel
code, e.g. audit_log_format() and similar. In addition to these calls
that most are aware of, there are setup/teardown audit calls that run
at task creation and destruction as well as at syscall entry and exit.
It is these lesser known calls that are responsible for the filtering,
setup, and general management of the audit context state in the
system; they also handle generation of some audit records such as
SYSCALL, PATH, etc. based on information that is recorded by audit
calls inserted into other places in the kernel. The
audit_alloc_kernel() and audit_free() calls we are adding to the
io_uring/io-wq code are intended to do similar things to the existing
audit task creation/destruction hooks, but for the io_uring/io-wq
kernel threads. Similarly the audit_uring_entry() and
audit_uring_exit() calls are intended to act as replacements for the
syscall entry and exit code. If we want the existing audit calls in
the kernel to work properly, we need these setup/teardown functions.
Hopefully that makes things a bit more clear regarding these calls in
the io_uring/io-wq code.
Another point I wanted to address is the "double audit" (!!!) that has
been mentioned recently in this thread. Don't get too excited, this
isn't quite what you think it is, it is a side effect of how io_uring
can dispatch certain operations. As I think the io_uring folks
already know, io_uring can process I/O ops in several different
contexts; one of which is after a syscall completes but before the
kernel returns to userspace. In this particular case things can get
rather interesting from an audit perspective as we need to handle both
the syscall audit records *and* the io_uring operation records. If
you look closer at the audit code in this patchset you'll see some of
the fun stuff we need to do to make sure this case is handled
correctly. If you happen to see a code path that takes you through
audit_syscall_entry() + <syscall_stuff> + audit_uring_entry() +
<io_uring_stuff> + audit_uring_exit() + audit_syscall_exit() rest
assured that is correct :)
Of course there is the normal audit_uring_entry() and
audit_uring_exit() without the audit syscall hooks in other code
paths, e.g. SQPOLL, but those are less dramatic than the "OMG, double
audit!!!" ;)
--
paul moore
www.paul-moore.com
On Wed, May 26, 2021 at 2:01 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 5/26/21 11:54 AM, Jens Axboe wrote:
> > On 5/26/21 11:31 AM, Jens Axboe wrote:
> >> On 5/26/21 11:15 AM, Jens Axboe wrote:
> >>> On 5/25/21 8:04 PM, Paul Moore wrote:
> >>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
> >>>>>> That said, audit is not for everyone, and we have build time and
> >>>>>> runtime options to help make life easier. Beyond simply disabling
> >>>>>> audit at compile time a number of Linux distributions effectively
> >>>>>> shortcut audit at runtime by adding a "never" rule to the audit
> >>>>>> filter, for example:
> >>>>>>
> >>>>>> % auditctl -a task,never
> >>>>>
> >>>>> As has been brought up, the issue we're facing is that distros have
> >>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
> >>>>> of people doing custom kernels. My question would then be how much
> >>>>> overhead the above will add, considering it's an entry/exit call per op.
> >>>>> If auditctl is turned off, what is the expectation in turns of overhead?
> >>>>
> >>>> I commented on that case in my last email to Pavel, but I'll try to go
> >>>> over it again in a little more detail.
> >>>>
> >>>> As we discussed earlier in this thread, we can skip the req->opcode
> >>>> check before both the _entry and _exit calls, so we are left with just
> >>>> the bare audit calls in the io_uring code. As the _entry and _exit
> >>>> functions are small, I've copied them and their supporting functions
> >>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
> >>>> "task,never" case.
> >>>>
> >>>> + static inline struct audit_context *audit_context(void)
> >>>> + {
> >>>> + return current->audit_context;
> >>>> + }
> >>>>
> >>>> + static inline bool audit_dummy_context(void)
> >>>> + {
> >>>> + void *p = audit_context();
> >>>> + return !p || *(int *)p;
> >>>> + }
> >>>>
> >>>> + static inline void audit_uring_entry(u8 op)
> >>>> + {
> >>>> + if (unlikely(audit_enabled && audit_context()))
> >>>> + __audit_uring_entry(op);
> >>>> + }
> >>>>
> >>>> We have one if statement where the conditional checks on two
> >>>> individual conditions. The first (audit_enabled) is simply a check to
> >>>> see if anyone has "turned on" auditing at runtime; historically this
> >>>> worked rather well, and still does in a number of places, but ever
> >>>> since systemd has taken to forcing audit on regardless of the admin's
> >>>> audit configuration it is less useful. The second (audit_context())
> >>>> is a check to see if an audit_context has been allocated for the
> >>>> current task. In the case of "task,never" current->audit_context will
> >>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
> >>>> will never be called.
> >>>>
> >>>> Worst case here is checking the value of audit_enabled and
> >>>> current->audit_context. Depending on which you think is more likely
> >>>> we can change the order of the check so that the
> >>>> current->audit_context check is first if you feel that is more likely
> >>>> to be NULL than audit_enabled is to be false (it may be that way now).
> >>>>
> >>>> + static inline void audit_uring_exit(int success, long code)
> >>>> + {
> >>>> + if (unlikely(!audit_dummy_context()))
> >>>> + __audit_uring_exit(success, code);
> >>>> + }
> >>>>
> >>>> The exit call is very similar to the entry call, but in the
> >>>> "task,never" case it is very simple as the first check to be performed
> >>>> is the current->audit_context check which we know to be NULL. The
> >>>> __audit_uring_exit() slowpath will never be called.
> >>>
> >>> I actually ran some numbers this morning. The test base is 5.13+, and
> >>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
> >>> test and the test with this series applied. I used your git branch as of
> >>> this morning.
> >>>
> >>> The test case is my usual peak perf test, which is random reads at
> >>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
> >>> two different tests - one was having a thread just do the IO, the other
> >>> is using SQPOLL to do the IO for us. The device is capable than more
> >>> IOPS than a single core can deliver, so we're CPU limited in this test.
> >>> Hence it's a good test case as it does actual work, and shows software
> >>> overhead quite nicely. Runs are very stable (less than 0.5% difference
> >>> between runs on the same base), yet I did average 4 runs.
> >>>
> >>> Kernel SQPOLL IOPS Perf diff
> >>> ---------------------------------------------------------
> >>> 5.13 0 3029872 0.0%
> >>> 5.13 1 3031056 0.0%
> >>> 5.13 + audit 0 2894160 -4.5%
> >>> 5.13 + audit 1 2886168 -4.8%
> >>>
> >>> That's an immediate drop in perf of almost 5%. Looking at a quick
> >>> profile of it (nothing fancy, just checking for 'audit' in the profile)
> >>> shows this:
> >>>
> >>> + 2.17% io_uring [kernel.vmlinux] [k] __audit_uring_entry
> >>> + 0.71% io_uring [kernel.vmlinux] [k] __audit_uring_exit
> >>> 0.07% io_uring [kernel.vmlinux] [k] __audit_syscall_entry
> >>> 0.02% io_uring [kernel.vmlinux] [k] __audit_syscall_exit
> >>>
> >>> Note that this is with _no_ rules!
> >>
> >> io_uring also supports a NOP command, which basically just measures
> >> reqs/sec through the interface. Ran that as well:
> >>
> >> Kernel SQPOLL IOPS Perf diff
> >> ---------------------------------------------------------
> >> 5.13 0 31.05M 0.0%
> >> 5.13 + audit 0 25.31M -18.5%
> >>
> >> and profile for the latter includes:
> >>
> >> + 5.19% io_uring [kernel.vmlinux] [k] __audit_uring_entry
> >> + 4.31% io_uring [kernel.vmlinux] [k] __audit_uring_exit
> >> 0.26% io_uring [kernel.vmlinux] [k] __audit_syscall_entry
> >> 0.08% io_uring [kernel.vmlinux] [k] __audit_syscall_exit
> >
> > As Pavel correctly pointed it, looks like auditing is enabled. And
> > indeed it was! Hence the above numbers is without having turned off
> > auditing. Running the NOPs after having turned off audit, we get 30.6M
> > IOPS, which is down about 1.5% from the baseline. The results for the
> > polled random read test above did _not_ change from this, they are still
> > down the same amount.
> >
> > Note, and I should have included this in the first email, this is not
> > any kind of argument for or against audit logging. It's purely meant to
> > be a set of numbers that show how the current series impacts
> > performance.
>
> And finally, just checking if we make it optional per opcode if we see
> any real impact, and the answer is no. Using the below patch which
> effectively bypasses audit calls unless the opcode has flagged the need
> to do so, I cannot measure any difference in perf (as expected).
>
> To turn this into something useful, my suggestion as a viable path
> forward would be:
>
> 1) Use something like the below patch and flag request types that we
> want to do audit logging for.
>
> 2) As Pavel suggested, eliminate the need for having both and entry/exit
> hook, turning it into just one. That effectively cuts the number of
> checks and calls in half.
I suspect the updated working-io_uring branch with HEAD at
1f25193a3f54 (updated a short time ago, see my last email in this
thread) will improve performance. Also, as has been mention several
times now, for audit to work we need both the _entry and _exit call.
--
paul moore
www.paul-moore.com
On Wed, May 26, 2021 at 10:59 AM Jeff Moyer <jmoyer@redhat.com> wrote:
> Paul Moore <paul@paul-moore.com> writes:
>
> > Also, any pointers to easy-to-run io_uring tests would be helpful. I
> > am particularly interested in tests which make use of the personality
> > option, share urings across process boundaries, and make use of the
> > sqpoll functionality.
>
> liburing contains a test suite:
> https://git.kernel.dk/cgit/liburing/
>
> You can run it via 'make runtests'.
Thanks Jeff, I'll take a look. Quick question as I start sifting
through the tests, are there any tests in here which share a single
ring across process boundaries?
--
paul moore
www.paul-moore.com
On 5/26/21 7:44 PM, Paul Moore wrote:
> On Wed, May 26, 2021 at 2:01 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 5/26/21 11:54 AM, Jens Axboe wrote:
>>> On 5/26/21 11:31 AM, Jens Axboe wrote:
>>>> On 5/26/21 11:15 AM, Jens Axboe wrote:
>>>>> On 5/25/21 8:04 PM, Paul Moore wrote:
>>>>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>>>>>>> That said, audit is not for everyone, and we have build time and
>>>>>>>> runtime options to help make life easier. Beyond simply disabling
>>>>>>>> audit at compile time a number of Linux distributions effectively
>>>>>>>> shortcut audit at runtime by adding a "never" rule to the audit
>>>>>>>> filter, for example:
>>>>>>>>
>>>>>>>> % auditctl -a task,never
>>>>>>>
>>>>>>> As has been brought up, the issue we're facing is that distros have
>>>>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
>>>>>>> of people doing custom kernels. My question would then be how much
>>>>>>> overhead the above will add, considering it's an entry/exit call per op.
>>>>>>> If auditctl is turned off, what is the expectation in turns of overhead?
>>>>>>
>>>>>> I commented on that case in my last email to Pavel, but I'll try to go
>>>>>> over it again in a little more detail.
>>>>>>
>>>>>> As we discussed earlier in this thread, we can skip the req->opcode
>>>>>> check before both the _entry and _exit calls, so we are left with just
>>>>>> the bare audit calls in the io_uring code. As the _entry and _exit
>>>>>> functions are small, I've copied them and their supporting functions
>>>>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
>>>>>> "task,never" case.
>>>>>>
>>>>>> + static inline struct audit_context *audit_context(void)
>>>>>> + {
>>>>>> + return current->audit_context;
>>>>>> + }
>>>>>>
>>>>>> + static inline bool audit_dummy_context(void)
>>>>>> + {
>>>>>> + void *p = audit_context();
>>>>>> + return !p || *(int *)p;
>>>>>> + }
>>>>>>
>>>>>> + static inline void audit_uring_entry(u8 op)
>>>>>> + {
>>>>>> + if (unlikely(audit_enabled && audit_context()))
>>>>>> + __audit_uring_entry(op);
>>>>>> + }
>>>>>>
>>>>>> We have one if statement where the conditional checks on two
>>>>>> individual conditions. The first (audit_enabled) is simply a check to
>>>>>> see if anyone has "turned on" auditing at runtime; historically this
>>>>>> worked rather well, and still does in a number of places, but ever
>>>>>> since systemd has taken to forcing audit on regardless of the admin's
>>>>>> audit configuration it is less useful. The second (audit_context())
>>>>>> is a check to see if an audit_context has been allocated for the
>>>>>> current task. In the case of "task,never" current->audit_context will
>>>>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
>>>>>> will never be called.
>>>>>>
>>>>>> Worst case here is checking the value of audit_enabled and
>>>>>> current->audit_context. Depending on which you think is more likely
>>>>>> we can change the order of the check so that the
>>>>>> current->audit_context check is first if you feel that is more likely
>>>>>> to be NULL than audit_enabled is to be false (it may be that way now).
>>>>>>
>>>>>> + static inline void audit_uring_exit(int success, long code)
>>>>>> + {
>>>>>> + if (unlikely(!audit_dummy_context()))
>>>>>> + __audit_uring_exit(success, code);
>>>>>> + }
>>>>>>
>>>>>> The exit call is very similar to the entry call, but in the
>>>>>> "task,never" case it is very simple as the first check to be performed
>>>>>> is the current->audit_context check which we know to be NULL. The
>>>>>> __audit_uring_exit() slowpath will never be called.
>>>>>
>>>>> I actually ran some numbers this morning. The test base is 5.13+, and
>>>>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
>>>>> test and the test with this series applied. I used your git branch as of
>>>>> this morning.
>>>>>
>>>>> The test case is my usual peak perf test, which is random reads at
>>>>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
>>>>> two different tests - one was having a thread just do the IO, the other
>>>>> is using SQPOLL to do the IO for us. The device is capable than more
>>>>> IOPS than a single core can deliver, so we're CPU limited in this test.
>>>>> Hence it's a good test case as it does actual work, and shows software
>>>>> overhead quite nicely. Runs are very stable (less than 0.5% difference
>>>>> between runs on the same base), yet I did average 4 runs.
>>>>>
>>>>> Kernel SQPOLL IOPS Perf diff
>>>>> ---------------------------------------------------------
>>>>> 5.13 0 3029872 0.0%
>>>>> 5.13 1 3031056 0.0%
>>>>> 5.13 + audit 0 2894160 -4.5%
>>>>> 5.13 + audit 1 2886168 -4.8%
>>>>>
>>>>> That's an immediate drop in perf of almost 5%. Looking at a quick
>>>>> profile of it (nothing fancy, just checking for 'audit' in the profile)
>>>>> shows this:
>>>>>
>>>>> + 2.17% io_uring [kernel.vmlinux] [k] __audit_uring_entry
>>>>> + 0.71% io_uring [kernel.vmlinux] [k] __audit_uring_exit
>>>>> 0.07% io_uring [kernel.vmlinux] [k] __audit_syscall_entry
>>>>> 0.02% io_uring [kernel.vmlinux] [k] __audit_syscall_exit
>>>>>
>>>>> Note that this is with _no_ rules!
>>>>
>>>> io_uring also supports a NOP command, which basically just measures
>>>> reqs/sec through the interface. Ran that as well:
>>>>
>>>> Kernel SQPOLL IOPS Perf diff
>>>> ---------------------------------------------------------
>>>> 5.13 0 31.05M 0.0%
>>>> 5.13 + audit 0 25.31M -18.5%
>>>>
>>>> and profile for the latter includes:
>>>>
>>>> + 5.19% io_uring [kernel.vmlinux] [k] __audit_uring_entry
>>>> + 4.31% io_uring [kernel.vmlinux] [k] __audit_uring_exit
>>>> 0.26% io_uring [kernel.vmlinux] [k] __audit_syscall_entry
>>>> 0.08% io_uring [kernel.vmlinux] [k] __audit_syscall_exit
>>>
>>> As Pavel correctly pointed it, looks like auditing is enabled. And
>>> indeed it was! Hence the above numbers is without having turned off
>>> auditing. Running the NOPs after having turned off audit, we get 30.6M
>>> IOPS, which is down about 1.5% from the baseline. The results for the
>>> polled random read test above did _not_ change from this, they are still
>>> down the same amount.
>>>
>>> Note, and I should have included this in the first email, this is not
>>> any kind of argument for or against audit logging. It's purely meant to
>>> be a set of numbers that show how the current series impacts
>>> performance.
>>
>> And finally, just checking if we make it optional per opcode if we see
>> any real impact, and the answer is no. Using the below patch which
>> effectively bypasses audit calls unless the opcode has flagged the need
>> to do so, I cannot measure any difference in perf (as expected).
>>
>> To turn this into something useful, my suggestion as a viable path
>> forward would be:
>>
>> 1) Use something like the below patch and flag request types that we
>> want to do audit logging for.
>>
>> 2) As Pavel suggested, eliminate the need for having both and entry/exit
>> hook, turning it into just one. That effectively cuts the number of
>> checks and calls in half.
>
> I suspect the updated working-io_uring branch with HEAD at
> 1f25193a3f54 (updated a short time ago, see my last email in this
> thread) will improve performance. Also, as has been mention several
See the email you replied to, ~1.5% was basically an overhead of
two `if (io_op_defs[req->opcode].audit)` in case of nops, where at
least once req->opcode is cached. But to be completely fair, misses
unlikely
--
Pavel Begunkov
Paul Moore <paul@paul-moore.com> writes:
> On Wed, May 26, 2021 at 10:59 AM Jeff Moyer <jmoyer@redhat.com> wrote:
>> Paul Moore <paul@paul-moore.com> writes:
>>
>> > Also, any pointers to easy-to-run io_uring tests would be helpful. I
>> > am particularly interested in tests which make use of the personality
>> > option, share urings across process boundaries, and make use of the
>> > sqpoll functionality.
>>
>> liburing contains a test suite:
>> https://git.kernel.dk/cgit/liburing/
>>
>> You can run it via 'make runtests'.
>
> Thanks Jeff, I'll take a look. Quick question as I start sifting
> through the tests, are there any tests in here which share a single
> ring across process boundaries?
Yes. At the very least, this one:
test/across-fork.c
-Jeff
On Wed, May 26, 2021 at 2:57 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> On 5/26/21 7:44 PM, Paul Moore wrote:
> > On Wed, May 26, 2021 at 2:01 PM Jens Axboe <axboe@kernel.dk> wrote:
> >> On 5/26/21 11:54 AM, Jens Axboe wrote:
> >>> On 5/26/21 11:31 AM, Jens Axboe wrote:
> >>>> On 5/26/21 11:15 AM, Jens Axboe wrote:
> >>>>> On 5/25/21 8:04 PM, Paul Moore wrote:
> >>>>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
> >>>>>>>> That said, audit is not for everyone, and we have build time and
> >>>>>>>> runtime options to help make life easier. Beyond simply disabling
> >>>>>>>> audit at compile time a number of Linux distributions effectively
> >>>>>>>> shortcut audit at runtime by adding a "never" rule to the audit
> >>>>>>>> filter, for example:
> >>>>>>>>
> >>>>>>>> % auditctl -a task,never
> >>>>>>>
> >>>>>>> As has been brought up, the issue we're facing is that distros have
> >>>>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
> >>>>>>> of people doing custom kernels. My question would then be how much
> >>>>>>> overhead the above will add, considering it's an entry/exit call per op.
> >>>>>>> If auditctl is turned off, what is the expectation in turns of overhead?
> >>>>>>
> >>>>>> I commented on that case in my last email to Pavel, but I'll try to go
> >>>>>> over it again in a little more detail.
> >>>>>>
> >>>>>> As we discussed earlier in this thread, we can skip the req->opcode
> >>>>>> check before both the _entry and _exit calls, so we are left with just
> >>>>>> the bare audit calls in the io_uring code. As the _entry and _exit
> >>>>>> functions are small, I've copied them and their supporting functions
> >>>>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
> >>>>>> "task,never" case.
> >>>>>>
> >>>>>> + static inline struct audit_context *audit_context(void)
> >>>>>> + {
> >>>>>> + return current->audit_context;
> >>>>>> + }
> >>>>>>
> >>>>>> + static inline bool audit_dummy_context(void)
> >>>>>> + {
> >>>>>> + void *p = audit_context();
> >>>>>> + return !p || *(int *)p;
> >>>>>> + }
> >>>>>>
> >>>>>> + static inline void audit_uring_entry(u8 op)
> >>>>>> + {
> >>>>>> + if (unlikely(audit_enabled && audit_context()))
> >>>>>> + __audit_uring_entry(op);
> >>>>>> + }
> >>>>>>
> >>>>>> We have one if statement where the conditional checks on two
> >>>>>> individual conditions. The first (audit_enabled) is simply a check to
> >>>>>> see if anyone has "turned on" auditing at runtime; historically this
> >>>>>> worked rather well, and still does in a number of places, but ever
> >>>>>> since systemd has taken to forcing audit on regardless of the admin's
> >>>>>> audit configuration it is less useful. The second (audit_context())
> >>>>>> is a check to see if an audit_context has been allocated for the
> >>>>>> current task. In the case of "task,never" current->audit_context will
> >>>>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
> >>>>>> will never be called.
> >>>>>>
> >>>>>> Worst case here is checking the value of audit_enabled and
> >>>>>> current->audit_context. Depending on which you think is more likely
> >>>>>> we can change the order of the check so that the
> >>>>>> current->audit_context check is first if you feel that is more likely
> >>>>>> to be NULL than audit_enabled is to be false (it may be that way now).
> >>>>>>
> >>>>>> + static inline void audit_uring_exit(int success, long code)
> >>>>>> + {
> >>>>>> + if (unlikely(!audit_dummy_context()))
> >>>>>> + __audit_uring_exit(success, code);
> >>>>>> + }
> >>>>>>
> >>>>>> The exit call is very similar to the entry call, but in the
> >>>>>> "task,never" case it is very simple as the first check to be performed
> >>>>>> is the current->audit_context check which we know to be NULL. The
> >>>>>> __audit_uring_exit() slowpath will never be called.
> >>>>>
> >>>>> I actually ran some numbers this morning. The test base is 5.13+, and
> >>>>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
> >>>>> test and the test with this series applied. I used your git branch as of
> >>>>> this morning.
> >>>>>
> >>>>> The test case is my usual peak perf test, which is random reads at
> >>>>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
> >>>>> two different tests - one was having a thread just do the IO, the other
> >>>>> is using SQPOLL to do the IO for us. The device is capable than more
> >>>>> IOPS than a single core can deliver, so we're CPU limited in this test.
> >>>>> Hence it's a good test case as it does actual work, and shows software
> >>>>> overhead quite nicely. Runs are very stable (less than 0.5% difference
> >>>>> between runs on the same base), yet I did average 4 runs.
> >>>>>
> >>>>> Kernel SQPOLL IOPS Perf diff
> >>>>> ---------------------------------------------------------
> >>>>> 5.13 0 3029872 0.0%
> >>>>> 5.13 1 3031056 0.0%
> >>>>> 5.13 + audit 0 2894160 -4.5%
> >>>>> 5.13 + audit 1 2886168 -4.8%
> >>>>>
> >>>>> That's an immediate drop in perf of almost 5%. Looking at a quick
> >>>>> profile of it (nothing fancy, just checking for 'audit' in the profile)
> >>>>> shows this:
> >>>>>
> >>>>> + 2.17% io_uring [kernel.vmlinux] [k] __audit_uring_entry
> >>>>> + 0.71% io_uring [kernel.vmlinux] [k] __audit_uring_exit
> >>>>> 0.07% io_uring [kernel.vmlinux] [k] __audit_syscall_entry
> >>>>> 0.02% io_uring [kernel.vmlinux] [k] __audit_syscall_exit
> >>>>>
> >>>>> Note that this is with _no_ rules!
> >>>>
> >>>> io_uring also supports a NOP command, which basically just measures
> >>>> reqs/sec through the interface. Ran that as well:
> >>>>
> >>>> Kernel SQPOLL IOPS Perf diff
> >>>> ---------------------------------------------------------
> >>>> 5.13 0 31.05M 0.0%
> >>>> 5.13 + audit 0 25.31M -18.5%
> >>>>
> >>>> and profile for the latter includes:
> >>>>
> >>>> + 5.19% io_uring [kernel.vmlinux] [k] __audit_uring_entry
> >>>> + 4.31% io_uring [kernel.vmlinux] [k] __audit_uring_exit
> >>>> 0.26% io_uring [kernel.vmlinux] [k] __audit_syscall_entry
> >>>> 0.08% io_uring [kernel.vmlinux] [k] __audit_syscall_exit
> >>>
> >>> As Pavel correctly pointed it, looks like auditing is enabled. And
> >>> indeed it was! Hence the above numbers is without having turned off
> >>> auditing. Running the NOPs after having turned off audit, we get 30.6M
> >>> IOPS, which is down about 1.5% from the baseline. The results for the
> >>> polled random read test above did _not_ change from this, they are still
> >>> down the same amount.
> >>>
> >>> Note, and I should have included this in the first email, this is not
> >>> any kind of argument for or against audit logging. It's purely meant to
> >>> be a set of numbers that show how the current series impacts
> >>> performance.
> >>
> >> And finally, just checking if we make it optional per opcode if we see
> >> any real impact, and the answer is no. Using the below patch which
> >> effectively bypasses audit calls unless the opcode has flagged the need
> >> to do so, I cannot measure any difference in perf (as expected).
> >>
> >> To turn this into something useful, my suggestion as a viable path
> >> forward would be:
> >>
> >> 1) Use something like the below patch and flag request types that we
> >> want to do audit logging for.
> >>
> >> 2) As Pavel suggested, eliminate the need for having both and entry/exit
> >> hook, turning it into just one. That effectively cuts the number of
> >> checks and calls in half.
> >
> > I suspect the updated working-io_uring branch with HEAD at
> > 1f25193a3f54 (updated a short time ago, see my last email in this
> > thread) will improve performance. Also, as has been mention several
>
> See the email you replied to, ~1.5% was basically an overhead of
> two `if (io_op_defs[req->opcode].audit)` in case of nops, where at
> least once req->opcode is cached. But to be completely fair, misses
> unlikely
Maybe. I remain skeptical that "io_op_defs[req->opcode].audit" has
the same cost as "unlikely(audit_context())".
--
paul moore
www.paul-moore.com
On Wed, May 26, 2021 at 3:06 PM Jeff Moyer <jmoyer@redhat.com> wrote:
> Paul Moore <paul@paul-moore.com> writes:
>
> > On Wed, May 26, 2021 at 10:59 AM Jeff Moyer <jmoyer@redhat.com> wrote:
> >> Paul Moore <paul@paul-moore.com> writes:
> >>
> >> > Also, any pointers to easy-to-run io_uring tests would be helpful. I
> >> > am particularly interested in tests which make use of the personality
> >> > option, share urings across process boundaries, and make use of the
> >> > sqpoll functionality.
> >>
> >> liburing contains a test suite:
> >> https://git.kernel.dk/cgit/liburing/
> >>
> >> You can run it via 'make runtests'.
> >
> > Thanks Jeff, I'll take a look. Quick question as I start sifting
> > through the tests, are there any tests in here which share a single
> > ring across process boundaries?
>
> Yes. At the very least, this one:
>
> test/across-fork.c
Great, thanks!
--
paul moore
www.paul-moore.com
On 5/26/21 12:44 PM, Paul Moore wrote:
> On Wed, May 26, 2021 at 2:01 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 5/26/21 11:54 AM, Jens Axboe wrote:
>>> On 5/26/21 11:31 AM, Jens Axboe wrote:
>>>> On 5/26/21 11:15 AM, Jens Axboe wrote:
>>>>> On 5/25/21 8:04 PM, Paul Moore wrote:
>>>>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>>>>>>> That said, audit is not for everyone, and we have build time and
>>>>>>>> runtime options to help make life easier. Beyond simply disabling
>>>>>>>> audit at compile time a number of Linux distributions effectively
>>>>>>>> shortcut audit at runtime by adding a "never" rule to the audit
>>>>>>>> filter, for example:
>>>>>>>>
>>>>>>>> % auditctl -a task,never
>>>>>>>
>>>>>>> As has been brought up, the issue we're facing is that distros have
>>>>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
>>>>>>> of people doing custom kernels. My question would then be how much
>>>>>>> overhead the above will add, considering it's an entry/exit call per op.
>>>>>>> If auditctl is turned off, what is the expectation in turns of overhead?
>>>>>>
>>>>>> I commented on that case in my last email to Pavel, but I'll try to go
>>>>>> over it again in a little more detail.
>>>>>>
>>>>>> As we discussed earlier in this thread, we can skip the req->opcode
>>>>>> check before both the _entry and _exit calls, so we are left with just
>>>>>> the bare audit calls in the io_uring code. As the _entry and _exit
>>>>>> functions are small, I've copied them and their supporting functions
>>>>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
>>>>>> "task,never" case.
>>>>>>
>>>>>> + static inline struct audit_context *audit_context(void)
>>>>>> + {
>>>>>> + return current->audit_context;
>>>>>> + }
>>>>>>
>>>>>> + static inline bool audit_dummy_context(void)
>>>>>> + {
>>>>>> + void *p = audit_context();
>>>>>> + return !p || *(int *)p;
>>>>>> + }
>>>>>>
>>>>>> + static inline void audit_uring_entry(u8 op)
>>>>>> + {
>>>>>> + if (unlikely(audit_enabled && audit_context()))
>>>>>> + __audit_uring_entry(op);
>>>>>> + }
>>>>>>
>>>>>> We have one if statement where the conditional checks on two
>>>>>> individual conditions. The first (audit_enabled) is simply a check to
>>>>>> see if anyone has "turned on" auditing at runtime; historically this
>>>>>> worked rather well, and still does in a number of places, but ever
>>>>>> since systemd has taken to forcing audit on regardless of the admin's
>>>>>> audit configuration it is less useful. The second (audit_context())
>>>>>> is a check to see if an audit_context has been allocated for the
>>>>>> current task. In the case of "task,never" current->audit_context will
>>>>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
>>>>>> will never be called.
>>>>>>
>>>>>> Worst case here is checking the value of audit_enabled and
>>>>>> current->audit_context. Depending on which you think is more likely
>>>>>> we can change the order of the check so that the
>>>>>> current->audit_context check is first if you feel that is more likely
>>>>>> to be NULL than audit_enabled is to be false (it may be that way now).
>>>>>>
>>>>>> + static inline void audit_uring_exit(int success, long code)
>>>>>> + {
>>>>>> + if (unlikely(!audit_dummy_context()))
>>>>>> + __audit_uring_exit(success, code);
>>>>>> + }
>>>>>>
>>>>>> The exit call is very similar to the entry call, but in the
>>>>>> "task,never" case it is very simple as the first check to be performed
>>>>>> is the current->audit_context check which we know to be NULL. The
>>>>>> __audit_uring_exit() slowpath will never be called.
>>>>>
>>>>> I actually ran some numbers this morning. The test base is 5.13+, and
>>>>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
>>>>> test and the test with this series applied. I used your git branch as of
>>>>> this morning.
>>>>>
>>>>> The test case is my usual peak perf test, which is random reads at
>>>>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
>>>>> two different tests - one was having a thread just do the IO, the other
>>>>> is using SQPOLL to do the IO for us. The device is capable than more
>>>>> IOPS than a single core can deliver, so we're CPU limited in this test.
>>>>> Hence it's a good test case as it does actual work, and shows software
>>>>> overhead quite nicely. Runs are very stable (less than 0.5% difference
>>>>> between runs on the same base), yet I did average 4 runs.
>>>>>
>>>>> Kernel SQPOLL IOPS Perf diff
>>>>> ---------------------------------------------------------
>>>>> 5.13 0 3029872 0.0%
>>>>> 5.13 1 3031056 0.0%
>>>>> 5.13 + audit 0 2894160 -4.5%
>>>>> 5.13 + audit 1 2886168 -4.8%
>>>>>
>>>>> That's an immediate drop in perf of almost 5%. Looking at a quick
>>>>> profile of it (nothing fancy, just checking for 'audit' in the profile)
>>>>> shows this:
>>>>>
>>>>> + 2.17% io_uring [kernel.vmlinux] [k] __audit_uring_entry
>>>>> + 0.71% io_uring [kernel.vmlinux] [k] __audit_uring_exit
>>>>> 0.07% io_uring [kernel.vmlinux] [k] __audit_syscall_entry
>>>>> 0.02% io_uring [kernel.vmlinux] [k] __audit_syscall_exit
>>>>>
>>>>> Note that this is with _no_ rules!
>>>>
>>>> io_uring also supports a NOP command, which basically just measures
>>>> reqs/sec through the interface. Ran that as well:
>>>>
>>>> Kernel SQPOLL IOPS Perf diff
>>>> ---------------------------------------------------------
>>>> 5.13 0 31.05M 0.0%
>>>> 5.13 + audit 0 25.31M -18.5%
>>>>
>>>> and profile for the latter includes:
>>>>
>>>> + 5.19% io_uring [kernel.vmlinux] [k] __audit_uring_entry
>>>> + 4.31% io_uring [kernel.vmlinux] [k] __audit_uring_exit
>>>> 0.26% io_uring [kernel.vmlinux] [k] __audit_syscall_entry
>>>> 0.08% io_uring [kernel.vmlinux] [k] __audit_syscall_exit
>>>
>>> As Pavel correctly pointed it, looks like auditing is enabled. And
>>> indeed it was! Hence the above numbers is without having turned off
>>> auditing. Running the NOPs after having turned off audit, we get 30.6M
>>> IOPS, which is down about 1.5% from the baseline. The results for the
>>> polled random read test above did _not_ change from this, they are still
>>> down the same amount.
>>>
>>> Note, and I should have included this in the first email, this is not
>>> any kind of argument for or against audit logging. It's purely meant to
>>> be a set of numbers that show how the current series impacts
>>> performance.
>>
>> And finally, just checking if we make it optional per opcode if we see
>> any real impact, and the answer is no. Using the below patch which
>> effectively bypasses audit calls unless the opcode has flagged the need
>> to do so, I cannot measure any difference in perf (as expected).
>>
>> To turn this into something useful, my suggestion as a viable path
>> forward would be:
>>
>> 1) Use something like the below patch and flag request types that we
>> want to do audit logging for.
>>
>> 2) As Pavel suggested, eliminate the need for having both and entry/exit
>> hook, turning it into just one. That effectively cuts the number of
>> checks and calls in half.
>
> I suspect the updated working-io_uring branch with HEAD at
> 1f25193a3f54 (updated a short time ago, see my last email in this
> thread) will improve performance. Also, as has been mention several
> times now, for audit to work we need both the _entry and _exit call.
Pulled in your new stuff, and ran a quick test again (same as before,
the rand reads). No change, we're still down ~5% with auditctl -a
task,never having been run. If I stub out your two audit entry/exit
calls by adding an if (0) before them, then perf is back to normal.
--
Jens Axboe
On Wed, May 26, 2021 at 3:44 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 5/26/21 12:44 PM, Paul Moore wrote:
> > On Wed, May 26, 2021 at 2:01 PM Jens Axboe <axboe@kernel.dk> wrote:
> >> On 5/26/21 11:54 AM, Jens Axboe wrote:
> >>> On 5/26/21 11:31 AM, Jens Axboe wrote:
> >>>> On 5/26/21 11:15 AM, Jens Axboe wrote:
> >>>>> On 5/25/21 8:04 PM, Paul Moore wrote:
> >>>>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
> >>>>>>>> That said, audit is not for everyone, and we have build time and
> >>>>>>>> runtime options to help make life easier. Beyond simply disabling
> >>>>>>>> audit at compile time a number of Linux distributions effectively
> >>>>>>>> shortcut audit at runtime by adding a "never" rule to the audit
> >>>>>>>> filter, for example:
> >>>>>>>>
> >>>>>>>> % auditctl -a task,never
> >>>>>>>
> >>>>>>> As has been brought up, the issue we're facing is that distros have
> >>>>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
> >>>>>>> of people doing custom kernels. My question would then be how much
> >>>>>>> overhead the above will add, considering it's an entry/exit call per op.
> >>>>>>> If auditctl is turned off, what is the expectation in turns of overhead?
> >>>>>>
> >>>>>> I commented on that case in my last email to Pavel, but I'll try to go
> >>>>>> over it again in a little more detail.
> >>>>>>
> >>>>>> As we discussed earlier in this thread, we can skip the req->opcode
> >>>>>> check before both the _entry and _exit calls, so we are left with just
> >>>>>> the bare audit calls in the io_uring code. As the _entry and _exit
> >>>>>> functions are small, I've copied them and their supporting functions
> >>>>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
> >>>>>> "task,never" case.
> >>>>>>
> >>>>>> + static inline struct audit_context *audit_context(void)
> >>>>>> + {
> >>>>>> + return current->audit_context;
> >>>>>> + }
> >>>>>>
> >>>>>> + static inline bool audit_dummy_context(void)
> >>>>>> + {
> >>>>>> + void *p = audit_context();
> >>>>>> + return !p || *(int *)p;
> >>>>>> + }
> >>>>>>
> >>>>>> + static inline void audit_uring_entry(u8 op)
> >>>>>> + {
> >>>>>> + if (unlikely(audit_enabled && audit_context()))
> >>>>>> + __audit_uring_entry(op);
> >>>>>> + }
> >>>>>>
> >>>>>> We have one if statement where the conditional checks on two
> >>>>>> individual conditions. The first (audit_enabled) is simply a check to
> >>>>>> see if anyone has "turned on" auditing at runtime; historically this
> >>>>>> worked rather well, and still does in a number of places, but ever
> >>>>>> since systemd has taken to forcing audit on regardless of the admin's
> >>>>>> audit configuration it is less useful. The second (audit_context())
> >>>>>> is a check to see if an audit_context has been allocated for the
> >>>>>> current task. In the case of "task,never" current->audit_context will
> >>>>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
> >>>>>> will never be called.
> >>>>>>
> >>>>>> Worst case here is checking the value of audit_enabled and
> >>>>>> current->audit_context. Depending on which you think is more likely
> >>>>>> we can change the order of the check so that the
> >>>>>> current->audit_context check is first if you feel that is more likely
> >>>>>> to be NULL than audit_enabled is to be false (it may be that way now).
> >>>>>>
> >>>>>> + static inline void audit_uring_exit(int success, long code)
> >>>>>> + {
> >>>>>> + if (unlikely(!audit_dummy_context()))
> >>>>>> + __audit_uring_exit(success, code);
> >>>>>> + }
> >>>>>>
> >>>>>> The exit call is very similar to the entry call, but in the
> >>>>>> "task,never" case it is very simple as the first check to be performed
> >>>>>> is the current->audit_context check which we know to be NULL. The
> >>>>>> __audit_uring_exit() slowpath will never be called.
> >>>>>
> >>>>> I actually ran some numbers this morning. The test base is 5.13+, and
> >>>>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
> >>>>> test and the test with this series applied. I used your git branch as of
> >>>>> this morning.
> >>>>>
> >>>>> The test case is my usual peak perf test, which is random reads at
> >>>>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
> >>>>> two different tests - one was having a thread just do the IO, the other
> >>>>> is using SQPOLL to do the IO for us. The device is capable than more
> >>>>> IOPS than a single core can deliver, so we're CPU limited in this test.
> >>>>> Hence it's a good test case as it does actual work, and shows software
> >>>>> overhead quite nicely. Runs are very stable (less than 0.5% difference
> >>>>> between runs on the same base), yet I did average 4 runs.
> >>>>>
> >>>>> Kernel SQPOLL IOPS Perf diff
> >>>>> ---------------------------------------------------------
> >>>>> 5.13 0 3029872 0.0%
> >>>>> 5.13 1 3031056 0.0%
> >>>>> 5.13 + audit 0 2894160 -4.5%
> >>>>> 5.13 + audit 1 2886168 -4.8%
> >>>>>
> >>>>> That's an immediate drop in perf of almost 5%. Looking at a quick
> >>>>> profile of it (nothing fancy, just checking for 'audit' in the profile)
> >>>>> shows this:
> >>>>>
> >>>>> + 2.17% io_uring [kernel.vmlinux] [k] __audit_uring_entry
> >>>>> + 0.71% io_uring [kernel.vmlinux] [k] __audit_uring_exit
> >>>>> 0.07% io_uring [kernel.vmlinux] [k] __audit_syscall_entry
> >>>>> 0.02% io_uring [kernel.vmlinux] [k] __audit_syscall_exit
> >>>>>
> >>>>> Note that this is with _no_ rules!
> >>>>
> >>>> io_uring also supports a NOP command, which basically just measures
> >>>> reqs/sec through the interface. Ran that as well:
> >>>>
> >>>> Kernel SQPOLL IOPS Perf diff
> >>>> ---------------------------------------------------------
> >>>> 5.13 0 31.05M 0.0%
> >>>> 5.13 + audit 0 25.31M -18.5%
> >>>>
> >>>> and profile for the latter includes:
> >>>>
> >>>> + 5.19% io_uring [kernel.vmlinux] [k] __audit_uring_entry
> >>>> + 4.31% io_uring [kernel.vmlinux] [k] __audit_uring_exit
> >>>> 0.26% io_uring [kernel.vmlinux] [k] __audit_syscall_entry
> >>>> 0.08% io_uring [kernel.vmlinux] [k] __audit_syscall_exit
> >>>
> >>> As Pavel correctly pointed it, looks like auditing is enabled. And
> >>> indeed it was! Hence the above numbers is without having turned off
> >>> auditing. Running the NOPs after having turned off audit, we get 30.6M
> >>> IOPS, which is down about 1.5% from the baseline. The results for the
> >>> polled random read test above did _not_ change from this, they are still
> >>> down the same amount.
> >>>
> >>> Note, and I should have included this in the first email, this is not
> >>> any kind of argument for or against audit logging. It's purely meant to
> >>> be a set of numbers that show how the current series impacts
> >>> performance.
> >>
> >> And finally, just checking if we make it optional per opcode if we see
> >> any real impact, and the answer is no. Using the below patch which
> >> effectively bypasses audit calls unless the opcode has flagged the need
> >> to do so, I cannot measure any difference in perf (as expected).
> >>
> >> To turn this into something useful, my suggestion as a viable path
> >> forward would be:
> >>
> >> 1) Use something like the below patch and flag request types that we
> >> want to do audit logging for.
> >>
> >> 2) As Pavel suggested, eliminate the need for having both and entry/exit
> >> hook, turning it into just one. That effectively cuts the number of
> >> checks and calls in half.
> >
> > I suspect the updated working-io_uring branch with HEAD at
> > 1f25193a3f54 (updated a short time ago, see my last email in this
> > thread) will improve performance. Also, as has been mention several
> > times now, for audit to work we need both the _entry and _exit call.
>
> Pulled in your new stuff, and ran a quick test again (same as before,
> the rand reads). No change, we're still down ~5% with auditctl -a
> task,never having been run. If I stub out your two audit entry/exit
> calls by adding an if (0) before them, then perf is back to normal.
That's fun.
Considering that we are down to basically a single if-conditional in
the audit code and your other tests with another single conditional
yielded no significant change it looks like we really only have one
option left before we "agree to disagree" and have to mark io_uring
and audit as mutually exclusive in Kconfig. If we moved the _entry
and _exit calls into the individual operation case blocks (quick
openat example below) so that only certain operations were able to be
audited would that be acceptable assuming the high frequency ops were
untouched? My initial gut feeling was that this would involve >50% of
the ops, but Steve Grubb seems to think it would be less; it may be
time to look at that a bit more seriously, but if it gets a NACK
regardless it isn't worth the time - thoughts?
case IORING_OP_OPENAT:
audit_uring_entry(req->opcode);
ret = io_openat(req, issue_flags);
audit_uring_exit(!ret, ret);
break;
--
paul moore
www.paul-moore.com
On Wed, May 26, 2021 at 10:48 AM Stefan Metzmacher <metze@samba.org> wrote: > > Hi Paul, Hi Stefan. > > #define CREATE_TRACE_POINTS > > #include <trace/events/io_uring.h> > > @@ -6537,6 +6538,11 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req, > > if (!req->work.creds) > > return -EINVAL; > > get_cred(req->work.creds); > > + ret = security_uring_override_creds(req->work.creds); > > + if (ret) { > > + put_cred(req->work.creds); > > + return ret; > > + } > > Why are you calling this per requests, shouldn't this be done in > io_register_personality()? Generally speaking it is more interesting to see when user alice tries to impersonate bob and not when bob registers his ID as available to use by others. We could always add a LSM hook to control when bob registers his ID, but I think the impersonation is the critical code path. However, if I'm misunderstanding how this works in io_uring please correct me. > I'm also not sure if this really gains anything as io_register_personality() > only captures the value of get_current_cred(), so the process already has changed to > the credentials (at least once for the io_uring_register(IORING_REGISTER_PERSONALITY) > call). > > metze -- paul moore www.paul-moore.com
On 2021-05-26 11:22, Jens Axboe wrote: > On 5/26/21 9:49 AM, Richard Guy Briggs wrote: > >> So why is there anything special needed for io_uring (now that the > >> native worker threads are used)? > > > > Because syscall has been bypassed by a memory-mapped work queue. > > I don't follow this one at all, that's just the delivery mechanism if > you choose to use SQPOLL. If you do, then a thread sibling of the > original task does the actual system call. There's no magic involved > there, and the tasks are related. > > So care to expand on that? These may be poor examples, but hear me out... In the case of an open call, I'm guessing that they are mapped 1:1 syscall to io_uring action so that the action can be asynchronous. So in this case, there is a record of the action, but we don't see the result success/failure. I assume that file paths can only be specified in the original syscall and not in an SQPOLL action? In the case of a syscall read/write (which aren't as interesting generally to audit, but I'm concerned there are other similar situations that are), the syscall would be called for every buffer that is passed back and forth user/kernel and vice versa, providing a way to log all that activity. In the case of SQPOLL, I understand that a syscall sets up the action and then io_uring takes over and does the bulk transfer and we'd not have the visibility of how many times that action was repeated nor that the result success/failure was due to its asynchrony. Perhaps I am showing my ignorance, so please correct me if I have it wrong. > >> Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall > >> would do? If so, I think that should just be fixed... > > > > This is by design to speed it up. This is what Paul's iouring entry and > > exit hooks do. > > As far as I can tell, we're doing double logging at that point, if the > syscall helper does the audit as well. We'll get something logging the > io_uring opcode (eg IORING_OP_OPENAT2), then log again when we call the > fs helper. That's _assuming_ that we always hit the logging part when we > call into the vfs, but that's something that can be updated to be true > and kept an eye on for future additions. > > Why is the double logging useful? It only tells you that the invocation > was via io_uring as the delivery mechanism rather than the usual system > call, but the effect is the same - the file is opened, for example. > > I feel like I'm missing something here, or the other side is. Or both! Paul addressed this in his reply, but let me add a more concrete example... There was one corner case I was looking at that showed up this issue. Please indicate if I have mischaracterized or misunderstood. A syscall would generate records something like this: AUDIT_SYSCALL AUDIT_... AUDIT_EOE A io_uring SQPOLL event would generate something like this: AUDIT_URINGOP AUDIT_... AUDIT_EOE The "hybrid" event that is a syscall that starts an io_uring action would generate something like this: AUDIT_URINGOP [possible AUDIT_CONFIG_CHANGE (from killed_trees)] AUDIT_SYSCALL AUDIT_... AUDIT_EOE The AUDIT_... is all the operation-specific records that log parameters that aren't able to be expressed in the SYSLOG or URINGOP records such as pathnames, other arguments, and context (pwd, etc...). So this isn't "double logging". It is either introducing an io_uring event, or it is providing more detail about the io_uring arguments to a syscall event. > Jens Axboe - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote:
> ... If we moved the _entry
> and _exit calls into the individual operation case blocks (quick
> openat example below) so that only certain operations were able to be
> audited would that be acceptable assuming the high frequency ops were
> untouched? My initial gut feeling was that this would involve >50% of
> the ops, but Steve Grubb seems to think it would be less; it may be
> time to look at that a bit more seriously, but if it gets a NACK
> regardless it isn't worth the time - thoughts?
>
> case IORING_OP_OPENAT:
> audit_uring_entry(req->opcode);
> ret = io_openat(req, issue_flags);
> audit_uring_exit(!ret, ret);
> break;
I wanted to pose this question again in case it was lost in the
thread, I suspect this may be the last option before we have to "fix"
things at the Kconfig level. I definitely don't want to have to go
that route, and I suspect most everyone on this thread feels the same,
so I'm hopeful we can find a solution that is begrudgingly acceptable
to both groups.
--
paul moore
www.paul-moore.com
On 2021-05-21 17:50, Paul Moore wrote: > WARNING - This is a work in progress and should not be merged > anywhere important. It is almost surely not complete, and while it > probably compiles it likely hasn't been booted and will do terrible > things. You have been warned. > > This patch adds basic audit io_uring filtering, using as much of the > existing audit filtering infrastructure as possible. In order to do > this we reuse the audit filter rule's syscall mask for the io_uring > operation and we create a new filter for io_uring operations as > AUDIT_FILTER_URING_EXIT/audit_filter_list[7]. > > <TODO - provide some additional guidance for the userspace tools> > > Signed-off-by: Paul Moore <paul@paul-moore.com> > --- > include/uapi/linux/audit.h | 3 +- > kernel/auditfilter.c | 4 ++- > kernel/auditsc.c | 65 ++++++++++++++++++++++++++++++++++---------- > 3 files changed, 55 insertions(+), 17 deletions(-) > > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h > index b26e0c435e8b..621eb3c1076e 100644 > --- a/include/uapi/linux/audit.h > +++ b/include/uapi/linux/audit.h > @@ -167,8 +167,9 @@ > #define AUDIT_FILTER_EXCLUDE 0x05 /* Apply rule before record creation */ > #define AUDIT_FILTER_TYPE AUDIT_FILTER_EXCLUDE /* obsolete misleading naming */ > #define AUDIT_FILTER_FS 0x06 /* Apply rule at __audit_inode_child */ > +#define AUDIT_FILTER_URING_EXIT 0x07 /* Apply rule at io_uring op exit */ > > -#define AUDIT_NR_FILTERS 7 > +#define AUDIT_NR_FILTERS 8 > > #define AUDIT_FILTER_PREPEND 0x10 /* Prepend to front of list */ > > diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c > index db2c6b59dfc3..c21119c00504 100644 > --- a/kernel/auditfilter.c > +++ b/kernel/auditfilter.c > @@ -44,7 +44,8 @@ struct list_head audit_filter_list[AUDIT_NR_FILTERS] = { > LIST_HEAD_INIT(audit_filter_list[4]), > LIST_HEAD_INIT(audit_filter_list[5]), > LIST_HEAD_INIT(audit_filter_list[6]), > -#if AUDIT_NR_FILTERS != 7 > + LIST_HEAD_INIT(audit_filter_list[7]), > +#if AUDIT_NR_FILTERS != 8 > #error Fix audit_filter_list initialiser > #endif > }; > @@ -56,6 +57,7 @@ static struct list_head audit_rules_list[AUDIT_NR_FILTERS] = { > LIST_HEAD_INIT(audit_rules_list[4]), > LIST_HEAD_INIT(audit_rules_list[5]), > LIST_HEAD_INIT(audit_rules_list[6]), > + LIST_HEAD_INIT(audit_rules_list[7]), > }; > > DEFINE_MUTEX(audit_filter_mutex); > diff --git a/kernel/auditsc.c b/kernel/auditsc.c > index d8aa2c690bf9..4f6ab34020fb 100644 > --- a/kernel/auditsc.c > +++ b/kernel/auditsc.c > @@ -799,6 +799,35 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val) > return rule->mask[word] & bit; > } > > +/** > + * audit_filter_uring - apply filters to an io_uring operation > + * @tsk: associated task > + * @ctx: audit context > + */ > +static void audit_filter_uring(struct task_struct *tsk, > + struct audit_context *ctx) > +{ > + struct audit_entry *e; > + enum audit_state state; > + > + if (auditd_test_task(tsk)) > + return; Is this necessary? auditd and auditctl don't (intentionally) use any io_uring functionality. Is it possible it might inadvertantly use some by virtue of libc or other library calls now or in the future? > + rcu_read_lock(); > + list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_URING_EXIT], > + list) { > + if (audit_in_mask(&e->rule, ctx->uring_op) && While this seems like the most obvious approach given the parallels between syscalls and io_uring operations, as coded here it won't work due to the different mappings of syscall numbers and io_uring operations unless we re-use the auditctl -S field with raw io_uring operation numbers in the place of syscall numbers. This should have been obvious to me when I first looked at this patch. It became obvious when I started looking at the userspace auditctl.c. The easy first step would be to use something like this: auditctl -a uring,always -S 18,28 -F key=uring_open to monitor file open commands only. The same is not yet possible for the perm field, but there are so few io_uring ops at this point compared with syscalls that it might be manageable. The arch is irrelevant since io_uring operation numbers are identical across all hardware as far as I can tell. Most of the rest of the fields should make sense if they do for a syscall rule. Here's a sample of userspace code to support this patch: https://github.com/rgbriggs/audit-userspace/commit/a77baa1651b7ad841a220eb962d4cc92bc07dc96 https://github.com/linux-audit/audit-userspace/compare/master...rgbriggs:ghau-iouring-filtering.v1.0 If we abuse the syscall infrastructure at first, we'd need a transition plan to coordinate user and kernel switchover to seperate mechanisms for the two to work together if the need should arise to have both syscall and uring filters in the same rule. I'm still exploring ideas on the best place to put this translation table, in userspace libaudit, kaudit, io_op_defs[], include/uapi/linux/io_uring.h, ... For speed, the best would be in userspace libaudit, but that will be the least obvious place for any io_uring developer to look when making any updates to the list of io_uring operations and will most likely result in additions escaping security logging. Next best for speed would be in kaudit when instantiating rules, to translate the syscall numbers to operation numbers and store them natively in a different mask (audit_in_mask). This also runs the risk of additions escaping security logging. After that, they could reuse the existing audit syscall filter infrastructure and mask and translate on the fly from the specific op in use to its equivalent syscall number when checking the existing filter. This is the least disruptive to the audit code and the most likely to be updated when new io_uring operations are added, but will incur an extra step to translate the io_uring operation number to a syscall number in the hot path. Given that multiple aspects of security appear to have been a complete afterthought to this feature, necessitating it to be bolted on after the fact, it appears that the last option might be the most attractive to the security folks, making this as easy as possible for fs folks to update the audit code would be necessary to maintain security. If there isn't a direct mapping between io_uring operations and syscalls (the reverse isn't needed), then we'll need to duplicate the mechanism throughout the stack starting in userspace auditctl and libaudit. Currently the bitmap for syscalls is 2k. The current io_uring op list appears to be 37. It might be wise to deliberately not support auditctl "-w" (and the exported audit_add_watch() function) since that is currently hardcoded to the AUDIT_FILTER_EXIT list and is the old watch form [replaced by audit_rule_fieldpair_data()] anyways that is more likely to be deprecated. It also appears to make sense not to support autrace (at least initially). Does any of this roughly make sense? > + audit_filter_rules(tsk, &e->rule, ctx, NULL, &state, > + false)) { > + rcu_read_unlock(); > + ctx->current_state = state; > + return; > + } > + } > + rcu_read_unlock(); > + return; > +} > + > /* At syscall exit time, this filter is called if the audit_state is > * not low enough that auditing cannot take place, but is also not > * high enough that we already know we have to write an audit record > @@ -1754,7 +1783,7 @@ static void audit_log_exit(void) > * __audit_free - free a per-task audit context > * @tsk: task whose audit context block to free > * > - * Called from copy_process and do_exit > + * Called from copy_process, do_exit, and the io_uring code > */ > void __audit_free(struct task_struct *tsk) > { > @@ -1772,15 +1801,21 @@ void __audit_free(struct task_struct *tsk) > * random task_struct that doesn't doesn't have any meaningful data we > * need to log via audit_log_exit(). > */ > - if (tsk == current && !context->dummy && > - context->context == AUDIT_CTX_SYSCALL) { > + if (tsk == current && !context->dummy) { > context->return_valid = AUDITSC_INVALID; > context->return_code = 0; > - > - audit_filter_syscall(tsk, context); > - audit_filter_inodes(tsk, context); > - if (context->current_state == AUDIT_RECORD_CONTEXT) > - audit_log_exit(); > + if (context->context == AUDIT_CTX_SYSCALL) { > + audit_filter_syscall(tsk, context); > + audit_filter_inodes(tsk, context); > + if (context->current_state == AUDIT_RECORD_CONTEXT) > + audit_log_exit(); > + } else if (context->context == AUDIT_CTX_URING) { > + /* TODO: verify this case is real and valid */ > + audit_filter_uring(tsk, context); > + audit_filter_inodes(tsk, context); > + if (context->current_state == AUDIT_RECORD_CONTEXT) > + audit_log_uring(context); > + } > } > > audit_set_context(tsk, NULL); > @@ -1861,12 +1896,6 @@ void __audit_uring_exit(int success, long code) > { > struct audit_context *ctx = audit_context(); > > - /* > - * TODO: At some point we will likely want to filter on io_uring ops > - * and other things similar to what we do for syscalls, but that > - * is something for another day; just record what we can here. > - */ > - > if (!ctx || ctx->dummy) > goto out; > if (ctx->context == AUDIT_CTX_SYSCALL) { > @@ -1891,6 +1920,8 @@ void __audit_uring_exit(int success, long code) > * the behavior here. > */ > audit_filter_syscall(current, ctx); > + if (ctx->current_state != AUDIT_RECORD_CONTEXT) > + audit_filter_uring(current, ctx); > audit_filter_inodes(current, ctx); > if (ctx->current_state != AUDIT_RECORD_CONTEXT) > return; > @@ -1899,7 +1930,9 @@ void __audit_uring_exit(int success, long code) > return; > } > #if 1 > - /* XXX - temporary hack to force record generation */ > + /* XXX - temporary hack to force record generation, we are leaving this > + * enabled, but if you want to actually test the filtering you > + * need to disable this #if/#endif block */ > ctx->current_state = AUDIT_RECORD_CONTEXT; > #endif > > @@ -1907,6 +1940,8 @@ void __audit_uring_exit(int success, long code) > if (!list_empty(&ctx->killed_trees)) > audit_kill_trees(ctx); > > + /* run through both filters to ensure we set the filterkey properly */ > + audit_filter_uring(current, ctx); > audit_filter_inodes(current, ctx); > if (ctx->current_state != AUDIT_RECORD_CONTEXT) > goto out; > > -- > Linux-audit mailing list > Linux-audit@redhat.com > https://listman.redhat.com/mailman/listinfo/linux-audit - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
On Fri, May 28, 2021 at 6:36 PM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2021-05-21 17:50, Paul Moore wrote: > > WARNING - This is a work in progress and should not be merged > > anywhere important. It is almost surely not complete, and while it > > probably compiles it likely hasn't been booted and will do terrible > > things. You have been warned. > > > > This patch adds basic audit io_uring filtering, using as much of the > > existing audit filtering infrastructure as possible. In order to do > > this we reuse the audit filter rule's syscall mask for the io_uring > > operation and we create a new filter for io_uring operations as > > AUDIT_FILTER_URING_EXIT/audit_filter_list[7]. > > > > <TODO - provide some additional guidance for the userspace tools> > > > > Signed-off-by: Paul Moore <paul@paul-moore.com> > > --- > > include/uapi/linux/audit.h | 3 +- > > kernel/auditfilter.c | 4 ++- > > kernel/auditsc.c | 65 ++++++++++++++++++++++++++++++++++---------- > > 3 files changed, 55 insertions(+), 17 deletions(-) ... > > diff --git a/kernel/auditsc.c b/kernel/auditsc.c > > index d8aa2c690bf9..4f6ab34020fb 100644 > > --- a/kernel/auditsc.c > > +++ b/kernel/auditsc.c > > @@ -799,6 +799,35 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val) > > return rule->mask[word] & bit; > > } > > > > +/** > > + * audit_filter_uring - apply filters to an io_uring operation > > + * @tsk: associated task > > + * @ctx: audit context > > + */ > > +static void audit_filter_uring(struct task_struct *tsk, > > + struct audit_context *ctx) > > +{ > > + struct audit_entry *e; > > + enum audit_state state; > > + > > + if (auditd_test_task(tsk)) > > + return; > > Is this necessary? auditd and auditctl don't (intentionally) use any > io_uring functionality. Is it possible it might inadvertantly use some > by virtue of libc or other library calls now or in the future? I think the better question is what harm does it do? Yes, I'm not aware of an auditd implementation that currently makes use of io_uring, but it is also not inconceivable some future implementation might want to make use of it and given the disjoint nature of kernel and userspace development I don't want the kernel to block such developments. However, if you can think of a reason why having this check here is bad I'm listening (note: we are already in the slow path here so having the additional check isn't an issue as far as I'm concerned). As a reminder, auditd_test_task() only returns true/1 if the task is registered with the audit subsystem as an auditd connection, an auditctl process should not cause this function to return true. > > + rcu_read_lock(); > > + list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_URING_EXIT], > > + list) { > > + if (audit_in_mask(&e->rule, ctx->uring_op) && > > While this seems like the most obvious approach given the parallels > between syscalls and io_uring operations, as coded here it won't work > due to the different mappings of syscall numbers and io_uring > operations unless we re-use the auditctl -S field with raw io_uring > operation numbers in the place of syscall numbers. This should have > been obvious to me when I first looked at this patch. It became obvious > when I started looking at the userspace auditctl.c. FWIW, my intention was to treat io_uring opcodes exactly like we treat syscall numbers. Yes, this would potentially be an issue if we wanted to combine syscalls and io_uring opcodes into one filter, but why would we ever want to do that? Combining the two into one filter not only makes the filter lists longer than needed (we will always know if we are filtering on a syscall or io_uring op) and complicates the filter rule processing. Or is there a problem with this that I'm missing? > The easy first step would be to use something like this: > auditctl -a uring,always -S 18,28 -F key=uring_open > to monitor file open commands only. The same is not yet possible for > the perm field, but there are so few io_uring ops at this point compared > with syscalls that it might be manageable. The arch is irrelevant since > io_uring operation numbers are identical across all hardware as far as I > can tell. Most of the rest of the fields should make sense if they do > for a syscall rule. I've never been a fan of audit's "perm" filtering; I've always felt there were better ways to handle that so I'm not overly upset that we are skipping that functionality with this initial support. If it becomes a problem in the future we can always add that support at a later date. I currently fear that just getting io_uring and audit to coexist is going to be a large enough problem in the immediate future. > Here's a sample of userspace code to support this > patch: > https://github.com/rgbriggs/audit-userspace/commit/a77baa1651b7ad841a220eb962d4cc92bc07dc96 > https://github.com/linux-audit/audit-userspace/compare/master...rgbriggs:ghau-iouring-filtering.v1.0 Great, thank you. I haven't grabbed a copy yet for testing, but I will. > If we abuse the syscall infrastructure at first, we'd need a transition > plan to coordinate user and kernel switchover to seperate mechanisms for > the two to work together if the need should arise to have both syscall > and uring filters in the same rule. See my comments above, I don't currently see why we would ever want syscall and io_uring filtering to happen in the same rule. Please speak up if you can think of a reason why this would either be needed, or desirable for some reason. > It might be wise to deliberately not support auditctl "-w" (and the > exported audit_add_watch() function) since that is currently hardcoded > to the AUDIT_FILTER_EXIT list and is the old watch form [replaced by > audit_rule_fieldpair_data()] anyways that is more likely to be > deprecated. It also appears to make sense not to support autrace (at > least initially). I'm going to be honest with you and simply say that I've run out of my email/review time in front of the computer on this holiday weekend (blame the lockdown/bpf/lsm discussion <g>) and I need to go for today, but this is something I'll take a look it this coming week. Hopefully the comments above give us something to think/talk about in the meantime. Regardless, thanks for your help on the userspace side of the filtering, that should make testing a lot easier moving forward. -- paul moore www.paul-moore.com
On 2021-05-30 11:26, Paul Moore wrote: > On Fri, May 28, 2021 at 6:36 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > On 2021-05-21 17:50, Paul Moore wrote: > > > WARNING - This is a work in progress and should not be merged > > > anywhere important. It is almost surely not complete, and while it > > > probably compiles it likely hasn't been booted and will do terrible > > > things. You have been warned. > > > > > > This patch adds basic audit io_uring filtering, using as much of the > > > existing audit filtering infrastructure as possible. In order to do > > > this we reuse the audit filter rule's syscall mask for the io_uring > > > operation and we create a new filter for io_uring operations as > > > AUDIT_FILTER_URING_EXIT/audit_filter_list[7]. > > > > > > <TODO - provide some additional guidance for the userspace tools> > > > > > > Signed-off-by: Paul Moore <paul@paul-moore.com> > > > --- > > > include/uapi/linux/audit.h | 3 +- > > > kernel/auditfilter.c | 4 ++- > > > kernel/auditsc.c | 65 ++++++++++++++++++++++++++++++++++---------- > > > 3 files changed, 55 insertions(+), 17 deletions(-) > > ... > > > > diff --git a/kernel/auditsc.c b/kernel/auditsc.c > > > index d8aa2c690bf9..4f6ab34020fb 100644 > > > --- a/kernel/auditsc.c > > > +++ b/kernel/auditsc.c > > > @@ -799,6 +799,35 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val) > > > return rule->mask[word] & bit; > > > } > > > > > > +/** > > > + * audit_filter_uring - apply filters to an io_uring operation > > > + * @tsk: associated task > > > + * @ctx: audit context > > > + */ > > > +static void audit_filter_uring(struct task_struct *tsk, > > > + struct audit_context *ctx) > > > +{ > > > + struct audit_entry *e; > > > + enum audit_state state; > > > + > > > + if (auditd_test_task(tsk)) > > > + return; > > > > Is this necessary? auditd and auditctl don't (intentionally) use any > > io_uring functionality. Is it possible it might inadvertantly use some > > by virtue of libc or other library calls now or in the future? > > I think the better question is what harm does it do? Yes, I'm not > aware of an auditd implementation that currently makes use of > io_uring, but it is also not inconceivable some future implementation > might want to make use of it and given the disjoint nature of kernel > and userspace development I don't want the kernel to block such > developments. However, if you can think of a reason why having this > check here is bad I'm listening (note: we are already in the slow path > here so having the additional check isn't an issue as far as I'm > concerned). > > As a reminder, auditd_test_task() only returns true/1 if the task is > registered with the audit subsystem as an auditd connection, an > auditctl process should not cause this function to return true. My main concern was overhead, since the whole goal of io_uring is speed. The chances that audit does use this functionality in the future suggest to me that it is best to leave this check in. > > > + rcu_read_lock(); > > > + list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_URING_EXIT], > > > + list) { > > > + if (audit_in_mask(&e->rule, ctx->uring_op) && > > > > While this seems like the most obvious approach given the parallels > > between syscalls and io_uring operations, as coded here it won't work > > due to the different mappings of syscall numbers and io_uring > > operations unless we re-use the auditctl -S field with raw io_uring > > operation numbers in the place of syscall numbers. This should have > > been obvious to me when I first looked at this patch. It became obvious > > when I started looking at the userspace auditctl.c. > > FWIW, my intention was to treat io_uring opcodes exactly like we treat > syscall numbers. Yes, this would potentially be an issue if we wanted > to combine syscalls and io_uring opcodes into one filter, but why > would we ever want to do that? Combining the two into one filter not > only makes the filter lists longer than needed (we will always know if > we are filtering on a syscall or io_uring op) and complicates the > filter rule processing. > > Or is there a problem with this that I'm missing? No, I think you have a good understanding of it. I'm asking hard questions to avoid missing something important. If we can reuse the syscall infrastructure for this then that is extremely helpful (if not lazy, which isn't necessarily a bad thing). It does mean that the io_uring op dictionary will need to live in userspace audit the way it is currently implemented, or provide a flag to indicate it is a syscall number to be translated in the kernel either at the time of rule addition or translated on the fly on rule check in the kernel adding overhead to a critical path. > > The easy first step would be to use something like this: > > auditctl -a uring,always -S 18,28 -F key=uring_open > > to monitor file open commands only. The same is not yet possible for > > the perm field, but there are so few io_uring ops at this point compared > > with syscalls that it might be manageable. The arch is irrelevant since > > io_uring operation numbers are identical across all hardware as far as I > > can tell. Most of the rest of the fields should make sense if they do > > for a syscall rule. > > I've never been a fan of audit's "perm" filtering; I've always felt > there were better ways to handle that so I'm not overly upset that we > are skipping that functionality with this initial support. If it > becomes a problem in the future we can always add that support at a > later date. Ok, I don't see a pressing need to add it initially, but should add a check to block that field from being used to avoid the confusion of unpredictable behaviour should someone try to add a perm filter to a io_uring filter. That should be done protectively in the kernel and proactively in userspace. > I currently fear that just getting io_uring and audit to coexist is > going to be a large enough problem in the immediate future. Agreed. > > Here's a sample of userspace code to support this > > patch: > > https://github.com/rgbriggs/audit-userspace/commit/a77baa1651b7ad841a220eb962d4cc92bc07dc96 > > https://github.com/linux-audit/audit-userspace/compare/master...rgbriggs:ghau-iouring-filtering.v1.0 > > Great, thank you. I haven't grabbed a copy yet for testing, but I will. I've added a perm filter block as an additional patch in userspace and updated the tree so that first commit is no longer the top of tree but the branch name is current. I'll add a kernel perm filter check. I just noticed some list checking that is missing in tree and watch in your patch. Suggested fixup patches to follow... > > If we abuse the syscall infrastructure at first, we'd need a transition > > plan to coordinate user and kernel switchover to seperate mechanisms for > > the two to work together if the need should arise to have both syscall > > and uring filters in the same rule. > > See my comments above, I don't currently see why we would ever want > syscall and io_uring filtering to happen in the same rule. Please > speak up if you can think of a reason why this would either be needed, > or desirable for some reason. I think they can be seperate rules for now. Either a syscall rule catching all io_uring ops can be added, or an io_uring rule can be added to catch specific ops. The scenario I was thinking of was catching syscalls of specific io_uring ops. > > It might be wise to deliberately not support auditctl "-w" (and the > > exported audit_add_watch() function) since that is currently hardcoded > > to the AUDIT_FILTER_EXIT list and is the old watch form [replaced by > > audit_rule_fieldpair_data()] anyways that is more likely to be > > deprecated. It also appears to make sense not to support autrace (at > > least initially). > > I'm going to be honest with you and simply say that I've run out of my > email/review time in front of the computer on this holiday weekend > (blame the lockdown/bpf/lsm discussion <g>) and I need to go for > today, but this is something I'll take a look it this coming week. > Hopefully the comments above give us something to think/talk about in > the meantime. I wasn't expecting you to work the weekend. :-) > Regardless, thanks for your help on the userspace side of the > filtering, that should make testing a lot easier moving forward. Standard RFC disclaimers apply. > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
The commit ("audit: add filtering for io_uring records") added support for filtering io_uring operations. Add checks to the audit io_uring filtering code for directory and path watches, and to keep the list counts consistent. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> --- kernel/audit_tree.c | 3 ++- kernel/audit_watch.c | 3 ++- kernel/auditfilter.c | 7 +++++-- 3 files changed, 9 insertions(+), 4 deletions(-) diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c index 6c91902f4f45..2be285c2f069 100644 --- a/kernel/audit_tree.c +++ b/kernel/audit_tree.c @@ -727,7 +727,8 @@ int audit_make_tree(struct audit_krule *rule, char *pathname, u32 op) { if (pathname[0] != '/' || - rule->listnr != AUDIT_FILTER_EXIT || + (rule->listnr != AUDIT_FILTER_EXIT && + rule->listnr != AUDIT_FILTER_URING_EXIT) || op != Audit_equal || rule->inode_f || rule->watch || rule->tree) return -EINVAL; diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c index 2acf7ca49154..698b62b4a2ec 100644 --- a/kernel/audit_watch.c +++ b/kernel/audit_watch.c @@ -183,7 +183,8 @@ int audit_to_watch(struct audit_krule *krule, char *path, int len, u32 op) return -EOPNOTSUPP; if (path[0] != '/' || path[len-1] == '/' || - krule->listnr != AUDIT_FILTER_EXIT || + (krule->listnr != AUDIT_FILTER_EXIT && + krule->listnr != AUDIT_FILTER_URING_EXIT) || op != Audit_equal || krule->inode_f || krule->watch || krule->tree) return -EINVAL; diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c index c21119c00504..bcdedfd1088c 100644 --- a/kernel/auditfilter.c +++ b/kernel/auditfilter.c @@ -153,7 +153,8 @@ char *audit_unpack_string(void **bufp, size_t *remain, size_t len) static inline int audit_to_inode(struct audit_krule *krule, struct audit_field *f) { - if (krule->listnr != AUDIT_FILTER_EXIT || + if ((krule->listnr != AUDIT_FILTER_EXIT && + krule->listnr != AUDIT_FILTER_URING_EXIT) || krule->inode_f || krule->watch || krule->tree || (f->op != Audit_equal && f->op != Audit_not_equal)) return -EINVAL; @@ -250,6 +251,7 @@ static inline struct audit_entry *audit_to_entry_common(struct audit_rule_data * pr_err("AUDIT_FILTER_ENTRY is deprecated\n"); goto exit_err; case AUDIT_FILTER_EXIT: + case AUDIT_FILTER_URING_EXIT: case AUDIT_FILTER_TASK: #endif case AUDIT_FILTER_USER: @@ -982,7 +984,8 @@ static inline int audit_add_rule(struct audit_entry *entry) } entry->rule.prio = ~0ULL; - if (entry->rule.listnr == AUDIT_FILTER_EXIT) { + if (entry->rule.listnr == AUDIT_FILTER_EXIT || + entry->rule.listnr == AUDIT_FILTER_URING_EXIT) { if (entry->rule.flags & AUDIT_FILTER_PREPEND) entry->rule.prio = ++prio_high; else -- 2.27.0
The commit ("audit: add filtering for io_uring records") added support for filtering io_uring operations. The PERM field is invalid for io_uring filtering, so block it. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> --- kernel/auditfilter.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c index bcdedfd1088c..d75acb014ccd 100644 --- a/kernel/auditfilter.c +++ b/kernel/auditfilter.c @@ -336,6 +336,10 @@ static int audit_field_valid(struct audit_entry *entry, struct audit_field *f) if (entry->rule.listnr != AUDIT_FILTER_FS) return -EINVAL; break; + case AUDIT_PERM: + if (entry->rule.listnr == AUDIT_FILTER_URING_EXIT) + return -EINVAL; + break; } switch (entry->rule.listnr) { -- 2.27.0
[-- Attachment #1: Type: text/plain, Size: 4288 bytes --] Hi Richard, Thank you for the patch! Yet something to improve: [auto build test ERROR on pcmoore-audit/next] [also build test ERROR on v5.13-rc4 next-20210528] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/Richard-Guy-Briggs/audit-add-filtering-for-io_uring-records-addendum/20210531-214941 base: https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit.git next config: parisc-randconfig-r015-20210531 (attached as .config) compiler: hppa-linux-gcc (GCC) 9.3.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/0day-ci/linux/commit/07a3e22a2f984838bc98b43b58e8ef08e9353483 git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Richard-Guy-Briggs/audit-add-filtering-for-io_uring-records-addendum/20210531-214941 git checkout 07a3e22a2f984838bc98b43b58e8ef08e9353483 # save the attached .config to linux build tree COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=parisc If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot <lkp@intel.com> All errors (new ones prefixed by >>): kernel/auditfilter.c: In function 'audit_to_inode': >> kernel/auditfilter.c:155:24: error: 'AUDIT_FILTER_URING_EXIT' undeclared (first use in this function); did you mean 'AUDIT_FILTER_EXIT'? 155 | krule->listnr != AUDIT_FILTER_URING_EXIT) || | ^~~~~~~~~~~~~~~~~~~~~~~ | AUDIT_FILTER_EXIT kernel/auditfilter.c:155:24: note: each undeclared identifier is reported only once for each function it appears in kernel/auditfilter.c: In function 'audit_to_entry_common': kernel/auditfilter.c:252:7: error: 'AUDIT_FILTER_URING_EXIT' undeclared (first use in this function); did you mean 'AUDIT_FILTER_EXIT'? 252 | case AUDIT_FILTER_URING_EXIT: | ^~~~~~~~~~~~~~~~~~~~~~~ | AUDIT_FILTER_EXIT kernel/auditfilter.c: In function 'audit_add_rule': kernel/auditfilter.c:986:28: error: 'AUDIT_FILTER_URING_EXIT' undeclared (first use in this function); did you mean 'AUDIT_FILTER_EXIT'? 986 | entry->rule.listnr == AUDIT_FILTER_URING_EXIT) { | ^~~~~~~~~~~~~~~~~~~~~~~ | AUDIT_FILTER_EXIT -- kernel/audit_watch.c: In function 'audit_to_watch': >> kernel/audit_watch.c:187:24: error: 'AUDIT_FILTER_URING_EXIT' undeclared (first use in this function); did you mean 'AUDIT_FILTER_EXIT'? 187 | krule->listnr != AUDIT_FILTER_URING_EXIT) || | ^~~~~~~~~~~~~~~~~~~~~~~ | AUDIT_FILTER_EXIT kernel/audit_watch.c:187:24: note: each undeclared identifier is reported only once for each function it appears in -- kernel/audit_tree.c: In function 'audit_make_tree': >> kernel/audit_tree.c:731:23: error: 'AUDIT_FILTER_URING_EXIT' undeclared (first use in this function); did you mean 'AUDIT_FILTER_EXIT'? 731 | rule->listnr != AUDIT_FILTER_URING_EXIT) || | ^~~~~~~~~~~~~~~~~~~~~~~ | AUDIT_FILTER_EXIT kernel/audit_tree.c:731:23: note: each undeclared identifier is reported only once for each function it appears in vim +155 kernel/auditfilter.c 149 150 /* Translate an inode field to kernel representation. */ 151 static inline int audit_to_inode(struct audit_krule *krule, 152 struct audit_field *f) 153 { 154 if ((krule->listnr != AUDIT_FILTER_EXIT && > 155 krule->listnr != AUDIT_FILTER_URING_EXIT) || 156 krule->inode_f || krule->watch || krule->tree || 157 (f->op != Audit_equal && f->op != Audit_not_equal)) 158 return -EINVAL; 159 160 krule->inode_f = f; 161 return 0; 162 } 163 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 31564 bytes --]
[-- Attachment #1: Type: text/plain, Size: 3329 bytes --] Hi Richard, Thank you for the patch! Yet something to improve: [auto build test ERROR on pcmoore-audit/next] [also build test ERROR on v5.13-rc4 next-20210528] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/Richard-Guy-Briggs/audit-add-filtering-for-io_uring-records-addendum/20210531-214941 base: https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit.git next config: arm64-randconfig-r021-20210531 (attached as .config) compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project bc6799f2f79f0ae87e9f1ebf9d25ba799fbd25a9) reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # install arm64 cross compiling tool for clang build # apt-get install binutils-aarch64-linux-gnu # https://github.com/0day-ci/linux/commit/07a3e22a2f984838bc98b43b58e8ef08e9353483 git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Richard-Guy-Briggs/audit-add-filtering-for-io_uring-records-addendum/20210531-214941 git checkout 07a3e22a2f984838bc98b43b58e8ef08e9353483 # save the attached .config to linux build tree COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=arm64 If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot <lkp@intel.com> All errors (new ones prefixed by >>): >> kernel/auditfilter.c:155:24: error: use of undeclared identifier 'AUDIT_FILTER_URING_EXIT' krule->listnr != AUDIT_FILTER_URING_EXIT) || ^ kernel/auditfilter.c:252:7: error: use of undeclared identifier 'AUDIT_FILTER_URING_EXIT' case AUDIT_FILTER_URING_EXIT: ^ kernel/auditfilter.c:986:28: error: use of undeclared identifier 'AUDIT_FILTER_URING_EXIT' entry->rule.listnr == AUDIT_FILTER_URING_EXIT) { ^ 3 errors generated. -- >> kernel/audit_watch.c:187:24: error: use of undeclared identifier 'AUDIT_FILTER_URING_EXIT' krule->listnr != AUDIT_FILTER_URING_EXIT) || ^ 1 error generated. -- >> kernel/audit_tree.c:731:23: error: use of undeclared identifier 'AUDIT_FILTER_URING_EXIT' rule->listnr != AUDIT_FILTER_URING_EXIT) || ^ 1 error generated. vim +/AUDIT_FILTER_URING_EXIT +155 kernel/auditfilter.c 149 150 /* Translate an inode field to kernel representation. */ 151 static inline int audit_to_inode(struct audit_krule *krule, 152 struct audit_field *f) 153 { 154 if ((krule->listnr != AUDIT_FILTER_EXIT && > 155 krule->listnr != AUDIT_FILTER_URING_EXIT) || 156 krule->inode_f || krule->watch || krule->tree || 157 (f->op != Audit_equal && f->op != Audit_not_equal)) 158 return -EINVAL; 159 160 krule->inode_f = f; 161 return 0; 162 } 163 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 41736 bytes --]
On Mon, May 31, 2021 at 9:44 AM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2021-05-30 11:26, Paul Moore wrote: > > On Fri, May 28, 2021 at 6:36 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > On 2021-05-21 17:50, Paul Moore wrote: ... > > > > diff --git a/kernel/auditsc.c b/kernel/auditsc.c > > > > index d8aa2c690bf9..4f6ab34020fb 100644 > > > > --- a/kernel/auditsc.c > > > > +++ b/kernel/auditsc.c > > > > @@ -799,6 +799,35 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val) > > > > return rule->mask[word] & bit; > > > > } > > > > > > > > +/** > > > > + * audit_filter_uring - apply filters to an io_uring operation > > > > + * @tsk: associated task > > > > + * @ctx: audit context > > > > + */ > > > > +static void audit_filter_uring(struct task_struct *tsk, > > > > + struct audit_context *ctx) > > > > +{ > > > > + struct audit_entry *e; > > > > + enum audit_state state; > > > > + > > > > + if (auditd_test_task(tsk)) > > > > + return; > > > > > > Is this necessary? auditd and auditctl don't (intentionally) use any > > > io_uring functionality. Is it possible it might inadvertantly use some > > > by virtue of libc or other library calls now or in the future? > > > > I think the better question is what harm does it do? Yes, I'm not > > aware of an auditd implementation that currently makes use of > > io_uring, but it is also not inconceivable some future implementation > > might want to make use of it and given the disjoint nature of kernel > > and userspace development I don't want the kernel to block such > > developments. However, if you can think of a reason why having this > > check here is bad I'm listening (note: we are already in the slow path > > here so having the additional check isn't an issue as far as I'm > > concerned). > > > > As a reminder, auditd_test_task() only returns true/1 if the task is > > registered with the audit subsystem as an auditd connection, an > > auditctl process should not cause this function to return true. > > My main concern was overhead, since the whole goal of io_uring is speed. At the point where this test takes place we are already in the audit slow path as far as io_uring is concerned. I understand your concern, but the advantage of being able to effectively use io_uring in the future makes this worth keeping in my opinion. > The chances that audit does use this functionality in the future suggest > to me that it is best to leave this check in. Sounds like we are in agreement. We'll keep it for now. > > > > + rcu_read_lock(); > > > > + list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_URING_EXIT], > > > > + list) { > > > > + if (audit_in_mask(&e->rule, ctx->uring_op) && > > > > > > While this seems like the most obvious approach given the parallels > > > between syscalls and io_uring operations, as coded here it won't work > > > due to the different mappings of syscall numbers and io_uring > > > operations unless we re-use the auditctl -S field with raw io_uring > > > operation numbers in the place of syscall numbers. This should have > > > been obvious to me when I first looked at this patch. It became obvious > > > when I started looking at the userspace auditctl.c. > > > > FWIW, my intention was to treat io_uring opcodes exactly like we treat > > syscall numbers. Yes, this would potentially be an issue if we wanted > > to combine syscalls and io_uring opcodes into one filter, but why > > would we ever want to do that? Combining the two into one filter not > > only makes the filter lists longer than needed (we will always know if > > we are filtering on a syscall or io_uring op) and complicates the > > filter rule processing. > > > > Or is there a problem with this that I'm missing? > > No, I think you have a good understanding of it. I'm asking hard > questions to avoid missing something important. If we can reuse the > syscall infrastructure for this then that is extremely helpful (if not > lazy, which isn't necessarily a bad thing). It does mean that the > io_uring op dictionary will need to live in userspace audit the way it > is currently implemented .... Which I currently believe is the right thing to do. > > > The easy first step would be to use something like this: > > > auditctl -a uring,always -S 18,28 -F key=uring_open > > > to monitor file open commands only. The same is not yet possible for > > > the perm field, but there are so few io_uring ops at this point compared > > > with syscalls that it might be manageable. The arch is irrelevant since > > > io_uring operation numbers are identical across all hardware as far as I > > > can tell. Most of the rest of the fields should make sense if they do > > > for a syscall rule. > > > > I've never been a fan of audit's "perm" filtering; I've always felt > > there were better ways to handle that so I'm not overly upset that we > > are skipping that functionality with this initial support. If it > > becomes a problem in the future we can always add that support at a > > later date. > > Ok, I don't see a pressing need to add it initially, but should add a > check to block that field from being used to avoid the confusion of > unpredictable behaviour should someone try to add a perm filter to a > io_uring filter. That should be done protectively in the kernel and > proactively in userspace. Sure, that's reasonable. > > > Here's a sample of userspace code to support this > > > patch: > > > https://github.com/rgbriggs/audit-userspace/commit/a77baa1651b7ad841a220eb962d4cc92bc07dc96 > > > https://github.com/linux-audit/audit-userspace/compare/master...rgbriggs:ghau-iouring-filtering.v1.0 > > > > Great, thank you. I haven't grabbed a copy yet for testing, but I will. > > I've added a perm filter block as an additional patch in userspace and > updated the tree so that first commit is no longer the top of tree but > the branch name is current. > > I'll add a kernel perm filter check. > > I just noticed some list checking that is missing in tree and watch in > your patch. > > Suggested fixup patches to follow... I see them, thank you, comments will follow over there. Although to be honest I'm mostly focusing on the testing right now while we wait to hear back from Jens on what he is willing to accept regarding audit calls in io_issue_sqe(). If we can't do the _entry()/_exit() calls then this work is pretty much dead and we just have to deal with it in Kconfig. I might make one last, clean patchset and put it in a branch for the distros that want to carry the patchset, but it isn't clear to me that it is something I would want to maintain long term. Long running out of tree patches are generally A Bad Idea. > > > If we abuse the syscall infrastructure at first, we'd need a transition > > > plan to coordinate user and kernel switchover to seperate mechanisms for > > > the two to work together if the need should arise to have both syscall > > > and uring filters in the same rule. > > > > See my comments above, I don't currently see why we would ever want > > syscall and io_uring filtering to happen in the same rule. Please > > speak up if you can think of a reason why this would either be needed, > > or desirable for some reason. > > I think they can be seperate rules for now. Either a syscall rule > catching all io_uring ops can be added, or an io_uring rule can be added > to catch specific ops. The scenario I was thinking of was catching > syscalls of specific io_uring ops. Perhaps I'm misunderstand you, but that scenario really shouldn't exist. The io_uring ops function independently of syscalls; you can *submit* io_uring ops via io_uring_enter(), but they are not guaranteed to be dispatched synchronously (obviously), and given the cred shenanigans that can happen with io_uring there is no guarantee the filters would even be applicable. It isn't an issue of "can" the filters be separate, they *have* to be separate. -- paul moore www.paul-moore.com
On 5/28/21 5:02 PM, Paul Moore wrote:
> On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote:
>> ... If we moved the _entry
>> and _exit calls into the individual operation case blocks (quick
>> openat example below) so that only certain operations were able to be
>> audited would that be acceptable assuming the high frequency ops were
>> untouched? My initial gut feeling was that this would involve >50% of
>> the ops, but Steve Grubb seems to think it would be less; it may be
>> time to look at that a bit more seriously, but if it gets a NACK
>> regardless it isn't worth the time - thoughts?
>>
>> case IORING_OP_OPENAT:
>> audit_uring_entry(req->opcode);
>> ret = io_openat(req, issue_flags);
>> audit_uring_exit(!ret, ret);
>> break;
>
> I wanted to pose this question again in case it was lost in the
> thread, I suspect this may be the last option before we have to "fix"
> things at the Kconfig level. I definitely don't want to have to go
> that route, and I suspect most everyone on this thread feels the same,
> so I'm hopeful we can find a solution that is begrudgingly acceptable
> to both groups.
May work for me, but have to ask how many, and what is the
criteria? I'd think anything opening a file or manipulating fs:
IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2],
IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN,
IORING_OP_FILES_UPDATE
+ coming mkdirat and others.
IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV?
What about?
IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE,
IORING_OP_FALLOCATE, IORING_OP_STATX,
IORING_OP_FADVISE, IORING_OP_MADVISE,
IORING_OP_EPOLL_CTL
Another question, io_uring may exercise asynchronous paths,
i.e. io_issue_sqe() returns before requests completes.
Shouldn't be the case for open/etc at the moment, but was that
considered?
I don't see it happening, but would prefer to keep it open
async reimplementation in a distant future. Does audit sleep?
--
Pavel Begunkov
On 2021-06-01 21:40, Paul Moore wrote: > On Mon, May 31, 2021 at 9:44 AM Richard Guy Briggs <rgb@redhat.com> wrote: > > On 2021-05-30 11:26, Paul Moore wrote: > > > On Fri, May 28, 2021 at 6:36 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > On 2021-05-21 17:50, Paul Moore wrote: > > > > If we abuse the syscall infrastructure at first, we'd need a transition > > > > plan to coordinate user and kernel switchover to seperate mechanisms for > > > > the two to work together if the need should arise to have both syscall > > > > and uring filters in the same rule. > > > > > > See my comments above, I don't currently see why we would ever want > > > syscall and io_uring filtering to happen in the same rule. Please > > > speak up if you can think of a reason why this would either be needed, > > > or desirable for some reason. > > > > I think they can be seperate rules for now. Either a syscall rule > > catching all io_uring ops can be added, or an io_uring rule can be added > > to catch specific ops. The scenario I was thinking of was catching > > syscalls of specific io_uring ops. > > Perhaps I'm misunderstand you, but that scenario really shouldn't > exist. The io_uring ops function independently of syscalls; you can > *submit* io_uring ops via io_uring_enter(), but they are not > guaranteed to be dispatched synchronously (obviously), and given the > cred shenanigans that can happen with io_uring there is no guarantee > the filters would even be applicable. That wasn't my understanding. There are a number of io_uring calls starting with at least open that are currently synchronous (but may become async in future) that we may want to single out which would be a specific io_uring syscall with a specific io_uring opcode. I guess that particular situation would be caught by the io_uring opcode triggering an event that includes SYSCALL and URINGOP records. > It isn't an issue of "can" the filters be separate, they *have* to be separate. > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
On 2021-06-02 09:26, Pavel Begunkov wrote: > On 5/28/21 5:02 PM, Paul Moore wrote: > > On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote: > >> ... If we moved the _entry > >> and _exit calls into the individual operation case blocks (quick > >> openat example below) so that only certain operations were able to be > >> audited would that be acceptable assuming the high frequency ops were > >> untouched? My initial gut feeling was that this would involve >50% of > >> the ops, but Steve Grubb seems to think it would be less; it may be > >> time to look at that a bit more seriously, but if it gets a NACK > >> regardless it isn't worth the time - thoughts? > >> > >> case IORING_OP_OPENAT: > >> audit_uring_entry(req->opcode); > >> ret = io_openat(req, issue_flags); > >> audit_uring_exit(!ret, ret); > >> break; > > > > I wanted to pose this question again in case it was lost in the > > thread, I suspect this may be the last option before we have to "fix" > > things at the Kconfig level. I definitely don't want to have to go > > that route, and I suspect most everyone on this thread feels the same, > > so I'm hopeful we can find a solution that is begrudgingly acceptable > > to both groups. > > May work for me, but have to ask how many, and what is the > criteria? I'd think anything opening a file or manipulating fs: > > IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2], > IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN, > IORING_OP_FILES_UPDATE > + coming mkdirat and others. > > IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV? > > What about? > IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE, > IORING_OP_FALLOCATE, IORING_OP_STATX, > IORING_OP_FADVISE, IORING_OP_MADVISE, > IORING_OP_EPOLL_CTL > > > Another question, io_uring may exercise asynchronous paths, > i.e. io_issue_sqe() returns before requests completes. > Shouldn't be the case for open/etc at the moment, but was that > considered? This would be why audit needs to monitor a thread until it wraps up, to wait for the result code. My understanding is that both sync and async parts of an op would be monitored. > I don't see it happening, but would prefer to keep it open > async reimplementation in a distant future. Does audit sleep? Some parts do, some parts don't depending on what they are interacting with in the kernel. It can be made to not sleep if needed. > Pavel Begunkov - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
On Wed, Jun 2, 2021 at 11:38 AM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2021-06-01 21:40, Paul Moore wrote: > > On Mon, May 31, 2021 at 9:44 AM Richard Guy Briggs <rgb@redhat.com> wrote: > > > On 2021-05-30 11:26, Paul Moore wrote: > > > > On Fri, May 28, 2021 at 6:36 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > > > On 2021-05-21 17:50, Paul Moore wrote: > > > > > If we abuse the syscall infrastructure at first, we'd need a transition > > > > > plan to coordinate user and kernel switchover to seperate mechanisms for > > > > > the two to work together if the need should arise to have both syscall > > > > > and uring filters in the same rule. > > > > > > > > See my comments above, I don't currently see why we would ever want > > > > syscall and io_uring filtering to happen in the same rule. Please > > > > speak up if you can think of a reason why this would either be needed, > > > > or desirable for some reason. > > > > > > I think they can be seperate rules for now. Either a syscall rule > > > catching all io_uring ops can be added, or an io_uring rule can be added > > > to catch specific ops. The scenario I was thinking of was catching > > > syscalls of specific io_uring ops. > > > > Perhaps I'm misunderstand you, but that scenario really shouldn't > > exist. The io_uring ops function independently of syscalls; you can > > *submit* io_uring ops via io_uring_enter(), but they are not > > guaranteed to be dispatched synchronously (obviously), and given the > > cred shenanigans that can happen with io_uring there is no guarantee > > the filters would even be applicable. > > That wasn't my understanding. There are a number of io_uring calls > starting with at least open that are currently synchronous (but may > become async in future) that we may want to single out which would be a > specific io_uring syscall with a specific io_uring opcode. I guess > that particular situation would be caught by the io_uring opcode > triggering an event that includes SYSCALL and URINGOP records. The only io_uring syscalls are io_uring_setup(2), io_uring_enter(2), etc., the stuff that is dispatched in io_issue_sqe() are the io_uring ops/opcodes/whatever. They *look* like syscalls but they are not and we have to treat them differently. > > It isn't an issue of "can" the filters be separate, they *have* to be separate. -- paul moore www.paul-moore.com
On 2021-05-21 17:49, Paul Moore wrote: > WARNING - This is a work in progress and should not be merged > anywhere important. It is almost surely not complete, and while it > probably compiles it likely hasn't been booted and will do terrible > things. You have been warned. > > This patch adds basic auditing to io_uring operations, regardless of > their context. This is accomplished by allocating audit_context > structures for the io-wq worker and io_uring SQPOLL kernel threads > as well as explicitly auditing the io_uring operations in > io_issue_sqe(). The io_uring operations are audited using a new > AUDIT_URINGOP record, an example is shown below: > > % <TODO - insert AUDIT_URINGOP record example> > > Thanks to Richard Guy Briggs for review and feedback. > > Signed-off-by: Paul Moore <paul@paul-moore.com> > --- > fs/io-wq.c | 4 + > fs/io_uring.c | 11 +++ > include/linux/audit.h | 17 ++++ > include/uapi/linux/audit.h | 1 > kernel/audit.h | 2 + > kernel/auditsc.c | 173 ++++++++++++++++++++++++++++++++++++++++++++ > 6 files changed, 208 insertions(+) > > diff --git a/fs/io-wq.c b/fs/io-wq.c > index 5361a9b4b47b..8af09a3336e0 100644 > --- a/fs/io-wq.c > +++ b/fs/io-wq.c > @@ -16,6 +16,7 @@ > #include <linux/rculist_nulls.h> > #include <linux/cpu.h> > #include <linux/tracehook.h> > +#include <linux/audit.h> > > #include "io-wq.h" > > @@ -535,6 +536,8 @@ static int io_wqe_worker(void *data) > snprintf(buf, sizeof(buf), "iou-wrk-%d", wq->task->pid); > set_task_comm(current, buf); > > + audit_alloc_kernel(current); > + > while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) { > long ret; > > @@ -573,6 +576,7 @@ static int io_wqe_worker(void *data) > raw_spin_unlock_irq(&wqe->lock); > } > > + audit_free(current); > io_worker_exit(worker); > return 0; > } > diff --git a/fs/io_uring.c b/fs/io_uring.c > index e481ac8a757a..e9941d1ad8fd 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -78,6 +78,7 @@ > #include <linux/task_work.h> > #include <linux/pagemap.h> > #include <linux/io_uring.h> > +#include <linux/audit.h> > > #define CREATE_TRACE_POINTS > #include <trace/events/io_uring.h> > @@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) > if (req->work.creds && req->work.creds != current_cred()) > creds = override_creds(req->work.creds); > > + if (req->opcode < IORING_OP_LAST) > + audit_uring_entry(req->opcode); > + > switch (req->opcode) { > case IORING_OP_NOP: > ret = io_nop(req, issue_flags); > @@ -6211,6 +6215,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) > break; > } > > + if (req->opcode < IORING_OP_LAST) > + audit_uring_exit(!ret, ret); > + > if (creds) > revert_creds(creds); > > @@ -6827,6 +6834,8 @@ static int io_sq_thread(void *data) > set_cpus_allowed_ptr(current, cpu_online_mask); > current->flags |= PF_NO_SETAFFINITY; > > + audit_alloc_kernel(current); > + > mutex_lock(&sqd->lock); > /* a user may had exited before the thread started */ > io_run_task_work_head(&sqd->park_task_work); > @@ -6916,6 +6925,8 @@ static int io_sq_thread(void *data) > io_run_task_work_head(&sqd->park_task_work); > mutex_unlock(&sqd->lock); > > + audit_free(current); > + > complete(&sqd->exited); > do_exit(0); > } > diff --git a/include/linux/audit.h b/include/linux/audit.h > index 82b7c1116a85..6a0c013bc7de 100644 > --- a/include/linux/audit.h > +++ b/include/linux/audit.h > @@ -286,7 +286,10 @@ static inline int audit_signal_info(int sig, struct task_struct *t) > /* These are defined in auditsc.c */ > /* Public API */ > extern int audit_alloc(struct task_struct *task); > +extern int audit_alloc_kernel(struct task_struct *task); > extern void __audit_free(struct task_struct *task); > +extern void __audit_uring_entry(u8 op); > +extern void __audit_uring_exit(int success, long code); > extern void __audit_syscall_entry(int major, unsigned long a0, unsigned long a1, > unsigned long a2, unsigned long a3); > extern void __audit_syscall_exit(int ret_success, long ret_value); > @@ -323,6 +326,16 @@ static inline void audit_free(struct task_struct *task) > if (unlikely(task->audit_context)) > __audit_free(task); > } > +static inline void audit_uring_entry(u8 op) > +{ > + if (unlikely(audit_context())) > + __audit_uring_entry(op); > +} > +static inline void audit_uring_exit(int success, long code) > +{ > + if (unlikely(audit_context())) > + __audit_uring_exit(success, code); > +} > static inline void audit_syscall_entry(int major, unsigned long a0, > unsigned long a1, unsigned long a2, > unsigned long a3) > @@ -554,6 +567,10 @@ static inline int audit_alloc(struct task_struct *task) > { > return 0; > } > +static inline int audit_alloc_kernel(struct task_struct *task) > +{ > + return 0; > +} > static inline void audit_free(struct task_struct *task) > { } > static inline void audit_syscall_entry(int major, unsigned long a0, > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h > index cd2d8279a5e4..b26e0c435e8b 100644 > --- a/include/uapi/linux/audit.h > +++ b/include/uapi/linux/audit.h > @@ -118,6 +118,7 @@ > #define AUDIT_TIME_ADJNTPVAL 1333 /* NTP value adjustment */ > #define AUDIT_BPF 1334 /* BPF subsystem */ > #define AUDIT_EVENT_LISTENER 1335 /* Task joined multicast read socket */ > +#define AUDIT_URINGOP 1336 /* io_uring operation */ > > #define AUDIT_AVC 1400 /* SE Linux avc denial or grant */ > #define AUDIT_SELINUX_ERR 1401 /* Internal SE Linux Errors */ > diff --git a/kernel/audit.h b/kernel/audit.h > index fba180de5912..50de827497ca 100644 > --- a/kernel/audit.h > +++ b/kernel/audit.h > @@ -100,10 +100,12 @@ struct audit_context { > enum { > AUDIT_CTX_UNUSED, /* audit_context is currently unused */ > AUDIT_CTX_SYSCALL, /* in use by syscall */ > + AUDIT_CTX_URING, /* in use by io_uring */ > } context; > enum audit_state state, current_state; > unsigned int serial; /* serial number for record */ > int major; /* syscall number */ > + int uring_op; /* uring operation */ > struct timespec64 ctime; /* time of syscall entry */ > unsigned long argv[4]; /* syscall arguments */ > long return_code;/* syscall return code */ > diff --git a/kernel/auditsc.c b/kernel/auditsc.c > index cc89e9f9a753..729849d41631 100644 > --- a/kernel/auditsc.c > +++ b/kernel/auditsc.c > @@ -953,6 +953,7 @@ static void audit_reset_context(struct audit_context *ctx) > ctx->current_state = ctx->state; > ctx->serial = 0; > ctx->major = 0; > + ctx->uring_op = 0; > ctx->ctime = (struct timespec64){ .tv_sec = 0, .tv_nsec = 0 }; > memset(ctx->argv, 0, sizeof(ctx->argv)); > ctx->return_code = 0; > @@ -1038,6 +1039,31 @@ int audit_alloc(struct task_struct *tsk) > return 0; > } > > +/** > + * audit_alloc_kernel - allocate an audit_context for a kernel task > + * @tsk: the kernel task > + * > + * Similar to the audit_alloc() function, but intended for kernel private > + * threads. Returns zero on success, negative values on failure. > + */ > +int audit_alloc_kernel(struct task_struct *tsk) > +{ > + /* > + * At the moment we are just going to call into audit_alloc() to > + * simplify the code, but there two things to keep in mind with this > + * approach: > + * > + * 1. Filtering internal kernel tasks is a bit laughable in almost all > + * cases, but there is at least one case where there is a benefit: > + * the '-a task,never' case allows the admin to effectively disable > + * task auditing at runtime. > + * > + * 2. The {set,clear}_task_syscall_work() ops likely have zero effect > + * on these internal kernel tasks, but they probably don't hurt either. > + */ > + return audit_alloc(tsk); > +} > + > static inline void audit_free_context(struct audit_context *context) > { > /* resetting is extra work, but it is likely just noise */ > @@ -1536,6 +1562,52 @@ static void audit_log_proctitle(void) > audit_log_end(ab); > } > > +/** > + * audit_log_uring - generate a AUDIT_URINGOP record > + * @ctx: the audit context > + */ > +static void audit_log_uring(struct audit_context *ctx) > +{ > + struct audit_buffer *ab; > + const struct cred *cred; > + > + /* > + * TODO: What do we log here? I'm tossing in a few things to start the > + * conversation, but additional thought needs to go into this. > + */ > + > + ab = audit_log_start(ctx, GFP_KERNEL, AUDIT_URINGOP); > + if (!ab) > + return; > + cred = current_cred(); This may need to be req->work.creds. I haven't been following if the io_uring thread inherited the user task's creds (and below, comm and exe). > + audit_log_format(ab, "uring_op=%d", ctx->uring_op); arch is stored below in __audit_uring_entry() and never used in the AUDIT_CTX_URING case. That assignment can either be dropped or printed before uring_op similar to the SYSCALL record. There aren't really any arg[0-3] to print. io_uring_register and io_uring_setup() args are better covered by other records. io_uring_enter() has 6 args and the last two aren't covered by SYSCALL anyways. > + if (ctx->return_valid != AUDITSC_INVALID) > + audit_log_format(ab, " success=%s exit=%ld", > + (ctx->return_valid == AUDITSC_SUCCESS ? > + "yes" : "no"), > + ctx->return_code); > + audit_log_format(ab, > + " items=%d" > + " ppid=%d pid=%d auid=%u uid=%u gid=%u" > + " euid=%u suid=%u fsuid=%u" > + " egid=%u sgid=%u fsgid=%u", > + ctx->name_count, > + task_ppid_nr(current), > + task_tgid_nr(current), > + from_kuid(&init_user_ns, audit_get_loginuid(current)), > + from_kuid(&init_user_ns, cred->uid), > + from_kgid(&init_user_ns, cred->gid), > + from_kuid(&init_user_ns, cred->euid), > + from_kuid(&init_user_ns, cred->suid), > + from_kuid(&init_user_ns, cred->fsuid), > + from_kgid(&init_user_ns, cred->egid), > + from_kgid(&init_user_ns, cred->sgid), > + from_kgid(&init_user_ns, cred->fsgid)); The audit session ID is still important, relevant and qualifies auid. In keeping with the SYSCALL record format, I think we want to keep ses=audit_get_sessionid(current) in here. I'm pretty sure we also want to keep comm= and exe= too, but may have to reach into req->task to get it. There are two values for comm possible, one from the original task and second "iou-sqp-<pid>" set at the top of io_sq_thread(). I'm reluctant to leave them out now and then have to re-add them in yet another field order later. > + audit_log_task_context(ab); > + audit_log_key(ab, ctx->filterkey); > + audit_log_end(ab); > +} > + > static void audit_log_exit(void) > { > int i, call_panic = 0; > @@ -1571,6 +1643,9 @@ static void audit_log_exit(void) > audit_log_key(ab, context->filterkey); > audit_log_end(ab); > break; > + case AUDIT_CTX_URING: > + audit_log_uring(context); > + break; > default: > BUG(); > break; > @@ -1740,6 +1815,104 @@ static void audit_return_fixup(struct audit_context *ctx, > ctx->return_valid = (success ? AUDITSC_SUCCESS : AUDITSC_FAILURE); > } > > +/** > + * __audit_uring_entry - prepare the kernel task's audit context for io_uring > + * @op: the io_uring opcode > + * > + * This is similar to audit_syscall_entry() but is intended for use by io_uring > + * operations. > + */ > +void __audit_uring_entry(u8 op) > +{ > + struct audit_context *ctx = audit_context(); > + > + if (!audit_enabled || !ctx || ctx->state == AUDIT_DISABLED) > + return; > + > + /* > + * NOTE: It's possible that we can be called from the process' context > + * before it returns to userspace, and before audit_syscall_exit() > + * is called. In this case there is not much to do, just record > + * the io_uring details and return. > + */ > + ctx->uring_op = op; > + if (ctx->context == AUDIT_CTX_SYSCALL) > + return; > + > + ctx->dummy = !audit_n_rules; > + if (!ctx->dummy && ctx->state == AUDIT_BUILD_CONTEXT) > + ctx->prio = 0; > + > + ctx->arch = syscall_get_arch(current); > + ctx->context = AUDIT_CTX_URING; > + ctx->current_state = ctx->state; > + ktime_get_coarse_real_ts64(&ctx->ctime); > +} > + > +/** > + * __audit_uring_exit - wrap up the kernel task's audit context after io_uring > + * @success: true/false value to indicate if the operation succeeded or not > + * @code: operation return code > + * > + * This is similar to audit_syscall_exit() but is intended for use by io_uring > + * operations. > + */ > +void __audit_uring_exit(int success, long code) > +{ > + struct audit_context *ctx = audit_context(); > + > + /* > + * TODO: At some point we will likely want to filter on io_uring ops > + * and other things similar to what we do for syscalls, but that > + * is something for another day; just record what we can here. > + */ > + > + if (!ctx || ctx->dummy) > + goto out; > + if (ctx->context == AUDIT_CTX_SYSCALL) { > + /* > + * NOTE: See the note in __audit_uring_entry() about the case > + * where we may be called from process context before we > + * return to userspace via audit_syscall_exit(). In this > + * case we simply emit a URINGOP record and bail, the > + * normal syscall exit handling will take care of > + * everything else. > + * It is also worth mentioning that when we are called, > + * the current process creds may differ from the creds > + * used during the normal syscall processing; keep that > + * in mind if/when we move the record generation code. > + */ > + > + /* > + * We need to filter on the syscall info here to decide if we > + * should emit a URINGOP record. I know it seems odd but this > + * solves the problem where users have a filter to block *all* > + * syscall records in the "exit" filter; we want to preserve > + * the behavior here. > + */ > + audit_filter_syscall(current, ctx); > + audit_filter_inodes(current, ctx); > + if (ctx->current_state != AUDIT_RECORD_CONTEXT) > + return; > + > + audit_log_uring(ctx); > + return; > + } > + > + /* this may generate CONFIG_CHANGE records */ > + if (!list_empty(&ctx->killed_trees)) > + audit_kill_trees(ctx); > + > + audit_filter_inodes(current, ctx); > + if (ctx->current_state != AUDIT_RECORD_CONTEXT) > + goto out; > + audit_return_fixup(ctx, success, code); > + audit_log_exit(); > + > +out: > + audit_reset_context(ctx); > +} > + > /** > * __audit_syscall_entry - fill in an audit record at syscall entry > * @major: major syscall type (function) > > -- > Linux-audit mailing list > Linux-audit@redhat.com > https://listman.redhat.com/mailman/listinfo/linux-audit - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
On Wed, Jun 2, 2021 at 4:27 AM Pavel Begunkov <asml.silence@gmail.com> wrote: > On 5/28/21 5:02 PM, Paul Moore wrote: > > On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote: > >> ... If we moved the _entry > >> and _exit calls into the individual operation case blocks (quick > >> openat example below) so that only certain operations were able to be > >> audited would that be acceptable assuming the high frequency ops were > >> untouched? My initial gut feeling was that this would involve >50% of > >> the ops, but Steve Grubb seems to think it would be less; it may be > >> time to look at that a bit more seriously, but if it gets a NACK > >> regardless it isn't worth the time - thoughts? > >> > >> case IORING_OP_OPENAT: > >> audit_uring_entry(req->opcode); > >> ret = io_openat(req, issue_flags); > >> audit_uring_exit(!ret, ret); > >> break; > > > > I wanted to pose this question again in case it was lost in the > > thread, I suspect this may be the last option before we have to "fix" > > things at the Kconfig level. I definitely don't want to have to go > > that route, and I suspect most everyone on this thread feels the same, > > so I'm hopeful we can find a solution that is begrudgingly acceptable > > to both groups. > > May work for me, but have to ask how many, and what is the > criteria? I'd think anything opening a file or manipulating fs: > > IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2], > IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN, > IORING_OP_FILES_UPDATE > + coming mkdirat and others. > > IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV? > > What about? > IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE, > IORING_OP_FALLOCATE, IORING_OP_STATX, > IORING_OP_FADVISE, IORING_OP_MADVISE, > IORING_OP_EPOLL_CTL Looking quickly at v5.13-rc4 the following seems like candidates for auditing, there may be a small number of subtractions/additions to this list as people take a closer look, but it should serve as a starting point: IORING_OP_SENDMSG IORING_OP_RECVMSG IORING_OP_ACCEPT IORING_OP_CONNECT IORING_OP_FALLOCATE IORING_OP_OPENAT IORING_OP_CLOSE IORING_OP_MADVISE IORING_OP_OPENAT2 IORING_OP_SHUTDOWN IORING_OP_RENAMEAT IORING_OP_UNLINKAT ... can you live with that list? > Another question, io_uring may exercise asynchronous paths, > i.e. io_issue_sqe() returns before requests completes. > Shouldn't be the case for open/etc at the moment, but was that > considered? Yes, I noticed that when testing the code (and it makes sense when you look at how io_uring handles things). Depending on the state of the system when the io_uring request is submitted I've seen both sync and async io_uring operations with the associated different calling contexts. In the case where io_issue_sqe() needs to defer the operation to a different context you will see an audit record indicating that the operation failed and then another audit record when it completes; it's actually pretty interesting to be able to see how the system and io_uring are working. We could always mask out these delayed attempts, but at this early stage they are helpful, and they may be useful for admins. > I don't see it happening, but would prefer to keep it open > async reimplementation in a distant future. Does audit sleep? The only place in the audit_uring_entry()/audit_uring_exit() code path that could sleep at present is the call to audit_log_uring() which is made when the rules dictate that an audit record be generated. The offending code is an allocation in audit_log_uring() which is currently GFP_KERNEL but really should be GFP_ATOMIC, or similar. It was a copy-n-paste from the similar syscall function where GFP_KERNEL is appropriate due to the calling context at the end of the syscall. I'll change that as soon as I'm done with this email. Of course if you are calling io_uring_enter(2), or something similar, then audit may sleep as part of the normal syscall processing (as mentioned above), but that is due to the fact that io_uring_enter(2) is a syscall and not because of anything in io_issue_sqe(). -- paul moore www.paul-moore.com
On Wed, Jun 2, 2021 at 1:29 PM Richard Guy Briggs <rgb@redhat.com> wrote: > On 2021-05-21 17:49, Paul Moore wrote: > > WARNING - This is a work in progress and should not be merged > > anywhere important. It is almost surely not complete, and while it > > probably compiles it likely hasn't been booted and will do terrible > > things. You have been warned. > > > > This patch adds basic auditing to io_uring operations, regardless of > > their context. This is accomplished by allocating audit_context > > structures for the io-wq worker and io_uring SQPOLL kernel threads > > as well as explicitly auditing the io_uring operations in > > io_issue_sqe(). The io_uring operations are audited using a new > > AUDIT_URINGOP record, an example is shown below: > > > > % <TODO - insert AUDIT_URINGOP record example> > > > > Thanks to Richard Guy Briggs for review and feedback. > > > > Signed-off-by: Paul Moore <paul@paul-moore.com> > > --- > > fs/io-wq.c | 4 + > > fs/io_uring.c | 11 +++ > > include/linux/audit.h | 17 ++++ > > include/uapi/linux/audit.h | 1 > > kernel/audit.h | 2 + > > kernel/auditsc.c | 173 ++++++++++++++++++++++++++++++++++++++++++++ > > 6 files changed, 208 insertions(+) ... > > diff --git a/fs/io_uring.c b/fs/io_uring.c > > index e481ac8a757a..e9941d1ad8fd 100644 > > --- a/fs/io_uring.c > > +++ b/fs/io_uring.c > > @@ -78,6 +78,7 @@ > > #include <linux/task_work.h> > > #include <linux/pagemap.h> > > #include <linux/io_uring.h> > > +#include <linux/audit.h> > > > > #define CREATE_TRACE_POINTS > > #include <trace/events/io_uring.h> > > @@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) > > if (req->work.creds && req->work.creds != current_cred()) > > creds = override_creds(req->work.creds); > > > > + if (req->opcode < IORING_OP_LAST) > > + audit_uring_entry(req->opcode); Note well the override_creds() call right above the audit code that is being added, it will be important later in this email. > > diff --git a/kernel/auditsc.c b/kernel/auditsc.c > > index cc89e9f9a753..729849d41631 100644 > > --- a/kernel/auditsc.c > > +++ b/kernel/auditsc.c > > @@ -1536,6 +1562,52 @@ static void audit_log_proctitle(void) > > audit_log_end(ab); > > } > > > > +/** > > + * audit_log_uring - generate a AUDIT_URINGOP record > > + * @ctx: the audit context > > + */ > > +static void audit_log_uring(struct audit_context *ctx) > > +{ > > + struct audit_buffer *ab; > > + const struct cred *cred; > > + > > + /* > > + * TODO: What do we log here? I'm tossing in a few things to start the > > + * conversation, but additional thought needs to go into this. > > + */ > > + > > + ab = audit_log_start(ctx, GFP_KERNEL, AUDIT_URINGOP); > > + if (!ab) > > + return; > > + cred = current_cred(); > > This may need to be req->work.creds. I haven't been following if the > io_uring thread inherited the user task's creds (and below, comm and > exe). Nope, we're good. See the existing code in io_issue_sqe() :) > > + audit_log_format(ab, "uring_op=%d", ctx->uring_op); > > arch is stored below in __audit_uring_entry() and never used in the > AUDIT_CTX_URING case. That assignment can either be dropped or printed > before uring_op similar to the SYSCALL record. Good point, I'll drop the code that records the arch from _entry(); it is really only useful to give the appropriate context if needed for other things in the audit stream, and that isn't the case like it is with syscalls. > There aren't really any arg[0-3] to print. Which is why I didn't print them. > io_uring_register and io_uring_setup() args are better covered by other > records. io_uring_enter() has 6 args and the last two aren't covered by > SYSCALL anyways. ??? I think you are confusing the io_uring ops with syscalls; they are very different things from an audit perspective and the io_uring auditing is not intended to replace syscall records. The io_uring_setup() and io_uring_enter() syscalls will be auditing just as any other syscalls would be using the existing syscall audit code. > > + if (ctx->return_valid != AUDITSC_INVALID) > > + audit_log_format(ab, " success=%s exit=%ld", > > + (ctx->return_valid == AUDITSC_SUCCESS ? > > + "yes" : "no"), > > + ctx->return_code); > > + audit_log_format(ab, > > + " items=%d" > > + " ppid=%d pid=%d auid=%u uid=%u gid=%u" > > + " euid=%u suid=%u fsuid=%u" > > + " egid=%u sgid=%u fsgid=%u", > > + ctx->name_count, > > + task_ppid_nr(current), > > + task_tgid_nr(current), > > + from_kuid(&init_user_ns, audit_get_loginuid(current)), > > + from_kuid(&init_user_ns, cred->uid), > > + from_kgid(&init_user_ns, cred->gid), > > + from_kuid(&init_user_ns, cred->euid), > > + from_kuid(&init_user_ns, cred->suid), > > + from_kuid(&init_user_ns, cred->fsuid), > > + from_kgid(&init_user_ns, cred->egid), > > + from_kgid(&init_user_ns, cred->sgid), > > + from_kgid(&init_user_ns, cred->fsgid)); > > The audit session ID is still important, relevant and qualifies auid. > In keeping with the SYSCALL record format, I think we want to keep > ses=audit_get_sessionid(current) in here. This might be another case of syscall/io_uring confusion. An io_uring op doesn't necessarily have an audit session ID or an audit UID in the conventional sense; for example think about SQPOLL works, shared rings, etc. > I'm pretty sure we also want to keep comm= and exe= too, but may have to > reach into req->task to get it. There are two values for comm possible, > one from the original task and second "iou-sqp-<pid>" set at the top of > io_sq_thread(). I think this is more syscall/io_uring confusion. -- paul moore www.paul-moore.com
On 6/2/21 4:46 PM, Richard Guy Briggs wrote: > On 2021-06-02 09:26, Pavel Begunkov wrote: >> On 5/28/21 5:02 PM, Paul Moore wrote: >>> On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote: >>>> ... If we moved the _entry >>>> and _exit calls into the individual operation case blocks (quick >>>> openat example below) so that only certain operations were able to be >>>> audited would that be acceptable assuming the high frequency ops were >>>> untouched? My initial gut feeling was that this would involve >50% of >>>> the ops, but Steve Grubb seems to think it would be less; it may be >>>> time to look at that a bit more seriously, but if it gets a NACK >>>> regardless it isn't worth the time - thoughts? >>>> >>>> case IORING_OP_OPENAT: >>>> audit_uring_entry(req->opcode); >>>> ret = io_openat(req, issue_flags); >>>> audit_uring_exit(!ret, ret); >>>> break; >>> >>> I wanted to pose this question again in case it was lost in the >>> thread, I suspect this may be the last option before we have to "fix" >>> things at the Kconfig level. I definitely don't want to have to go >>> that route, and I suspect most everyone on this thread feels the same, >>> so I'm hopeful we can find a solution that is begrudgingly acceptable >>> to both groups. >> >> May work for me, but have to ask how many, and what is the >> criteria? I'd think anything opening a file or manipulating fs: >> >> IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2], >> IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN, >> IORING_OP_FILES_UPDATE >> + coming mkdirat and others. >> >> IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV? >> >> What about? >> IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE, >> IORING_OP_FALLOCATE, IORING_OP_STATX, >> IORING_OP_FADVISE, IORING_OP_MADVISE, >> IORING_OP_EPOLL_CTL >> >> >> Another question, io_uring may exercise asynchronous paths, >> i.e. io_issue_sqe() returns before requests completes. >> Shouldn't be the case for open/etc at the moment, but was that >> considered? > > This would be why audit needs to monitor a thread until it wraps up, to > wait for the result code. My understanding is that both sync and async > parts of an op would be monitored. There may be a misunderstanding audit_start(req) ret = io_issue_sqe(req); audit_end(ret); io_issue_sqe() may return 0 but leave the request inflight, which will be completed asynchronously e.g. by IRQ, not going through io_issue_sqe() or any io_read()/etc helpers again, and after last audit_end() had already happened. That's the case with read/write/timeout, but is not true for open/etc. >> I don't see it happening, but would prefer to keep it open >> async reimplementation in a distant future. Does audit sleep? > > Some parts do, some parts don't depending on what they are interacting > with in the kernel. It can be made to not sleep if needed. Ok, good -- Pavel Begunkov
On 6/2/21 8:46 PM, Paul Moore wrote: > On Wed, Jun 2, 2021 at 4:27 AM Pavel Begunkov <asml.silence@gmail.com> wrote: >> On 5/28/21 5:02 PM, Paul Moore wrote: >>> On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote: >>>> ... If we moved the _entry >>>> and _exit calls into the individual operation case blocks (quick >>>> openat example below) so that only certain operations were able to be >>>> audited would that be acceptable assuming the high frequency ops were >>>> untouched? My initial gut feeling was that this would involve >50% of >>>> the ops, but Steve Grubb seems to think it would be less; it may be >>>> time to look at that a bit more seriously, but if it gets a NACK >>>> regardless it isn't worth the time - thoughts? >>>> >>>> case IORING_OP_OPENAT: >>>> audit_uring_entry(req->opcode); >>>> ret = io_openat(req, issue_flags); >>>> audit_uring_exit(!ret, ret); >>>> break; >>> >>> I wanted to pose this question again in case it was lost in the >>> thread, I suspect this may be the last option before we have to "fix" >>> things at the Kconfig level. I definitely don't want to have to go >>> that route, and I suspect most everyone on this thread feels the same, >>> so I'm hopeful we can find a solution that is begrudgingly acceptable >>> to both groups. >> >> May work for me, but have to ask how many, and what is the >> criteria? I'd think anything opening a file or manipulating fs: >> >> IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2], >> IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN, >> IORING_OP_FILES_UPDATE >> + coming mkdirat and others. >> >> IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV? >> >> What about? >> IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE, >> IORING_OP_FALLOCATE, IORING_OP_STATX, >> IORING_OP_FADVISE, IORING_OP_MADVISE, >> IORING_OP_EPOLL_CTL > > Looking quickly at v5.13-rc4 the following seems like candidates for > auditing, there may be a small number of subtractions/additions to > this list as people take a closer look, but it should serve as a > starting point: > > IORING_OP_SENDMSG > IORING_OP_RECVMSG > IORING_OP_ACCEPT > IORING_OP_CONNECT > IORING_OP_FALLOCATE > IORING_OP_OPENAT > IORING_OP_CLOSE > IORING_OP_MADVISE > IORING_OP_OPENAT2 > IORING_OP_SHUTDOWN > IORING_OP_RENAMEAT > IORING_OP_UNLINKAT > > ... can you live with that list? it will bloat binary somewhat, but considering it's all in one place -- io_issue_sqe(), it's workable. Not nice to have send/recv msg in the list, but I admit they may do some crazy things. What can be traced for them? Because at the moment of issue_sqe() not everything is read from the userspace. see: io_sendmsg() { ...; io_sendmsg_copy_hdr(); }, will copy header only in io_sendmsg() in most cases, and then stash it for re-issuing if needed. >> Another question, io_uring may exercise asynchronous paths, >> i.e. io_issue_sqe() returns before requests completes. >> Shouldn't be the case for open/etc at the moment, but was that >> considered? > > Yes, I noticed that when testing the code (and it makes sense when you > look at how io_uring handles things). Depending on the state of the > system when the io_uring request is submitted I've seen both sync and > async io_uring operations with the associated different calling > contexts. In the case where io_issue_sqe() needs to defer the > operation to a different context you will see an audit record > indicating that the operation failed and then another audit record > when it completes; it's actually pretty interesting to be able to see > how the system and io_uring are working. Copying a reply to another message to keep clear out of misunderstanding. "io_issue_sqe() may return 0 but leave the request inflight, which will be completed asynchronously e.g. by IRQ, not going through io_issue_sqe() or any io_read()/etc helpers again, and after last audit_end() had already happened. That's the case with read/write/timeout, but is not true for open/etc." And there is interest in async send/recv[msg] as well (via IRQ as described, callbacks, etc.). > We could always mask out these delayed attempts, but at this early > stage they are helpful, and they may be useful for admins. > >> I don't see it happening, but would prefer to keep it open >> async reimplementation in a distant future. Does audit sleep? > > The only place in the audit_uring_entry()/audit_uring_exit() code path > that could sleep at present is the call to audit_log_uring() which is > made when the rules dictate that an audit record be generated. The > offending code is an allocation in audit_log_uring() which is > currently GFP_KERNEL but really should be GFP_ATOMIC, or similar. It > was a copy-n-paste from the similar syscall function where GFP_KERNEL > is appropriate due to the calling context at the end of the syscall. > I'll change that as soon as I'm done with this email. Ok, depends where it steers, but there may be a requirement to not sleep for some hooks because of not having a sleepable context. > > Of course if you are calling io_uring_enter(2), or something similar, > then audit may sleep as part of the normal syscall processing (as > mentioned above), but that is due to the fact that io_uring_enter(2) > is a syscall and not because of anything in io_issue_sqe(). > -- Pavel Begunkov
On 5/28/21 10:02 AM, Paul Moore wrote:
> On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote:
>> ... If we moved the _entry
>> and _exit calls into the individual operation case blocks (quick
>> openat example below) so that only certain operations were able to be
>> audited would that be acceptable assuming the high frequency ops were
>> untouched? My initial gut feeling was that this would involve >50% of
>> the ops, but Steve Grubb seems to think it would be less; it may be
>> time to look at that a bit more seriously, but if it gets a NACK
>> regardless it isn't worth the time - thoughts?
>>
>> case IORING_OP_OPENAT:
>> audit_uring_entry(req->opcode);
>> ret = io_openat(req, issue_flags);
>> audit_uring_exit(!ret, ret);
>> break;
>
> I wanted to pose this question again in case it was lost in the
> thread, I suspect this may be the last option before we have to "fix"
> things at the Kconfig level. I definitely don't want to have to go
> that route, and I suspect most everyone on this thread feels the same,
> so I'm hopeful we can find a solution that is begrudgingly acceptable
> to both groups.
Sorry for the lack of response here, but to sum up my order of
preference:
1) It's probably better to just make the audit an opt-out in io_op_defs
for each opcode, and avoid needing boiler plate code for each op
handler. The opt-out would ensure that new opcodes get it by default
it someone doesn't know what it is, and the io_op_defs addition would
mean that it's in generic code rather then in the handlers. Yes it's
a bit slower, but it's saner imho.
2) With the above, I'm fine with adding this to io_uring. I don't think
going the route of mutual exclusion in kconfig helps anyone, it'd
be counter productive to both sides.
Hope that works and helps move this forward. I'll be mostly out of touch
the next week and a half, but wanted to ensure that I sent out my
(brief) thoughts before going away.
--
Jens Axboe
On 6/3/2021 3:51 AM, Pavel Begunkov wrote: > On 6/2/21 8:46 PM, Paul Moore wrote: >> On Wed, Jun 2, 2021 at 4:27 AM Pavel Begunkov <asml.silence@gmail.com> wrote: >>> On 5/28/21 5:02 PM, Paul Moore wrote: >>>> On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote: >>>>> ... If we moved the _entry >>>>> and _exit calls into the individual operation case blocks (quick >>>>> openat example below) so that only certain operations were able to be >>>>> audited would that be acceptable assuming the high frequency ops were >>>>> untouched? My initial gut feeling was that this would involve >50% of >>>>> the ops, but Steve Grubb seems to think it would be less; it may be >>>>> time to look at that a bit more seriously, but if it gets a NACK >>>>> regardless it isn't worth the time - thoughts? >>>>> >>>>> case IORING_OP_OPENAT: >>>>> audit_uring_entry(req->opcode); >>>>> ret = io_openat(req, issue_flags); >>>>> audit_uring_exit(!ret, ret); >>>>> break; >>>> I wanted to pose this question again in case it was lost in the >>>> thread, I suspect this may be the last option before we have to "fix" >>>> things at the Kconfig level. I definitely don't want to have to go >>>> that route, and I suspect most everyone on this thread feels the same, >>>> so I'm hopeful we can find a solution that is begrudgingly acceptable >>>> to both groups. >>> May work for me, but have to ask how many, and what is the >>> criteria? I'd think anything opening a file or manipulating fs: >>> >>> IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2], >>> IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN, >>> IORING_OP_FILES_UPDATE >>> + coming mkdirat and others. >>> >>> IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV? >>> >>> What about? >>> IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE, >>> IORING_OP_FALLOCATE, IORING_OP_STATX, >>> IORING_OP_FADVISE, IORING_OP_MADVISE, >>> IORING_OP_EPOLL_CTL >> Looking quickly at v5.13-rc4 the following seems like candidates for >> auditing, there may be a small number of subtractions/additions to >> this list as people take a closer look, but it should serve as a >> starting point: >> >> IORING_OP_SENDMSG >> IORING_OP_RECVMSG >> IORING_OP_ACCEPT >> IORING_OP_CONNECT >> IORING_OP_FALLOCATE >> IORING_OP_OPENAT >> IORING_OP_CLOSE >> IORING_OP_MADVISE >> IORING_OP_OPENAT2 >> IORING_OP_SHUTDOWN >> IORING_OP_RENAMEAT >> IORING_OP_UNLINKAT >> >> ... can you live with that list? > it will bloat binary somewhat, but considering it's all in one > place -- io_issue_sqe(), it's workable. > > Not nice to have send/recv msg in the list, but I admit they > may do some crazy things. What can be traced for them? Both SELinux and Smack do access checks on packet operations. As access may be denied by these checks, audit needs to be available. This is true for UDS, IP and at least one other protocol family. > Because > at the moment of issue_sqe() not everything is read from the > userspace. > > see: io_sendmsg() { ...; io_sendmsg_copy_hdr(); }, > > will copy header only in io_sendmsg() in most cases, and > then stash it for re-issuing if needed. > > >>> Another question, io_uring may exercise asynchronous paths, >>> i.e. io_issue_sqe() returns before requests completes. >>> Shouldn't be the case for open/etc at the moment, but was that >>> considered? >> Yes, I noticed that when testing the code (and it makes sense when you >> look at how io_uring handles things). Depending on the state of the >> system when the io_uring request is submitted I've seen both sync and >> async io_uring operations with the associated different calling >> contexts. In the case where io_issue_sqe() needs to defer the >> operation to a different context you will see an audit record >> indicating that the operation failed and then another audit record >> when it completes; it's actually pretty interesting to be able to see >> how the system and io_uring are working. > Copying a reply to another message to keep clear out > of misunderstanding. > > "io_issue_sqe() may return 0 but leave the request inflight, > which will be completed asynchronously e.g. by IRQ, not going > through io_issue_sqe() or any io_read()/etc helpers again, and > after last audit_end() had already happened. > That's the case with read/write/timeout, but is not true for > open/etc." > > And there is interest in async send/recv[msg] as well (via > IRQ as described, callbacks, etc.). > >> We could always mask out these delayed attempts, but at this early >> stage they are helpful, and they may be useful for admins. >> >>> I don't see it happening, but would prefer to keep it open >>> async reimplementation in a distant future. Does audit sleep? >> The only place in the audit_uring_entry()/audit_uring_exit() code path >> that could sleep at present is the call to audit_log_uring() which is >> made when the rules dictate that an audit record be generated. The >> offending code is an allocation in audit_log_uring() which is >> currently GFP_KERNEL but really should be GFP_ATOMIC, or similar. It >> was a copy-n-paste from the similar syscall function where GFP_KERNEL >> is appropriate due to the calling context at the end of the syscall. >> I'll change that as soon as I'm done with this email. > Ok, depends where it steers, but there may be a requirement to > not sleep for some hooks because of not having a sleepable context. > >> Of course if you are calling io_uring_enter(2), or something similar, >> then audit may sleep as part of the normal syscall processing (as >> mentioned above), but that is due to the fact that io_uring_enter(2) >> is a syscall and not because of anything in io_issue_sqe(). >>
On Thu, Jun 3, 2021 at 11:54 AM Jens Axboe <axboe@kernel.dk> wrote:
> On 5/28/21 10:02 AM, Paul Moore wrote:
> > On Wed, May 26, 2021 at 4:19 PM Paul Moore <paul@paul-moore.com> wrote:
> >> ... If we moved the _entry
> >> and _exit calls into the individual operation case blocks (quick
> >> openat example below) so that only certain operations were able to be
> >> audited would that be acceptable assuming the high frequency ops were
> >> untouched? My initial gut feeling was that this would involve >50% of
> >> the ops, but Steve Grubb seems to think it would be less; it may be
> >> time to look at that a bit more seriously, but if it gets a NACK
> >> regardless it isn't worth the time - thoughts?
> >>
> >> case IORING_OP_OPENAT:
> >> audit_uring_entry(req->opcode);
> >> ret = io_openat(req, issue_flags);
> >> audit_uring_exit(!ret, ret);
> >> break;
> >
> > I wanted to pose this question again in case it was lost in the
> > thread, I suspect this may be the last option before we have to "fix"
> > things at the Kconfig level. I definitely don't want to have to go
> > that route, and I suspect most everyone on this thread feels the same,
> > so I'm hopeful we can find a solution that is begrudgingly acceptable
> > to both groups.
>
> Sorry for the lack of response here, but to sum up my order of
> preference:
>
> 1) It's probably better to just make the audit an opt-out in io_op_defs
> for each opcode, and avoid needing boiler plate code for each op
> handler. The opt-out would ensure that new opcodes get it by default
> it someone doesn't know what it is, and the io_op_defs addition would
> mean that it's in generic code rather then in the handlers. Yes it's
> a bit slower, but it's saner imho.
>
> 2) With the above, I'm fine with adding this to io_uring. I don't think
> going the route of mutual exclusion in kconfig helps anyone, it'd
> be counter productive to both sides.
>
> Hope that works and helps move this forward. I'll be mostly out of touch
> the next week and a half, but wanted to ensure that I sent out my
> (brief) thoughts before going away.
Thanks Jens. I'll revise the patchset based on this (basically doing
an opt-out version of what you did on May 26th) and do a v2 post with
the other accumulated fixes/changes. If there is anything else that
needs discussion/review I'm sure Pavel can help us out, he's been
helpful thus far.
--
paul moore
www.paul-moore.com
On Mon, May 31, 2021 at 9:45 AM Richard Guy Briggs <rgb@redhat.com> wrote: > > The commit ("audit: add filtering for io_uring records") added support for > filtering io_uring operations. > > Add checks to the audit io_uring filtering code for directory and path watches, > and to keep the list counts consistent. > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com> > --- > kernel/audit_tree.c | 3 ++- > kernel/audit_watch.c | 3 ++- > kernel/auditfilter.c | 7 +++++-- > 3 files changed, 9 insertions(+), 4 deletions(-) Thanks for pointing these omissions out in the original patch. When a patch has obvious problems generally people just provide feedback and the patch author incorporates the fixes; this helps ensure we don't merge known broken patches, helping preserve `git bisect`. Do you mind if I incorporate these suggestions, and the one in patch 2/2, into the filtering patch in the original RFC patchset? I'll add a 'thank you' comment in the commit description as I did to the other patch where you provided feedback. I feel that is the proper way to handle this. > diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c > index 6c91902f4f45..2be285c2f069 100644 > --- a/kernel/audit_tree.c > +++ b/kernel/audit_tree.c > @@ -727,7 +727,8 @@ int audit_make_tree(struct audit_krule *rule, char *pathname, u32 op) > { > > if (pathname[0] != '/' || > - rule->listnr != AUDIT_FILTER_EXIT || > + (rule->listnr != AUDIT_FILTER_EXIT && > + rule->listnr != AUDIT_FILTER_URING_EXIT) || > op != Audit_equal || > rule->inode_f || rule->watch || rule->tree) > return -EINVAL; > diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c > index 2acf7ca49154..698b62b4a2ec 100644 > --- a/kernel/audit_watch.c > +++ b/kernel/audit_watch.c > @@ -183,7 +183,8 @@ int audit_to_watch(struct audit_krule *krule, char *path, int len, u32 op) > return -EOPNOTSUPP; > > if (path[0] != '/' || path[len-1] == '/' || > - krule->listnr != AUDIT_FILTER_EXIT || > + (krule->listnr != AUDIT_FILTER_EXIT && > + krule->listnr != AUDIT_FILTER_URING_EXIT) || > op != Audit_equal || > krule->inode_f || krule->watch || krule->tree) > return -EINVAL; > diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c > index c21119c00504..bcdedfd1088c 100644 > --- a/kernel/auditfilter.c > +++ b/kernel/auditfilter.c > @@ -153,7 +153,8 @@ char *audit_unpack_string(void **bufp, size_t *remain, size_t len) > static inline int audit_to_inode(struct audit_krule *krule, > struct audit_field *f) > { > - if (krule->listnr != AUDIT_FILTER_EXIT || > + if ((krule->listnr != AUDIT_FILTER_EXIT && > + krule->listnr != AUDIT_FILTER_URING_EXIT) || > krule->inode_f || krule->watch || krule->tree || > (f->op != Audit_equal && f->op != Audit_not_equal)) > return -EINVAL; > @@ -250,6 +251,7 @@ static inline struct audit_entry *audit_to_entry_common(struct audit_rule_data * > pr_err("AUDIT_FILTER_ENTRY is deprecated\n"); > goto exit_err; > case AUDIT_FILTER_EXIT: > + case AUDIT_FILTER_URING_EXIT: > case AUDIT_FILTER_TASK: > #endif > case AUDIT_FILTER_USER: > @@ -982,7 +984,8 @@ static inline int audit_add_rule(struct audit_entry *entry) > } > > entry->rule.prio = ~0ULL; > - if (entry->rule.listnr == AUDIT_FILTER_EXIT) { > + if (entry->rule.listnr == AUDIT_FILTER_EXIT || > + entry->rule.listnr == AUDIT_FILTER_URING_EXIT) { > if (entry->rule.flags & AUDIT_FILTER_PREPEND) > entry->rule.prio = ++prio_high; > else > -- > 2.27.0 -- paul moore www.paul-moore.com
On 2021-06-07 19:15, Paul Moore wrote: > On Mon, May 31, 2021 at 9:45 AM Richard Guy Briggs <rgb@redhat.com> wrote: > > The commit ("audit: add filtering for io_uring records") added support for > > filtering io_uring operations. > > > > Add checks to the audit io_uring filtering code for directory and path watches, > > and to keep the list counts consistent. > > > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com> > > --- > > kernel/audit_tree.c | 3 ++- > > kernel/audit_watch.c | 3 ++- > > kernel/auditfilter.c | 7 +++++-- > > 3 files changed, 9 insertions(+), 4 deletions(-) > > Thanks for pointing these omissions out in the original patch. When a > patch has obvious problems generally people just provide feedback and > the patch author incorporates the fixes; this helps ensure we don't > merge known broken patches, helping preserve `git bisect`. > > Do you mind if I incorporate these suggestions, and the one in patch > 2/2, into the filtering patch in the original RFC patchset? I'll add > a 'thank you' comment in the commit description as I did to the other > patch where you provided feedback. I feel that is the proper way to > handle this. I should have been more explicit. The intent was to have the fixes incorporated directly into your patches since they aren't committed in any public tree yet, exactly for bisect reasons. I didn't add a "fixes:" tag because it had no public commit hash, but could/should have instead made a note about it or even used the "fixup:" subject tag. Posting using the thread as the "in-reply-to:" for this patchset was the simplest and clearest way to demonstrate the intent and check they were correct and I was too lazy to add a cover letter explaining that. There is also a Co-developed-by: tag that could be used if you feel that is appropriate, now that you mention it, but that appears to imply a much stronger equal contribution. > > diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c > > index 6c91902f4f45..2be285c2f069 100644 > > --- a/kernel/audit_tree.c > > +++ b/kernel/audit_tree.c > > @@ -727,7 +727,8 @@ int audit_make_tree(struct audit_krule *rule, char *pathname, u32 op) > > { > > > > if (pathname[0] != '/' || > > - rule->listnr != AUDIT_FILTER_EXIT || > > + (rule->listnr != AUDIT_FILTER_EXIT && > > + rule->listnr != AUDIT_FILTER_URING_EXIT) || > > op != Audit_equal || > > rule->inode_f || rule->watch || rule->tree) > > return -EINVAL; > > diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c > > index 2acf7ca49154..698b62b4a2ec 100644 > > --- a/kernel/audit_watch.c > > +++ b/kernel/audit_watch.c > > @@ -183,7 +183,8 @@ int audit_to_watch(struct audit_krule *krule, char *path, int len, u32 op) > > return -EOPNOTSUPP; > > > > if (path[0] != '/' || path[len-1] == '/' || > > - krule->listnr != AUDIT_FILTER_EXIT || > > + (krule->listnr != AUDIT_FILTER_EXIT && > > + krule->listnr != AUDIT_FILTER_URING_EXIT) || > > op != Audit_equal || > > krule->inode_f || krule->watch || krule->tree) > > return -EINVAL; > > diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c > > index c21119c00504..bcdedfd1088c 100644 > > --- a/kernel/auditfilter.c > > +++ b/kernel/auditfilter.c > > @@ -153,7 +153,8 @@ char *audit_unpack_string(void **bufp, size_t *remain, size_t len) > > static inline int audit_to_inode(struct audit_krule *krule, > > struct audit_field *f) > > { > > - if (krule->listnr != AUDIT_FILTER_EXIT || > > + if ((krule->listnr != AUDIT_FILTER_EXIT && > > + krule->listnr != AUDIT_FILTER_URING_EXIT) || > > krule->inode_f || krule->watch || krule->tree || > > (f->op != Audit_equal && f->op != Audit_not_equal)) > > return -EINVAL; > > @@ -250,6 +251,7 @@ static inline struct audit_entry *audit_to_entry_common(struct audit_rule_data * > > pr_err("AUDIT_FILTER_ENTRY is deprecated\n"); > > goto exit_err; > > case AUDIT_FILTER_EXIT: > > + case AUDIT_FILTER_URING_EXIT: > > case AUDIT_FILTER_TASK: > > #endif > > case AUDIT_FILTER_USER: > > @@ -982,7 +984,8 @@ static inline int audit_add_rule(struct audit_entry *entry) > > } > > > > entry->rule.prio = ~0ULL; > > - if (entry->rule.listnr == AUDIT_FILTER_EXIT) { > > + if (entry->rule.listnr == AUDIT_FILTER_EXIT || > > + entry->rule.listnr == AUDIT_FILTER_URING_EXIT) { > > if (entry->rule.flags & AUDIT_FILTER_PREPEND) > > entry->rule.prio = ++prio_high; > > else > > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
On Tue, Jun 8, 2021 at 8:55 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> I should have been more explicit. The intent was to have the fixes
> incorporated directly into your patches since they aren't committed in
> any public tree yet, exactly for bisect reasons ...
No worries, thanks for the clarification Richard. I just wanted to
make sure since the contribution format was a bit unusual given the
context :)
Regardless, thanks again for the feedback, I'll get this incorporated.
--
paul moore
www.paul-moore.com
On 2021-06-02 13:46, Paul Moore wrote: > On Wed, Jun 2, 2021 at 1:29 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > On 2021-05-21 17:49, Paul Moore wrote: > > > WARNING - This is a work in progress and should not be merged > > > anywhere important. It is almost surely not complete, and while it > > > probably compiles it likely hasn't been booted and will do terrible > > > things. You have been warned. > > > > > > This patch adds basic auditing to io_uring operations, regardless of > > > their context. This is accomplished by allocating audit_context > > > structures for the io-wq worker and io_uring SQPOLL kernel threads > > > as well as explicitly auditing the io_uring operations in > > > io_issue_sqe(). The io_uring operations are audited using a new > > > AUDIT_URINGOP record, an example is shown below: > > > > > > % <TODO - insert AUDIT_URINGOP record example> > > > > > > Thanks to Richard Guy Briggs for review and feedback. > > > > > > Signed-off-by: Paul Moore <paul@paul-moore.com> > > > --- > > > fs/io-wq.c | 4 + > > > fs/io_uring.c | 11 +++ > > > include/linux/audit.h | 17 ++++ > > > include/uapi/linux/audit.h | 1 > > > kernel/audit.h | 2 + > > > kernel/auditsc.c | 173 ++++++++++++++++++++++++++++++++++++++++++++ > > > 6 files changed, 208 insertions(+) > > ... > > > > diff --git a/fs/io_uring.c b/fs/io_uring.c > > > index e481ac8a757a..e9941d1ad8fd 100644 > > > --- a/fs/io_uring.c > > > +++ b/fs/io_uring.c > > > @@ -78,6 +78,7 @@ > > > #include <linux/task_work.h> > > > #include <linux/pagemap.h> > > > #include <linux/io_uring.h> > > > +#include <linux/audit.h> > > > > > > #define CREATE_TRACE_POINTS > > > #include <trace/events/io_uring.h> > > > @@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) > > > if (req->work.creds && req->work.creds != current_cred()) > > > creds = override_creds(req->work.creds); > > > > > > + if (req->opcode < IORING_OP_LAST) > > > + audit_uring_entry(req->opcode); > > Note well the override_creds() call right above the audit code that is > being added, it will be important later in this email. > > > > diff --git a/kernel/auditsc.c b/kernel/auditsc.c > > > index cc89e9f9a753..729849d41631 100644 > > > --- a/kernel/auditsc.c > > > +++ b/kernel/auditsc.c > > > @@ -1536,6 +1562,52 @@ static void audit_log_proctitle(void) > > > audit_log_end(ab); > > > } > > > > > > +/** > > > + * audit_log_uring - generate a AUDIT_URINGOP record > > > + * @ctx: the audit context > > > + */ > > > +static void audit_log_uring(struct audit_context *ctx) > > > +{ > > > + struct audit_buffer *ab; > > > + const struct cred *cred; > > > + > > > + /* > > > + * TODO: What do we log here? I'm tossing in a few things to start the > > > + * conversation, but additional thought needs to go into this. > > > + */ > > > + > > > + ab = audit_log_start(ctx, GFP_KERNEL, AUDIT_URINGOP); > > > + if (!ab) > > > + return; > > > + cred = current_cred(); > > > > This may need to be req->work.creds. I haven't been following if the > > io_uring thread inherited the user task's creds (and below, comm and > > exe). > > Nope, we're good. See the existing code in io_issue_sqe() :) > > > > + audit_log_format(ab, "uring_op=%d", ctx->uring_op); > > > > arch is stored below in __audit_uring_entry() and never used in the > > AUDIT_CTX_URING case. That assignment can either be dropped or printed > > before uring_op similar to the SYSCALL record. > > Good point, I'll drop the code that records the arch from _entry(); it > is really only useful to give the appropriate context if needed for > other things in the audit stream, and that isn't the case like it is > with syscalls. > > > There aren't really any arg[0-3] to print. > > Which is why I didn't print them. > > > io_uring_register and io_uring_setup() args are better covered by other > > records. io_uring_enter() has 6 args and the last two aren't covered by > > SYSCALL anyways. > > ??? > > I think you are confusing the io_uring ops with syscalls; they are > very different things from an audit perspective and the io_uring > auditing is not intended to replace syscall records. The > io_uring_setup() and io_uring_enter() syscalls will be auditing just > as any other syscalls would be using the existing syscall audit code. > > > > + if (ctx->return_valid != AUDITSC_INVALID) > > > + audit_log_format(ab, " success=%s exit=%ld", > > > + (ctx->return_valid == AUDITSC_SUCCESS ? > > > + "yes" : "no"), > > > + ctx->return_code); > > > + audit_log_format(ab, > > > + " items=%d" > > > + " ppid=%d pid=%d auid=%u uid=%u gid=%u" > > > + " euid=%u suid=%u fsuid=%u" > > > + " egid=%u sgid=%u fsgid=%u", > > > + ctx->name_count, > > > + task_ppid_nr(current), > > > + task_tgid_nr(current), > > > + from_kuid(&init_user_ns, audit_get_loginuid(current)), > > > + from_kuid(&init_user_ns, cred->uid), > > > + from_kgid(&init_user_ns, cred->gid), > > > + from_kuid(&init_user_ns, cred->euid), > > > + from_kuid(&init_user_ns, cred->suid), > > > + from_kuid(&init_user_ns, cred->fsuid), > > > + from_kgid(&init_user_ns, cred->egid), > > > + from_kgid(&init_user_ns, cred->sgid), > > > + from_kgid(&init_user_ns, cred->fsgid)); > > > > The audit session ID is still important, relevant and qualifies auid. > > In keeping with the SYSCALL record format, I think we want to keep > > ses=audit_get_sessionid(current) in here. > > This might be another case of syscall/io_uring confusion. An io_uring > op doesn't necessarily have an audit session ID or an audit UID in the > conventional sense; for example think about SQPOLL works, shared > rings, etc. Right, but those syscalls are what instigate io_uring operations, so whatever process starts that operation, or gets handed that handle should be tracked with auid and sessionid (the two work together to track) unless we can easily track io_uring ops to connect them to a previous setup syscall. If we see a need to keep the auid, then the sessionid goes with it. > > I'm pretty sure we also want to keep comm= and exe= too, but may have to > > reach into req->task to get it. There are two values for comm possible, > > one from the original task and second "iou-sqp-<pid>" set at the top of > > io_sq_thread(). > > I think this is more syscall/io_uring confusion. I wouldn't call them confusion but rather parallels, attributing a particular subject to an action. > paul moore - RGB -- Richard Guy Briggs <rgb@redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635
On Tue, Aug 24, 2021 at 9:21 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > On 2021-06-02 13:46, Paul Moore wrote: > > On Wed, Jun 2, 2021 at 1:29 PM Richard Guy Briggs <rgb@redhat.com> wrote: > > > On 2021-05-21 17:49, Paul Moore wrote: > > > > WARNING - This is a work in progress and should not be merged > > > > anywhere important. It is almost surely not complete, and while it > > > > probably compiles it likely hasn't been booted and will do terrible > > > > things. You have been warned. > > > > > > > > This patch adds basic auditing to io_uring operations, regardless of > > > > their context. This is accomplished by allocating audit_context > > > > structures for the io-wq worker and io_uring SQPOLL kernel threads > > > > as well as explicitly auditing the io_uring operations in > > > > io_issue_sqe(). The io_uring operations are audited using a new > > > > AUDIT_URINGOP record, an example is shown below: > > > > > > > > % <TODO - insert AUDIT_URINGOP record example> > > > > > > > > Thanks to Richard Guy Briggs for review and feedback. > > > > > > > > Signed-off-by: Paul Moore <paul@paul-moore.com> > > > > --- > > > > fs/io-wq.c | 4 + > > > > fs/io_uring.c | 11 +++ > > > > include/linux/audit.h | 17 ++++ > > > > include/uapi/linux/audit.h | 1 > > > > kernel/audit.h | 2 + > > > > kernel/auditsc.c | 173 ++++++++++++++++++++++++++++++++++++++++++++ > > > > 6 files changed, 208 insertions(+) ... > > > > + if (ctx->return_valid != AUDITSC_INVALID) > > > > + audit_log_format(ab, " success=%s exit=%ld", > > > > + (ctx->return_valid == AUDITSC_SUCCESS ? > > > > + "yes" : "no"), > > > > + ctx->return_code); > > > > + audit_log_format(ab, > > > > + " items=%d" > > > > + " ppid=%d pid=%d auid=%u uid=%u gid=%u" > > > > + " euid=%u suid=%u fsuid=%u" > > > > + " egid=%u sgid=%u fsgid=%u", > > > > + ctx->name_count, > > > > + task_ppid_nr(current), > > > > + task_tgid_nr(current), > > > > + from_kuid(&init_user_ns, audit_get_loginuid(current)), > > > > + from_kuid(&init_user_ns, cred->uid), > > > > + from_kgid(&init_user_ns, cred->gid), > > > > + from_kuid(&init_user_ns, cred->euid), > > > > + from_kuid(&init_user_ns, cred->suid), > > > > + from_kuid(&init_user_ns, cred->fsuid), > > > > + from_kgid(&init_user_ns, cred->egid), > > > > + from_kgid(&init_user_ns, cred->sgid), > > > > + from_kgid(&init_user_ns, cred->fsgid)); > > > > > > The audit session ID is still important, relevant and qualifies auid. > > > In keeping with the SYSCALL record format, I think we want to keep > > > ses=audit_get_sessionid(current) in here. > > > > This might be another case of syscall/io_uring confusion. An io_uring > > op doesn't necessarily have an audit session ID or an audit UID in the > > conventional sense; for example think about SQPOLL works, shared > > rings, etc. > > Right, but those syscalls are what instigate io_uring operations, so > whatever process starts that operation, or gets handed that handle > should be tracked with auid and sessionid (the two work together to > track) unless we can easily track io_uring ops to connect them to a > previous setup syscall. If we see a need to keep the auid, then the > sessionid goes with it. As a reminder, once the io_uring is created appropriately one can issue io_uring operations without making a syscall. Further, sharing a io_uring across process boundaries means that both the audit session ID and audit login UID used to create the io_uring might not be the same as the subject which issues operations to the io_uring. Any io_uring operations that happen synchronously as the result of a syscall should be associated with the SYSCALL record so the session ID and login UID will be part of the event. Asynchronous operations will not have that information because we don't have a way to get it. -- paul moore www.paul-moore.com