Hi, Here is the v8 series of probe-event to support user-space access. Previous version is here. https://lkml.kernel.org/r/155732230159.12756.15040196512285621636.stgit@devnote2 In this version, I fixed some typos/style issues and renamed fields according to Ingo's comment, and added Ack from Steve. Also this version is rebased on the latest -tip/master tree. Changes in v8: [2/6] Fix style issues and typos according to Ingo's comment. [3/6] Fix style issues according to Ingo's comment. [6/6] Fix a typo and rename user field to user_access field. In summary, strncpy_from_user() should work as below - strncpy_from_user() can access user memory with set_fs(USER_DS) in task context - strncpy_from_user() can access kernel memory with set_fs(KERNEL_DS) in task context (e.g. devtmpfsd and init) - strncpy_from_user() can access user/kernel memory (depends on DS) in IRQ context if pagefault is disabled. (both verified) Note that this changes the warning behavior when CONFIG_DEBUG_ATOMIC_SLEEP=y, it still warns when __copy_from_user_inatomic() is called in IRQ context, but don't warn if pagefault is disabled because it will not sleep in atomic. ==== Kprobe event user-space memory access features: For user-space access extension, this series adds 2 features, "ustring" type and user-space dereference syntax. "ustring" is used for recording a null-terminated string in user-space from kprobe events. "ustring" type is easy, it is able to use instead of "string" type, so if you want to record a user-space string via "__user char *", you can use ustring type instead of string. For example, echo 'p do_sys_open path=+0($arg2):ustring' >> kprobe_events will record the path string from user-space. The user-space dereference syntax is also simple. Thi just adds 'u' prefix before an offset value. +|-u<OFFSET>(<FETCHARG>) e.g. +u8(%ax), +u0(+0(%si)) This is more generic. If you want to refer the variable in user- space from its address or access a field in data structure in user-space, you need to use this. For example, if you probe do_sched_setscheduler(pid, policy, param) and record param->sched_priority, you can add new probe as below; p do_sched_setscheduler priority=+u0($arg3) Actually, with this feature, "ustring" type is not absolutely necessary, because these are same meanings. +0($arg2):ustring == +u0($arg2):string Note that kprobe event provides these methods, but it doesn't change it from kernel to user automatically because we do not know whether the given address is in userspace or kernel on some arch. Thank you, --- Masami Hiramatsu (6): x86/uaccess: Allow access_ok() in irq context if pagefault_disabled uaccess: Add non-pagefault user-space read functions tracing/probe: Add ustring type for user-space string tracing/probe: Support user-space dereference selftests/ftrace: Add user-memory access syntax testcase perf-probe: Add user memory access attribute support Documentation/trace/kprobetrace.rst | 28 ++++- Documentation/trace/uprobetracer.rst | 10 +- arch/x86/include/asm/uaccess.h | 4 - include/linux/uaccess.h | 19 +++ kernel/trace/trace.c | 7 + kernel/trace/trace_kprobe.c | 37 ++++++ kernel/trace/trace_probe.c | 37 +++++- kernel/trace/trace_probe.h | 3 kernel/trace/trace_probe_tmpl.h | 37 +++++- kernel/trace/trace_uprobe.c | 19 +++ mm/maccess.c | 122 +++++++++++++++++++- tools/perf/Documentation/perf-probe.txt | 3 tools/perf/util/probe-event.c | 11 ++ tools/perf/util/probe-event.h | 2 tools/perf/util/probe-file.c | 7 + tools/perf/util/probe-file.h | 1 tools/perf/util/probe-finder.c | 19 ++- .../ftrace/test.d/kprobe/kprobe_args_user.tc | 32 +++++ 18 files changed, 357 insertions(+), 41 deletions(-) create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_user.tc -- Masami Hiramatsu (Linaro) <mhiramat@kernel.org>
WARN_ON_IN_IRQ() assumes that the access_ok() and following user memory access can sleep. But this assumption is not always correct; when the pagefault is disabled, following memory access will just returns -EFAULT and never sleep. Add pagefault_disabled() check in WARN_ON_ONCE() so that it can ignore the case we call it with disabling pagefault. For this purpose, this modified pagefault_disabled() as an inline function. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> --- arch/x86/include/asm/uaccess.h | 4 +++- include/linux/uaccess.h | 5 ++++- 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index c82abd6e4ca3..9c4435307ff8 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -66,7 +66,9 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un }) #ifdef CONFIG_DEBUG_ATOMIC_SLEEP -# define WARN_ON_IN_IRQ() WARN_ON_ONCE(!in_task()) +static inline bool pagefault_disabled(void); +# define WARN_ON_IN_IRQ() \ + WARN_ON_ONCE(!in_task() && !pagefault_disabled()) #else # define WARN_ON_IN_IRQ() #endif diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index 2b70130af585..5a43ef7db492 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -203,7 +203,10 @@ static inline void pagefault_enable(void) /* * Is the pagefault handler disabled? If so, user access methods will not sleep. */ -#define pagefault_disabled() (current->pagefault_disabled != 0) +static inline bool pagefault_disabled(void) +{ + return current->pagefault_disabled != 0; +} /* * The pagefault handler is in general disabled by pagefault_disable() or
Add probe_user_read(), strncpy_from_unsafe_user() and strnlen_unsafe_user() which allows caller to access user-space in IRQ context. Current probe_kernel_read() and strncpy_from_unsafe() are not available for user-space memory, because it sets KERNEL_DS while accessing data. On some arch, user address space and kernel address space can be co-exist, but others can not. In that case, setting KERNEL_DS means given address is treated as a kernel address space. Also strnlen_user() is only available from user context since it can sleep if pagefault is enabled. To access user-space memory without pagefault, we need these new functions which sets USER_DS while accessing the data. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> --- Changes in v8: - Fix style issues and typos according to Ingo's comment (Thanks!) Changes in v6: - Remove user_access_ok() Changes in v5: - Simplify probe_user_read() (Thanks, Peter!) - Add strnlen_unsafe_user() Changes in v3: - Use user_access_ok() for probe_user_read(). Changes in v2: - Simplify strncpy_from_unsafe_user() using strncpy_from_user() according to Linus's suggestion. - Simplify probe_user_read() not using intermediate function. --- include/linux/uaccess.h | 14 +++++ mm/maccess.c | 122 +++++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 130 insertions(+), 6 deletions(-) diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index 5a43ef7db492..9c435c3f2105 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -242,6 +242,17 @@ static inline unsigned long __copy_from_user_inatomic_nocache(void *to, extern long probe_kernel_read(void *dst, const void *src, size_t size); extern long __probe_kernel_read(void *dst, const void *src, size_t size); +/* + * probe_user_read(): safely attempt to read from a location in user space + * @dst: pointer to the buffer that shall take the data + * @src: address to read from + * @size: size of the data chunk + * + * Safely read from address @src to the buffer at @dst. If a kernel fault + * happens, handle that and return -EFAULT. + */ +extern long probe_user_read(void *dst, const void __user *src, size_t size); + /* * probe_kernel_write(): safely attempt to write to a location * @dst: address to write to @@ -255,6 +266,9 @@ extern long notrace probe_kernel_write(void *dst, const void *src, size_t size); extern long notrace __probe_kernel_write(void *dst, const void *src, size_t size); extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count); +extern long strncpy_from_unsafe_user(char *dst, const void __user *unsafe_addr, + long count); +extern long strnlen_unsafe_user(const void __user *unsafe_addr, long count); /** * probe_kernel_address(): safely attempt to read from a location diff --git a/mm/maccess.c b/mm/maccess.c index ec00be51a24f..19c8c3dc14df 100644 --- a/mm/maccess.c +++ b/mm/maccess.c @@ -5,8 +5,20 @@ #include <linux/mm.h> #include <linux/uaccess.h> +static __always_inline long +probe_read_common(void *dst, const void __user *src, size_t size) +{ + long ret; + + pagefault_disable(); + ret = __copy_from_user_inatomic(dst, src, size); + pagefault_enable(); + + return ret ? -EFAULT : 0; +} + /** - * probe_kernel_read(): safely attempt to read from a location + * probe_kernel_read(): safely attempt to read from a kernel-space location * @dst: pointer to the buffer that shall take the data * @src: address to read from * @size: size of the data chunk @@ -29,16 +41,40 @@ long __probe_kernel_read(void *dst, const void *src, size_t size) mm_segment_t old_fs = get_fs(); set_fs(KERNEL_DS); - pagefault_disable(); - ret = __copy_from_user_inatomic(dst, - (__force const void __user *)src, size); - pagefault_enable(); + ret = probe_read_common(dst, (__force const void __user *)src, size); set_fs(old_fs); - return ret ? -EFAULT : 0; + return ret; } EXPORT_SYMBOL_GPL(probe_kernel_read); +/** + * probe_user_read(): safely attempt to read from a user-space location + * @dst: pointer to the buffer that shall take the data + * @src: address to read from. This must be a user address. + * @size: size of the data chunk + * + * Safely read from user address @src to the buffer at @dst. If a kernel fault + * happens, handle that and return -EFAULT. + */ + +long __weak probe_user_read(void *dst, const void __user *src, size_t size) + __attribute__((alias("__probe_user_read"))); + +long __probe_user_read(void *dst, const void __user *src, size_t size) +{ + long ret = -EFAULT; + mm_segment_t old_fs = get_fs(); + + set_fs(USER_DS); + if (access_ok(src, size)) + ret = probe_read_common(dst, src, size); + set_fs(old_fs); + + return ret; +} +EXPORT_SYMBOL_GPL(probe_user_read); + /** * probe_kernel_write(): safely attempt to write to a location * @dst: address to write to @@ -66,6 +102,7 @@ long __probe_kernel_write(void *dst, const void *src, size_t size) } EXPORT_SYMBOL_GPL(probe_kernel_write); + /** * strncpy_from_unsafe: - Copy a NUL terminated string from unsafe address. * @dst: Destination address, in kernel space. This buffer must be at @@ -105,3 +142,76 @@ long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count) return ret ? -EFAULT : src - unsafe_addr; } + +/** + * strncpy_from_unsafe_user: - Copy a NUL terminated string from unsafe user + * address. + * @dst: Destination address, in kernel space. This buffer must be at + * least @count bytes long. + * @unsafe_addr: Unsafe user address. + * @count: Maximum number of bytes to copy, including the trailing NUL. + * + * Copies a NUL-terminated string from unsafe user address to kernel buffer. + * + * On success, returns the length of the string INCLUDING the trailing NUL. + * + * If access fails, returns -EFAULT (some data may have been copied + * and the trailing NUL added). + * + * If @count is smaller than the length of the string, copies @count-1 bytes, + * sets the last byte of @dst buffer to NUL and returns @count. + */ +long strncpy_from_unsafe_user(char *dst, const void __user *unsafe_addr, + long count) +{ + mm_segment_t old_fs = get_fs(); + long ret; + + if (unlikely(count <= 0)) + return 0; + + set_fs(USER_DS); + pagefault_disable(); + ret = strncpy_from_user(dst, unsafe_addr, count); + pagefault_enable(); + set_fs(old_fs); + + if (ret >= count) { + ret = count; + dst[ret - 1] = '\0'; + } else if (ret > 0) { + ret++; + } + + return ret; +} + +/** + * strnlen_unsafe_user: - Get the size of a user string INCLUDING final NUL. + * @unsafe_addr: The string to measure. + * @count: Maximum count (including NUL) + * + * Get the size of a NUL-terminated string in user space without pagefault. + * + * Returns the size of the string INCLUDING the terminating NUL. + * + * If the string is too long, returns a number larger than @count. User + * has to check the return value against "> count". + * On exception (or invalid count), returns 0. + * + * Unlike strnlen_user, this can be used from IRQ handler etc. because + * it disables pagefaults. + */ +long strnlen_unsafe_user(const void __user *unsafe_addr, long count) +{ + mm_segment_t old_fs = get_fs(); + int ret; + + set_fs(USER_DS); + pagefault_disable(); + ret = strnlen_user(unsafe_addr, count); + pagefault_enable(); + set_fs(old_fs); + + return ret; +}
Add "ustring" type for fetching user-space string from kprobe event. User can specify ustring type at uprobe event, and it is same as "string" for uprobe. Note that probe-event provides this option but it doesn't choose the correct type automatically since we have not way to decide the address is in user-space or not on some arch (and on some other arch, you can fetch the string by "string" type). So user must carefully check the target code (e.g. if you see __user on the target variable) and use this new type. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> --- Changes in v8: - Fix style issues according to Ingo's comment (Thanks!) Changes in v5: - Use strnlen_unsafe_user() in fetch_store_strlen_user(). Changes in v2: - Use strnlen_user() instead of open code for fetch_store_strlen_user(). --- Documentation/trace/kprobetrace.rst | 9 +++++++-- kernel/trace/trace.c | 2 +- kernel/trace/trace_kprobe.c | 31 +++++++++++++++++++++++++++++++ kernel/trace/trace_probe.c | 14 +++++++++++--- kernel/trace/trace_probe.h | 1 + kernel/trace/trace_probe_tmpl.h | 14 +++++++++++++- kernel/trace/trace_uprobe.c | 12 ++++++++++++ 7 files changed, 76 insertions(+), 7 deletions(-) diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst index 235ce2ab131a..a3ac7c9ac242 100644 --- a/Documentation/trace/kprobetrace.rst +++ b/Documentation/trace/kprobetrace.rst @@ -55,7 +55,8 @@ Synopsis of kprobe_events NAME=FETCHARG : Set NAME as the argument name of FETCHARG. FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types - (x8/x16/x32/x64), "string" and bitfield are supported. + (x8/x16/x32/x64), "string", "ustring" and bitfield + are supported. (\*1) only for the probe on function entry (offs == 0). (\*2) only for return probe. @@ -77,7 +78,11 @@ apply it to registers/stack-entries etc. (for example, '$stack1:x8[8]' is wrong, but '+8($stack):x8[8]' is OK.) String type is a special type, which fetches a "null-terminated" string from kernel space. This means it will fail and store NULL if the string container -has been paged out. +has been paged out. "ustring" type is an alternative of string for user-space. +Note that kprobe-event provides string/ustring types, but doesn't change it +automatically. So user has to decide if the targe string in kernel or in user +space carefully. On some arch, if you choose wrong one, it always fails to +record string data. The string array type is a bit different from other types. For other base types, <base-type>[1] is equal to <base-type> (e.g. +0(%di):x32[1] is same as +0(%di):x32.) But string[1] is not equal to string. The string type itself diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index ec439999f387..69687ad83228 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4812,7 +4812,7 @@ static const char readme_msg[] = "\t $stack<index>, $stack, $retval, $comm\n" #endif "\t type: s8/16/32/64, u8/16/32/64, x8/16/32/64, string, symbol,\n" - "\t b<bit-width>@<bit-offset>/<container-size>,\n" + "\t b<bit-width>@<bit-offset>/<container-size>, ustring,\n" "\t <type>\\[<array-size>\\]\n" #ifdef CONFIG_HIST_TRIGGERS "\t field: <stype> <name>;\n" diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 5d5129b05df7..6279ed26a739 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -871,6 +871,14 @@ fetch_store_strlen(unsigned long addr) return (ret < 0) ? ret : len; } +/* Return the length of string -- including null terminal byte */ +static nokprobe_inline int +fetch_store_strlen_user(unsigned long addr) +{ + return strnlen_unsafe_user((__force const void __user *)addr, + MAX_STRING_SIZE); +} + /* * Fetch a null-terminated string. Caller MUST set *(u32 *)buf with max * length and relative data location. @@ -892,6 +900,29 @@ fetch_store_string(unsigned long addr, void *dest, void *base) if (ret >= 0) *(u32 *)dest = make_data_loc(ret, (void *)dst - base); + + return ret; +} + +/* + * Fetch a null-terminated string from user. Caller MUST set *(u32 *)buf + * with max length and relative data location. + */ +static nokprobe_inline int +fetch_store_string_user(unsigned long addr, void *dest, void *base) +{ + const void __user *uaddr = (__force const void __user *)addr; + int maxlen = get_loc_len(*(u32 *)dest); + u8 *dst = get_loc_data(dest, base); + long ret; + + if (unlikely(!maxlen)) + return -ENOMEM; + ret = strncpy_from_unsafe_user(dst, uaddr, maxlen); + + if (ret >= 0) + *(u32 *)dest = make_data_loc(ret, (void *)dst - base); + return ret; } diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c index 8f8411e7835f..30054136cfde 100644 --- a/kernel/trace/trace_probe.c +++ b/kernel/trace/trace_probe.c @@ -73,6 +73,8 @@ static const struct fetch_type probe_fetch_types[] = { /* Special types */ __ASSIGN_FETCH_TYPE("string", string, string, sizeof(u32), 1, "__data_loc char[]"), + __ASSIGN_FETCH_TYPE("ustring", string, string, sizeof(u32), 1, + "__data_loc char[]"), /* Basic types */ ASSIGN_FETCH_TYPE(u8, u8, 0), ASSIGN_FETCH_TYPE(u16, u16, 0), @@ -455,7 +457,8 @@ static int traceprobe_parse_probe_arg_body(char *arg, ssize_t *size, goto fail; /* Store operation */ - if (!strcmp(parg->type->name, "string")) { + if (!strcmp(parg->type->name, "string") || + !strcmp(parg->type->name, "ustring")) { if (code->op != FETCH_OP_DEREF && code->op != FETCH_OP_IMM && code->op != FETCH_OP_COMM) { pr_info("string only accepts memory or address.\n"); @@ -474,7 +477,11 @@ static int traceprobe_parse_probe_arg_body(char *arg, ssize_t *size, goto fail; } } - code->op = FETCH_OP_ST_STRING; /* In DEREF case, replace it */ + /* If op == DEREF, replace it with STRING */ + if (!strcmp(parg->type->name, "ustring")) + code->op = FETCH_OP_ST_USTRING; + else + code->op = FETCH_OP_ST_STRING; code->size = parg->type->size; parg->dynamic = true; } else if (code->op == FETCH_OP_DEREF) { @@ -499,7 +506,8 @@ static int traceprobe_parse_probe_arg_body(char *arg, ssize_t *size, /* Loop(Array) operation */ if (parg->count) { if (scode->op != FETCH_OP_ST_MEM && - scode->op != FETCH_OP_ST_STRING) { + scode->op != FETCH_OP_ST_STRING && + scode->op != FETCH_OP_ST_USTRING) { pr_info("array only accepts memory or address\n"); ret = -EINVAL; goto fail; diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h index 2177c206de15..94cdcfdaced0 100644 --- a/kernel/trace/trace_probe.h +++ b/kernel/trace/trace_probe.h @@ -96,6 +96,7 @@ enum fetch_op { FETCH_OP_ST_RAW, /* Raw: .size */ FETCH_OP_ST_MEM, /* Mem: .offset, .size */ FETCH_OP_ST_STRING, /* String: .offset, .size */ + FETCH_OP_ST_USTRING, /* User String: .offset, .size */ // Stage 4 (modify) op FETCH_OP_MOD_BF, /* Bitfield: .basesize, .lshift, .rshift */ // Stage 5 (loop) op diff --git a/kernel/trace/trace_probe_tmpl.h b/kernel/trace/trace_probe_tmpl.h index 4737bb8c07a3..7526f6f8d7b0 100644 --- a/kernel/trace/trace_probe_tmpl.h +++ b/kernel/trace/trace_probe_tmpl.h @@ -59,6 +59,9 @@ process_fetch_insn(struct fetch_insn *code, struct pt_regs *regs, static nokprobe_inline int fetch_store_strlen(unsigned long addr); static nokprobe_inline int fetch_store_string(unsigned long addr, void *dest, void *base); +static nokprobe_inline int fetch_store_strlen_user(unsigned long addr); +static nokprobe_inline int +fetch_store_string_user(unsigned long addr, void *dest, void *base); static nokprobe_inline int probe_mem_read(void *dest, void *src, size_t size); @@ -91,6 +94,10 @@ process_fetch_insn_bottom(struct fetch_insn *code, unsigned long val, ret += fetch_store_strlen(val + code->offset); code++; goto array; + } else if (code->op == FETCH_OP_ST_USTRING) { + ret += fetch_store_strlen_user(val + code->offset); + code++; + goto array; } else return -EILSEQ; } @@ -106,6 +113,10 @@ process_fetch_insn_bottom(struct fetch_insn *code, unsigned long val, loc = *(u32 *)dest; ret = fetch_store_string(val + code->offset, dest, base); break; + case FETCH_OP_ST_USTRING: + loc = *(u32 *)dest; + ret = fetch_store_string_user(val + code->offset, dest, base); + break; default: return -EILSEQ; } @@ -123,7 +134,8 @@ process_fetch_insn_bottom(struct fetch_insn *code, unsigned long val, total += ret; if (++i < code->param) { code = s3; - if (s3->op != FETCH_OP_ST_STRING) { + if (s3->op != FETCH_OP_ST_STRING && + s3->op != FETCH_OP_ST_USTRING) { dest += s3->size; val += s3->size; goto stage3; diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index be78d99ee6bc..f4e37c4f8a21 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -173,6 +173,12 @@ fetch_store_string(unsigned long addr, void *dest, void *base) return ret; } +static nokprobe_inline int +fetch_store_string_user(unsigned long addr, void *dest, void *base) +{ + return fetch_store_string(addr, dest, base); +} + /* Return the length of string -- including null terminal byte */ static nokprobe_inline int fetch_store_strlen(unsigned long addr) @@ -185,6 +191,12 @@ fetch_store_strlen(unsigned long addr) return (len > MAX_STRING_SIZE) ? 0 : len; } +static nokprobe_inline int +fetch_store_strlen_user(unsigned long addr) +{ + return fetch_store_strlen(addr); +} + static unsigned long translate_user_vaddr(unsigned long file_offset) { unsigned long base_addr;
Support user-space dereference syntax for probe event arguments to dereference the data-structure or array in user-space. The syntax is just adding 'u' before an offset value. +|-u<OFFSET>(<FETCHARG>) e.g. +u8(%ax), +u0(+0(%si)) For example, if you probe do_sched_setscheduler(pid, policy, param) and record param->sched_priority, you can add new probe as below; p do_sched_setscheduler priority=+u0($arg3) Note that kprobe event provides this and it doesn't change the dereference method automatically because we do not know whether the given address is in userspace or kernel on some archs. So as same as "ustring", this is an option for user, who has to carefully choose the dereference method. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> --- Changes in v7: - Fix typos and update document according to Steve's comment. Changes in v4: - Fix a bug for parsing argument and simplify code. - Fix documents accroding to Steve's comment. Changes in v3: - Add a section for user memory access to the document. --- Documentation/trace/kprobetrace.rst | 27 ++++++++++++++++++++++----- Documentation/trace/uprobetracer.rst | 10 ++++++---- kernel/trace/trace.c | 5 +++-- kernel/trace/trace_kprobe.c | 6 ++++++ kernel/trace/trace_probe.c | 25 ++++++++++++++++++------- kernel/trace/trace_probe.h | 2 ++ kernel/trace/trace_probe_tmpl.h | 23 ++++++++++++++++++----- kernel/trace/trace_uprobe.c | 7 +++++++ 8 files changed, 82 insertions(+), 23 deletions(-) diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst index a3ac7c9ac242..09ff474493e1 100644 --- a/Documentation/trace/kprobetrace.rst +++ b/Documentation/trace/kprobetrace.rst @@ -51,7 +51,7 @@ Synopsis of kprobe_events $argN : Fetch the Nth function argument. (N >= 1) (\*1) $retval : Fetch return value.(\*2) $comm : Fetch current task comm. - +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(\*3) + +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4) NAME=FETCHARG : Set NAME as the argument name of FETCHARG. FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types @@ -61,6 +61,7 @@ Synopsis of kprobe_events (\*1) only for the probe on function entry (offs == 0). (\*2) only for return probe. (\*3) this is useful for fetching a field of data structures. + (\*4) "u" means user-space dereference. See :ref:`user_mem_access`. Types ----- @@ -79,10 +80,7 @@ wrong, but '+8($stack):x8[8]' is OK.) String type is a special type, which fetches a "null-terminated" string from kernel space. This means it will fail and store NULL if the string container has been paged out. "ustring" type is an alternative of string for user-space. -Note that kprobe-event provides string/ustring types, but doesn't change it -automatically. So user has to decide if the targe string in kernel or in user -space carefully. On some arch, if you choose wrong one, it always fails to -record string data. +See :ref:`user_mem_access` for more info.. The string array type is a bit different from other types. For other base types, <base-type>[1] is equal to <base-type> (e.g. +0(%di):x32[1] is same as +0(%di):x32.) But string[1] is not equal to string. The string type itself @@ -97,6 +95,25 @@ Symbol type('symbol') is an alias of u32 or u64 type (depends on BITS_PER_LONG) which shows given pointer in "symbol+offset" style. For $comm, the default type is "string"; any other type is invalid. +.. _user_mem_access: +User Memory Access +------------------ +Kprobe events supports user-space memory access. For that purpose, you can use +either user-space dereference syntax or 'ustring' type. + +The user-space dereference syntax allows you to access a field of a data +structure in user-space. This is done by adding the "u" prefix to the +dereference syntax. For example, +u4(%si) means it will read memory from the +address in the register %si offset by 4, and the memory is expected to be in +user-space. You can use this for strings too, e.g. +u0(%si):string will read +a string from the address in the register %si that is expected to be in user- +space. 'ustring' is a shortcut way of performing the same task. That is, ++0(%si):ustring is equivalent to +u0(%si):string. + +Note that kprobe-event provides the user-memory access syntax but it doesn't +use it transparently. This means if you use normal dereference or string type +for user memory, it might fail, and may always fail on some archs. The user +has to carefully check if the target data is in kernel or user space. Per-Probe Event Filtering ------------------------- diff --git a/Documentation/trace/uprobetracer.rst b/Documentation/trace/uprobetracer.rst index 4346e23e3ae7..ab13319c66ac 100644 --- a/Documentation/trace/uprobetracer.rst +++ b/Documentation/trace/uprobetracer.rst @@ -42,16 +42,18 @@ Synopsis of uprobe_tracer @+OFFSET : Fetch memory at OFFSET (OFFSET from same file as PATH) $stackN : Fetch Nth entry of stack (N >= 0) $stack : Fetch stack address. - $retval : Fetch return value.(*) + $retval : Fetch return value.(\*1) $comm : Fetch current task comm. - +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(**) + +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*2)(\*3) NAME=FETCHARG : Set NAME as the argument name of FETCHARG. FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types (x8/x16/x32/x64), "string" and bitfield are supported. - (*) only for return probe. - (**) this is useful for fetching a field of data structures. + (\*1) only for return probe. + (\*2) this is useful for fetching a field of data structures. + (\*3) Unlike kprobe event, "u" prefix will just be ignored, becuse uprobe + events can access only user-space memory. Types ----- diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 69687ad83228..a5ec61866cd6 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4807,10 +4807,11 @@ static const char readme_msg[] = "\t args: <name>=fetcharg[:type]\n" "\t fetcharg: %<register>, @<address>, @<symbol>[+|-<offset>],\n" #ifdef CONFIG_HAVE_FUNCTION_ARG_ACCESS_API - "\t $stack<index>, $stack, $retval, $comm, $arg<N>\n" + "\t $stack<index>, $stack, $retval, $comm, $arg<N>,\n" #else - "\t $stack<index>, $stack, $retval, $comm\n" + "\t $stack<index>, $stack, $retval, $comm,\n" #endif + "\t +|-[u]<offset>(<fetcharg>)\n" "\t type: s8/16/32/64, u8/16/32/64, x8/16/32/64, string, symbol,\n" "\t b<bit-width>@<bit-offset>/<container-size>, ustring,\n" "\t <type>\\[<array-size>\\]\n" diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 6279ed26a739..3663ae6bd735 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -932,6 +932,12 @@ probe_mem_read(void *dest, void *src, size_t size) return probe_kernel_read(dest, src, size); } +static nokprobe_inline int +probe_mem_read_user(void *dest, void *src, size_t size) +{ + return probe_user_read(dest, src, size); +} + /* Note that we don't verify it, since the code does not come from user space */ static int process_fetch_insn(struct fetch_insn *code, struct pt_regs *regs, void *dest, diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c index 30054136cfde..00771e7b6ef8 100644 --- a/kernel/trace/trace_probe.c +++ b/kernel/trace/trace_probe.c @@ -253,6 +253,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type, { struct fetch_insn *code = *pcode; unsigned long param; + int deref = FETCH_OP_DEREF; long offset = 0; char *tmp; int ret = 0; @@ -315,9 +316,14 @@ parse_probe_arg(char *arg, const struct fetch_type *type, break; case '+': /* deref memory */ - arg++; /* Skip '+', because kstrtol() rejects it. */ - /* fall through */ case '-': + if (arg[1] == 'u') { + deref = FETCH_OP_UDEREF; + arg[1] = arg[0]; + arg++; + } + if (arg[0] == '+') + arg++; /* Skip '+', because kstrtol() rejects it. */ tmp = strchr(arg, '('); if (!tmp) return -EINVAL; @@ -343,7 +349,7 @@ parse_probe_arg(char *arg, const struct fetch_type *type, return -E2BIG; *pcode = code; - code->op = FETCH_OP_DEREF; + code->op = deref; code->offset = offset; } break; @@ -459,13 +465,14 @@ static int traceprobe_parse_probe_arg_body(char *arg, ssize_t *size, /* Store operation */ if (!strcmp(parg->type->name, "string") || !strcmp(parg->type->name, "ustring")) { - if (code->op != FETCH_OP_DEREF && code->op != FETCH_OP_IMM && - code->op != FETCH_OP_COMM) { + if (code->op != FETCH_OP_DEREF && code->op != FETCH_OP_UDEREF + && code->op != FETCH_OP_IMM && code->op != FETCH_OP_COMM) { pr_info("string only accepts memory or address.\n"); ret = -EINVAL; goto fail; } - if (code->op != FETCH_OP_DEREF || parg->count) { + if ((code->op == FETCH_OP_IMM || code->op == FETCH_OP_COMM) + || parg->count) { /* * IMM and COMM is pointing actual address, those must * be kept, and if parg->count != 0, this is an array @@ -478,7 +485,8 @@ static int traceprobe_parse_probe_arg_body(char *arg, ssize_t *size, } } /* If op == DEREF, replace it with STRING */ - if (!strcmp(parg->type->name, "ustring")) + if (!strcmp(parg->type->name, "ustring") || + code->op == FETCH_OP_UDEREF) code->op = FETCH_OP_ST_USTRING; else code->op = FETCH_OP_ST_STRING; @@ -487,6 +495,9 @@ static int traceprobe_parse_probe_arg_body(char *arg, ssize_t *size, } else if (code->op == FETCH_OP_DEREF) { code->op = FETCH_OP_ST_MEM; code->size = parg->type->size; + } else if (code->op == FETCH_OP_UDEREF) { + code->op = FETCH_OP_ST_UMEM; + code->size = parg->type->size; } else { code++; if (code->op != FETCH_OP_NOP) { diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h index 94cdcfdaced0..0feac0a81f82 100644 --- a/kernel/trace/trace_probe.h +++ b/kernel/trace/trace_probe.h @@ -92,9 +92,11 @@ enum fetch_op { FETCH_OP_FOFFS, /* File offset: .immediate */ // Stage 2 (dereference) op FETCH_OP_DEREF, /* Dereference: .offset */ + FETCH_OP_UDEREF, /* User-space Dereference: .offset */ // Stage 3 (store) ops FETCH_OP_ST_RAW, /* Raw: .size */ FETCH_OP_ST_MEM, /* Mem: .offset, .size */ + FETCH_OP_ST_UMEM, /* Mem: .offset, .size */ FETCH_OP_ST_STRING, /* String: .offset, .size */ FETCH_OP_ST_USTRING, /* User String: .offset, .size */ // Stage 4 (modify) op diff --git a/kernel/trace/trace_probe_tmpl.h b/kernel/trace/trace_probe_tmpl.h index 7526f6f8d7b0..06f2d901c4cf 100644 --- a/kernel/trace/trace_probe_tmpl.h +++ b/kernel/trace/trace_probe_tmpl.h @@ -64,6 +64,8 @@ static nokprobe_inline int fetch_store_string_user(unsigned long addr, void *dest, void *base); static nokprobe_inline int probe_mem_read(void *dest, void *src, size_t size); +static nokprobe_inline int +probe_mem_read_user(void *dest, void *src, size_t size); /* From the 2nd stage, routine is same */ static nokprobe_inline int @@ -77,14 +79,21 @@ process_fetch_insn_bottom(struct fetch_insn *code, unsigned long val, stage2: /* 2nd stage: dereference memory if needed */ - while (code->op == FETCH_OP_DEREF) { - lval = val; - ret = probe_mem_read(&val, (void *)val + code->offset, - sizeof(val)); + do { + if (code->op == FETCH_OP_DEREF) { + lval = val; + ret = probe_mem_read(&val, (void *)val + code->offset, + sizeof(val)); + } else if (code->op == FETCH_OP_UDEREF) { + lval = val; + ret = probe_mem_read_user(&val, + (void *)val + code->offset, sizeof(val)); + } else + break; if (ret) return ret; code++; - } + } while (1); s3 = code; stage3: @@ -109,6 +118,10 @@ process_fetch_insn_bottom(struct fetch_insn *code, unsigned long val, case FETCH_OP_ST_MEM: probe_mem_read(dest, (void *)val + code->offset, code->size); break; + case FETCH_OP_ST_UMEM: + probe_mem_read_user(dest, (void *)val + code->offset, + code->size); + break; case FETCH_OP_ST_STRING: loc = *(u32 *)dest; ret = fetch_store_string(val + code->offset, dest, base); diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index f4e37c4f8a21..5bc8c3686f6f 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -140,6 +140,13 @@ probe_mem_read(void *dest, void *src, size_t size) return copy_from_user(dest, vaddr, size) ? -EFAULT : 0; } + +static nokprobe_inline int +probe_mem_read_user(void *dest, void *src, size_t size) +{ + return probe_mem_read(dest, src, size); +} + /* * Fetch a null-terminated string. Caller MUST set *(u32 *)dest with max * length and relative data location.
Add a user-memory access syntax testcase which checks new user-memory access syntax and ustring type. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> --- Changes in v6: - Add $argN availability check --- .../ftrace/test.d/kprobe/kprobe_args_user.tc | 32 ++++++++++++++++++++ 1 file changed, 32 insertions(+) create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_user.tc diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_user.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_user.tc new file mode 100644 index 000000000000..0f60087583d8 --- /dev/null +++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_user.tc @@ -0,0 +1,32 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 +# description: Kprobe event user-memory access + +[ -f kprobe_events ] || exit_unsupported # this is configurable + +grep -q '\$arg<N>' README || exit_unresolved # depends on arch +grep -A10 "fetcharg:" README | grep -q 'ustring' || exit_unsupported +grep -A10 "fetcharg:" README | grep -q '\[u\]<offset>' || exit_unsupported + +:;: "user-memory access syntax and ustring working on user memory";: +echo 'p:myevent do_sys_open path=+0($arg2):ustring path2=+u0($arg2):string' \ + > kprobe_events + +grep myevent kprobe_events | \ + grep -q 'path=+0($arg2):ustring path2=+u0($arg2):string' +echo 1 > events/kprobes/myevent/enable +echo > /dev/null +echo 0 > events/kprobes/myevent/enable + +grep myevent trace | grep -q 'path="/dev/null" path2="/dev/null"' + +:;: "user-memory access syntax and ustring not working with kernel memory";: +echo 'p:myevent vfs_symlink path=+0($arg3):ustring path2=+u0($arg3):string' \ + > kprobe_events +echo 1 > events/kprobes/myevent/enable +ln -s foo $TMPDIR/bar +echo 0 > events/kprobes/myevent/enable + +grep myevent trace | grep -q 'path=(fault) path2=(fault)' + +exit 0
Add user memory access attribute for kprobe event arguments. If a given 'local variable' is in user-space, User can specify memory access method by '@user' suffix. This is not only for string but also for data structure. If we access a field of data structure in user memory from kernel on some arch, it will fail. e.g. perf probe -a "sched_setscheduler param->sched_priority" This will fail to access the "param->sched_priority" because the param is __user pointer. Instead, we can now specify @user suffix for such argument. perf probe -a "sched_setscheduler param->sched_priority@user" Note that kernel memory access with "@user" must always fail on any arch. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> --- Changes in v8 - Fix a typo and rename user field to user_access field. (Thanks Ingo!) --- tools/perf/Documentation/perf-probe.txt | 3 ++- tools/perf/util/probe-event.c | 11 +++++++++++ tools/perf/util/probe-event.h | 2 ++ tools/perf/util/probe-file.c | 7 +++++++ tools/perf/util/probe-file.h | 1 + tools/perf/util/probe-finder.c | 19 ++++++++++++------- 6 files changed, 35 insertions(+), 8 deletions(-) diff --git a/tools/perf/Documentation/perf-probe.txt b/tools/perf/Documentation/perf-probe.txt index b6866a05edd2..ed3ecfa422e1 100644 --- a/tools/perf/Documentation/perf-probe.txt +++ b/tools/perf/Documentation/perf-probe.txt @@ -194,12 +194,13 @@ PROBE ARGUMENT -------------- Each probe argument follows below syntax. - [NAME=]LOCALVAR|$retval|%REG|@SYMBOL[:TYPE] + [NAME=]LOCALVAR|$retval|%REG|@SYMBOL[:TYPE][@user] 'NAME' specifies the name of this argument (optional). You can use the name of local variable, local data structure member (e.g. var->field, var.field2), local array with fixed index (e.g. array[1], var->array[0], var->pointer[2]), or kprobe-tracer argument format (e.g. $retval, %ax, etc). Note that the name of this argument will be set as the last member name if you specify a local data structure member (e.g. field2 for 'var->field1.field2'.) '$vars' and '$params' special arguments are also available for NAME, '$vars' is expanded to the local variables (including function parameters) which can access at given probe point. '$params' is expanded to only the function parameters. 'TYPE' casts the type of this argument (optional). If omitted, perf probe automatically set the type based on debuginfo (*). Currently, basic types (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal integers (x/x8/x16/x32/x64), signedness casting (u/s), "string" and bitfield are supported. (see TYPES for detail) On x86 systems %REG is always the short form of the register: for example %AX. %RAX or %EAX is not valid. +"@user" is a special attribute which means the LOCALVAR will be treated as a user-space memory. This is only valid for kprobe event. TYPES ----- diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c index 198e09ff611e..a7ca17be5fc5 100644 --- a/tools/perf/util/probe-event.c +++ b/tools/perf/util/probe-event.c @@ -1577,6 +1577,17 @@ static int parse_perf_probe_arg(char *str, struct perf_probe_arg *arg) str = tmp + 1; } + tmp = strchr(str, '@'); + if (tmp && tmp != str && strcmp(tmp + 1, "user")) { /* user attr */ + if (!user_access_is_supported()) { + semantic_error("ftrace does not support user access\n"); + return -EINVAL; + } + *tmp = '\0'; + arg->user_access = true; + pr_debug("user_access "); + } + tmp = strchr(str, ':'); if (tmp) { /* Type setting */ *tmp = '\0'; diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h index 05c8d571a901..96a319cd2378 100644 --- a/tools/perf/util/probe-event.h +++ b/tools/perf/util/probe-event.h @@ -37,6 +37,7 @@ struct probe_trace_point { struct probe_trace_arg_ref { struct probe_trace_arg_ref *next; /* Next reference */ long offset; /* Offset value */ + bool user_access; /* User-memory access */ }; /* kprobe-tracer and uprobe-tracer tracing argument */ @@ -82,6 +83,7 @@ struct perf_probe_arg { char *var; /* Variable name */ char *type; /* Type name */ struct perf_probe_arg_field *field; /* Structure fields */ + bool user_access; /* User-memory access */ }; /* Perf probe probing event (point + arg) */ diff --git a/tools/perf/util/probe-file.c b/tools/perf/util/probe-file.c index 4062bc4412a9..89ce1a9c3798 100644 --- a/tools/perf/util/probe-file.c +++ b/tools/perf/util/probe-file.c @@ -1015,6 +1015,7 @@ enum ftrace_readme { FTRACE_README_PROBE_TYPE_X = 0, FTRACE_README_KRETPROBE_OFFSET, FTRACE_README_UPROBE_REF_CTR, + FTRACE_README_USER_ACCESS, FTRACE_README_END, }; @@ -1027,6 +1028,7 @@ static struct { DEFINE_TYPE(FTRACE_README_PROBE_TYPE_X, "*type: * x8/16/32/64,*"), DEFINE_TYPE(FTRACE_README_KRETPROBE_OFFSET, "*place (kretprobe): *"), DEFINE_TYPE(FTRACE_README_UPROBE_REF_CTR, "*ref_ctr_offset*"), + DEFINE_TYPE(FTRACE_README_USER_ACCESS, "*[u]<offset>*"), }; static bool scan_ftrace_readme(enum ftrace_readme type) @@ -1087,3 +1089,8 @@ bool uprobe_ref_ctr_is_supported(void) { return scan_ftrace_readme(FTRACE_README_UPROBE_REF_CTR); } + +bool user_access_is_supported(void) +{ + return scan_ftrace_readme(FTRACE_README_USER_ACCESS); +} diff --git a/tools/perf/util/probe-file.h b/tools/perf/util/probe-file.h index 2a249182f2a6..986c1c94f64f 100644 --- a/tools/perf/util/probe-file.h +++ b/tools/perf/util/probe-file.h @@ -70,6 +70,7 @@ int probe_cache__show_all_caches(struct strfilter *filter); bool probe_type_is_available(enum probe_type type); bool kretprobe_offset_is_supported(void); bool uprobe_ref_ctr_is_supported(void); +bool user_access_is_supported(void); #else /* ! HAVE_LIBELF_SUPPORT */ static inline struct probe_cache *probe_cache__new(const char *tgt __maybe_unused, struct nsinfo *nsi __maybe_unused) { diff --git a/tools/perf/util/probe-finder.c b/tools/perf/util/probe-finder.c index c37fbef1711d..c202027716d0 100644 --- a/tools/perf/util/probe-finder.c +++ b/tools/perf/util/probe-finder.c @@ -294,7 +294,7 @@ static int convert_variable_location(Dwarf_Die *vr_die, Dwarf_Addr addr, static int convert_variable_type(Dwarf_Die *vr_die, struct probe_trace_arg *tvar, - const char *cast) + const char *cast, bool user_access) { struct probe_trace_arg_ref **ref_ptr = &tvar->ref; Dwarf_Die type; @@ -334,7 +334,8 @@ static int convert_variable_type(Dwarf_Die *vr_die, pr_debug("%s type is %s.\n", dwarf_diename(vr_die), dwarf_diename(&type)); - if (cast && strcmp(cast, "string") == 0) { /* String type */ + if (cast && (!strcmp(cast, "string") || !strcmp(cast, "ustring"))) { + /* String type */ ret = dwarf_tag(&type); if (ret != DW_TAG_pointer_type && ret != DW_TAG_array_type) { @@ -357,6 +358,7 @@ static int convert_variable_type(Dwarf_Die *vr_die, pr_warning("Out of memory error\n"); return -ENOMEM; } + (*ref_ptr)->user_access = user_access; } if (!die_compare_name(&type, "char") && !die_compare_name(&type, "unsigned char")) { @@ -411,7 +413,7 @@ static int convert_variable_type(Dwarf_Die *vr_die, static int convert_variable_fields(Dwarf_Die *vr_die, const char *varname, struct perf_probe_arg_field *field, struct probe_trace_arg_ref **ref_ptr, - Dwarf_Die *die_mem) + Dwarf_Die *die_mem, bool user_access) { struct probe_trace_arg_ref *ref = *ref_ptr; Dwarf_Die type; @@ -448,6 +450,7 @@ static int convert_variable_fields(Dwarf_Die *vr_die, const char *varname, *ref_ptr = ref; } ref->offset += dwarf_bytesize(&type) * field->index; + ref->user_access = user_access; goto next; } else if (tag == DW_TAG_pointer_type) { /* Check the pointer and dereference */ @@ -519,17 +522,18 @@ static int convert_variable_fields(Dwarf_Die *vr_die, const char *varname, } } ref->offset += (long)offs; + ref->user_access = user_access; /* If this member is unnamed, we need to reuse this field */ if (!dwarf_diename(die_mem)) return convert_variable_fields(die_mem, varname, field, - &ref, die_mem); + &ref, die_mem, user_access); next: /* Converting next field */ if (field->next) return convert_variable_fields(die_mem, field->name, - field->next, &ref, die_mem); + field->next, &ref, die_mem, user_access); else return 0; } @@ -555,11 +559,12 @@ static int convert_variable(Dwarf_Die *vr_die, struct probe_finder *pf) else if (ret == 0 && pf->pvar->field) { ret = convert_variable_fields(vr_die, pf->pvar->var, pf->pvar->field, &pf->tvar->ref, - &die_mem); + &die_mem, pf->pvar->user_access); vr_die = &die_mem; } if (ret == 0) - ret = convert_variable_type(vr_die, pf->tvar, pf->pvar->type); + ret = convert_variable_type(vr_die, pf->tvar, pf->pvar->type, + pf->pvar->user_access); /* *expr will be cached in libdw. Don't free it. */ return ret; }
Em Fri, May 10, 2019 at 12:12:49AM +0900, Masami Hiramatsu escreveu: > Hi, > > Here is the v8 series of probe-event to support user-space access. > Previous version is here. > > https://lkml.kernel.org/r/155732230159.12756.15040196512285621636.stgit@devnote2 > > In this version, I fixed some typos/style issues and renamed fields > according to Ingo's comment, and added Ack from Steve. > > Also this version is rebased on the latest -tip/master tree. Ingo, since this touches 'perf probe' and Steven already provided an Acked-by, if you're ok with it I can process these, testing the 'perf probe' changes and then ship it to you in my next pull req, ok? - Arnaldo > Changes in v8: > [2/6] Fix style issues and typos according to Ingo's comment. > [3/6] Fix style issues according to Ingo's comment. > [6/6] Fix a typo and rename user field to user_access field. > > > In summary, strncpy_from_user() should work as below > > - strncpy_from_user() can access user memory with set_fs(USER_DS) > in task context > > - strncpy_from_user() can access kernel memory with set_fs(KERNEL_DS) > in task context (e.g. devtmpfsd and init) > > - strncpy_from_user() can access user/kernel memory (depends on DS) > in IRQ context if pagefault is disabled. (both verified) > > Note that this changes the warning behavior when > CONFIG_DEBUG_ATOMIC_SLEEP=y, it still warns when > __copy_from_user_inatomic() is called in IRQ context, but don't > warn if pagefault is disabled because it will not sleep in > atomic. > > ==== > Kprobe event user-space memory access features: > > For user-space access extension, this series adds 2 features, > "ustring" type and user-space dereference syntax. "ustring" is > used for recording a null-terminated string in user-space from > kprobe events. > > "ustring" type is easy, it is able to use instead of "string" > type, so if you want to record a user-space string via > "__user char *", you can use ustring type instead of string. > For example, > > echo 'p do_sys_open path=+0($arg2):ustring' >> kprobe_events > > will record the path string from user-space. > > The user-space dereference syntax is also simple. Thi just > adds 'u' prefix before an offset value. > > +|-u<OFFSET>(<FETCHARG>) > > e.g. +u8(%ax), +u0(+0(%si)) > > This is more generic. If you want to refer the variable in user- > space from its address or access a field in data structure in > user-space, you need to use this. > > For example, if you probe do_sched_setscheduler(pid, policy, > param) and record param->sched_priority, you can add new > probe as below; > > p do_sched_setscheduler priority=+u0($arg3) > > Actually, with this feature, "ustring" type is not absolutely > necessary, because these are same meanings. > > +0($arg2):ustring == +u0($arg2):string > > Note that kprobe event provides these methods, but it doesn't > change it from kernel to user automatically because we do not > know whether the given address is in userspace or kernel on > some arch. > > > Thank you, > > --- > > Masami Hiramatsu (6): > x86/uaccess: Allow access_ok() in irq context if pagefault_disabled > uaccess: Add non-pagefault user-space read functions > tracing/probe: Add ustring type for user-space string > tracing/probe: Support user-space dereference > selftests/ftrace: Add user-memory access syntax testcase > perf-probe: Add user memory access attribute support > > > Documentation/trace/kprobetrace.rst | 28 ++++- > Documentation/trace/uprobetracer.rst | 10 +- > arch/x86/include/asm/uaccess.h | 4 - > include/linux/uaccess.h | 19 +++ > kernel/trace/trace.c | 7 + > kernel/trace/trace_kprobe.c | 37 ++++++ > kernel/trace/trace_probe.c | 37 +++++- > kernel/trace/trace_probe.h | 3 > kernel/trace/trace_probe_tmpl.h | 37 +++++- > kernel/trace/trace_uprobe.c | 19 +++ > mm/maccess.c | 122 +++++++++++++++++++- > tools/perf/Documentation/perf-probe.txt | 3 > tools/perf/util/probe-event.c | 11 ++ > tools/perf/util/probe-event.h | 2 > tools/perf/util/probe-file.c | 7 + > tools/perf/util/probe-file.h | 1 > tools/perf/util/probe-finder.c | 19 ++- > .../ftrace/test.d/kprobe/kprobe_args_user.tc | 32 +++++ > 18 files changed, 357 insertions(+), 41 deletions(-) > create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_user.tc > > -- > Masami Hiramatsu (Linaro) <mhiramat@kernel.org> -- - Arnaldo
On Mon, 13 May 2019 15:38:24 -0300 Arnaldo Carvalho de Melo <acme@kernel.org> wrote: > Em Fri, May 10, 2019 at 12:12:49AM +0900, Masami Hiramatsu escreveu: > > Hi, > > > > Here is the v8 series of probe-event to support user-space access. > > Previous version is here. > > > > https://lkml.kernel.org/r/155732230159.12756.15040196512285621636.stgit@devnote2 > > > > In this version, I fixed some typos/style issues and renamed fields > > according to Ingo's comment, and added Ack from Steve. > > > > Also this version is rebased on the latest -tip/master tree. > > Ingo, since this touches 'perf probe' and Steven already provided an > Acked-by, if you're ok with it I can process these, testing the 'perf > probe' changes and then ship it to you in my next pull req, ok? Thanks Arnaldo! For the perf probe enhancement, it should work on the kernel which doesn't support 'u' prefix. :) Thank you, > > - Arnaldo > > > Changes in v8: > > [2/6] Fix style issues and typos according to Ingo's comment. > > [3/6] Fix style issues according to Ingo's comment. > > [6/6] Fix a typo and rename user field to user_access field. > > > > > > In summary, strncpy_from_user() should work as below > > > > - strncpy_from_user() can access user memory with set_fs(USER_DS) > > in task context > > > > - strncpy_from_user() can access kernel memory with set_fs(KERNEL_DS) > > in task context (e.g. devtmpfsd and init) > > > > - strncpy_from_user() can access user/kernel memory (depends on DS) > > in IRQ context if pagefault is disabled. (both verified) > > > > Note that this changes the warning behavior when > > CONFIG_DEBUG_ATOMIC_SLEEP=y, it still warns when > > __copy_from_user_inatomic() is called in IRQ context, but don't > > warn if pagefault is disabled because it will not sleep in > > atomic. > > > > ==== > > Kprobe event user-space memory access features: > > > > For user-space access extension, this series adds 2 features, > > "ustring" type and user-space dereference syntax. "ustring" is > > used for recording a null-terminated string in user-space from > > kprobe events. > > > > "ustring" type is easy, it is able to use instead of "string" > > type, so if you want to record a user-space string via > > "__user char *", you can use ustring type instead of string. > > For example, > > > > echo 'p do_sys_open path=+0($arg2):ustring' >> kprobe_events > > > > will record the path string from user-space. > > > > The user-space dereference syntax is also simple. Thi just > > adds 'u' prefix before an offset value. > > > > +|-u<OFFSET>(<FETCHARG>) > > > > e.g. +u8(%ax), +u0(+0(%si)) > > > > This is more generic. If you want to refer the variable in user- > > space from its address or access a field in data structure in > > user-space, you need to use this. > > > > For example, if you probe do_sched_setscheduler(pid, policy, > > param) and record param->sched_priority, you can add new > > probe as below; > > > > p do_sched_setscheduler priority=+u0($arg3) > > > > Actually, with this feature, "ustring" type is not absolutely > > necessary, because these are same meanings. > > > > +0($arg2):ustring == +u0($arg2):string > > > > Note that kprobe event provides these methods, but it doesn't > > change it from kernel to user automatically because we do not > > know whether the given address is in userspace or kernel on > > some arch. > > > > > > Thank you, > > > > --- > > > > Masami Hiramatsu (6): > > x86/uaccess: Allow access_ok() in irq context if pagefault_disabled > > uaccess: Add non-pagefault user-space read functions > > tracing/probe: Add ustring type for user-space string > > tracing/probe: Support user-space dereference > > selftests/ftrace: Add user-memory access syntax testcase > > perf-probe: Add user memory access attribute support > > > > > > Documentation/trace/kprobetrace.rst | 28 ++++- > > Documentation/trace/uprobetracer.rst | 10 +- > > arch/x86/include/asm/uaccess.h | 4 - > > include/linux/uaccess.h | 19 +++ > > kernel/trace/trace.c | 7 + > > kernel/trace/trace_kprobe.c | 37 ++++++ > > kernel/trace/trace_probe.c | 37 +++++- > > kernel/trace/trace_probe.h | 3 > > kernel/trace/trace_probe_tmpl.h | 37 +++++- > > kernel/trace/trace_uprobe.c | 19 +++ > > mm/maccess.c | 122 +++++++++++++++++++- > > tools/perf/Documentation/perf-probe.txt | 3 > > tools/perf/util/probe-event.c | 11 ++ > > tools/perf/util/probe-event.h | 2 > > tools/perf/util/probe-file.c | 7 + > > tools/perf/util/probe-file.h | 1 > > tools/perf/util/probe-finder.c | 19 ++- > > .../ftrace/test.d/kprobe/kprobe_args_user.tc | 32 +++++ > > 18 files changed, 357 insertions(+), 41 deletions(-) > > create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_user.tc > > > > -- > > Masami Hiramatsu (Linaro) <mhiramat@kernel.org> > > -- > > - Arnaldo -- Masami Hiramatsu <mhiramat@kernel.org>
* Masami Hiramatsu <mhiramat@kernel.org> wrote: > +/* Return the length of string -- including null terminal byte */ > +static nokprobe_inline int > +fetch_store_strlen_user(unsigned long addr) > +{ > + return strnlen_unsafe_user((__force const void __user *)addr, > + MAX_STRING_SIZE); Pointless line break that doesn't improve readability. > +/* > + * Fetch a null-terminated string from user. Caller MUST set *(u32 *)buf > + * with max length and relative data location. > + */ > +static nokprobe_inline int > +fetch_store_string_user(unsigned long addr, void *dest, void *base) > +{ > + const void __user *uaddr = (__force const void __user *)addr; > + int maxlen = get_loc_len(*(u32 *)dest); > + u8 *dst = get_loc_data(dest, base); > + long ret; > + > + if (unlikely(!maxlen)) > + return -ENOMEM; > + ret = strncpy_from_unsafe_user(dst, uaddr, maxlen); > + > + if (ret >= 0) > + *(u32 *)dest = make_data_loc(ret, (void *)dst - base); > + > return ret; Firstly, why is there a 'dest' and a 'dst' variable name as well - the two are very similar and the difference not explained at all. Secondly, a style nit: if you group statements then please group statements based on the usual logic - which is the group them by the flow of logic. In the above case you grouped the 'maxlen' check with the strncpy_from_unsafe_user() call, while the grouping should be the other way around: if (unlikely(!maxlen)) return -ENOMEM; ret = strncpy_from_unsafe_user(dst, uaddr, maxlen); if (ret >= 0) *(u32 *)dest = make_data_loc(ret, (void *)dst - base); return ret; Third, hiding the get_loc_data() call within variable initialization is bad style - we usually only put 'trivial' (constant) initializations there. Fourth, 'dst' is independent of 'maxlen', so it should probably calculated *after* maxlen. I.e. the whole sequence should be: maxlen = get_loc_len(*(u32 *)dest); if (unlikely(!maxlen)) return -ENOMEM; dst = get_loc_data(dest, base); ret = strncpy_from_unsafe_user(dst, uaddr, maxlen); if (ret >= 0) *(u32 *)dest = make_data_loc(ret, (void *)dst - base); return ret; Fifth, we don't actually dereference 'dst', do we? So the whole type casting to 'void *' could be avoided by declaring 'dst' (or whatever its new, clearer name is) not as u8 *, but as void *. I.e. these are five problems in a short sequence of code, which it sad to see in a v8 submission. :-/ Please review the other patches and the whole code base for similar mishaps and small details as well. Thanks, Ingo
On Tue, 14 May 2019 09:24:26 +0200 Ingo Molnar <mingo@kernel.org> wrote: > > * Masami Hiramatsu <mhiramat@kernel.org> wrote: > > > +/* Return the length of string -- including null terminal byte */ > > +static nokprobe_inline int > > +fetch_store_strlen_user(unsigned long addr) > > +{ > > + return strnlen_unsafe_user((__force const void __user *)addr, > > + MAX_STRING_SIZE); > > Pointless line break that doesn't improve readability. OK. > > > +/* > > + * Fetch a null-terminated string from user. Caller MUST set *(u32 *)buf > > + * with max length and relative data location. > > + */ > > +static nokprobe_inline int > > +fetch_store_string_user(unsigned long addr, void *dest, void *base) > > +{ > > + const void __user *uaddr = (__force const void __user *)addr; > > + int maxlen = get_loc_len(*(u32 *)dest); > > + u8 *dst = get_loc_data(dest, base); > > + long ret; > > + > > + if (unlikely(!maxlen)) > > + return -ENOMEM; > > + ret = strncpy_from_unsafe_user(dst, uaddr, maxlen); > > + > > + if (ret >= 0) > > + *(u32 *)dest = make_data_loc(ret, (void *)dst - base); > > + > > return ret; > > Firstly, why is there a 'dest' and a 'dst' variable name as well - the > two are very similar and the difference not explained at all. Agreed. My bad habit, maybe '__dest' would better. > Secondly, a style nit: if you group statements then please group > statements based on the usual logic - which is the group them by the flow > of logic. In the above case you grouped the 'maxlen' check with the > strncpy_from_unsafe_user() call, while the grouping should be the other > way around: > > if (unlikely(!maxlen)) > return -ENOMEM; > > ret = strncpy_from_unsafe_user(dst, uaddr, maxlen); > if (ret >= 0) > *(u32 *)dest = make_data_loc(ret, (void *)dst - base); > > return ret; OK. > > Third, hiding the get_loc_data() call within variable initialization is > bad style - we usually only put 'trivial' (constant) initializations > there. Hmm, it just decodes the location address from offset and start address. Shouldn't it a trivial? > Fourth, 'dst' is independent of 'maxlen', so it should probably > calculated *after* maxlen. Ah, OK. I see what you pointed. > > I.e. the whole sequence should be: > > > maxlen = get_loc_len(*(u32 *)dest); > if (unlikely(!maxlen)) > return -ENOMEM; > > dst = get_loc_data(dest, base); OK, in this case we can skip this conversion if maxlen == 0. > > ret = strncpy_from_unsafe_user(dst, uaddr, maxlen); > if (ret >= 0) > *(u32 *)dest = make_data_loc(ret, (void *)dst - base); > > return ret; > > Fifth, we don't actually dereference 'dst', do we? So the whole type > casting to 'void *' could be avoided by declaring 'dst' (or whatever its > new, clearer name is) not as u8 *, but as void *. OK, I'll use void* for that. > > I.e. these are five problems in a short sequence of code, which it sad to > see in a v8 submission. :-/ > > Please review the other patches and the whole code base for similar > mishaps and small details as well. OK, I'll update others too. Thank you, > > Thanks, > > Ingo -- Masami Hiramatsu <mhiramat@kernel.org>
Em Tue, May 14, 2019 at 02:02:53PM +0900, Masami Hiramatsu escreveu:
> On Mon, 13 May 2019 15:38:24 -0300 Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> > Em Fri, May 10, 2019 at 12:12:49AM +0900, Masami Hiramatsu escreveu:
> > > In this version, I fixed some typos/style issues and renamed fields
> > > according to Ingo's comment, and added Ack from Steve.
> > >
> > > Also this version is rebased on the latest -tip/master tree.
> >
> > Ingo, since this touches 'perf probe' and Steven already provided an
> > Acked-by, if you're ok with it I can process these, testing the 'perf
> > probe' changes and then ship it to you in my next pull req, ok?
>
> Thanks Arnaldo! For the perf probe enhancement, it should work on
> the kernel which doesn't support 'u' prefix. :)
Sure, I'll test that case too, i.e. old kernel + new tool, old tool +
new kernel, new tool + new kernel, as usual.
- Arnaldo