* [RFC PATCH 0/2] tracing/user_events: Remote write ABI @ 2022-10-27 22:40 Beau Belgrave 2022-10-27 22:40 ` [RFC PATCH 1/2] tracing/user_events: Use remote writes for event enablement Beau Belgrave ` (3 more replies) 0 siblings, 4 replies; 31+ messages in thread
From: Beau Belgrave @ 2022-10-27 22:40 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, dcook, alanau
Cc: linux-trace-devel, linux-kernel

As part of the discussions around aligning user_events with user space tracers, it was determined that user programs should register a 32-bit value to set or clear a bit when an event becomes enabled. Currently a shared page is being used that requires mmap().

In this new model, event registration from user programs specifies 2 new values. The first is the address to update when the event is either enabled or disabled. The second is the bit to set/clear to reflect the event being enabled. This allows a local 32-bit value in user programs to support both kernel and user tracers. As an example, setting bit 31 for kernel tracers when the event becomes enabled allows user tracers to use the other bits for ref counts or other flags. The kernel side updates the bit atomically; user programs need to update these values atomically as well.

User-provided addresses must be aligned on a 32-bit boundary; this allows for single-page checking and prevents odd behaviors such as a 32-bit value straddling 2 pages instead of a single page.

When page faults are encountered, they are handled asynchronously via a workqueue. If the page faults back in, the write update is attempted again. If the page cannot be faulted in, we log and wait until the next time the event is enabled/disabled. This prevents possible infinite loops resulting from bad user processes unmapping or changing protection values after registering the address.
NOTE: User programs that wish to have the enable bit shared across forks either need to use a MAP_SHARED allocated address or register a new address and file descriptor. If MAP_SHARED cannot be used or new registrations cannot be done, then it's allowable to use MAP_PRIVATE as long as the forked children never update the page themselves. Once the page has been updated, the page from the parent will be copied over to the child. This new copy-on-write page will not receive updates from the kernel until another registration has been performed with this new address.

Beau Belgrave (2):
  tracing/user_events: Use remote writes for event enablement
  tracing/user_events: Fixup enable faults asyncly

 include/linux/user_events.h      |  10 +-
 kernel/trace/trace_events_user.c | 396 ++++++++++++++++++++-----------
 2 files changed, 270 insertions(+), 136 deletions(-)

base-commit: 23758867219c8d84c8363316e6dd2f9fd7ae3049
--
2.25.1

^ permalink raw reply [flat|nested] 31+ messages in thread
* [RFC PATCH 1/2] tracing/user_events: Use remote writes for event enablement 2022-10-27 22:40 [RFC PATCH 0/2] tracing/user_events: Remote write ABI Beau Belgrave @ 2022-10-27 22:40 ` Beau Belgrave 2022-10-29 14:44 ` Mathieu Desnoyers 2022-10-31 14:47 ` Masami Hiramatsu 2022-10-27 22:40 ` [RFC PATCH 2/2] tracing/user_events: Fixup enable faults asyncly Beau Belgrave ` (2 subsequent siblings) 3 siblings, 2 replies; 31+ messages in thread
From: Beau Belgrave @ 2022-10-27 22:40 UTC (permalink / raw)
To: rostedt, mhiramat, mathieu.desnoyers, dcook, alanau
Cc: linux-trace-devel, linux-kernel

As part of the discussions around aligning user_events with user space tracers, it was determined that user programs should register a 32-bit value to set or clear a bit when an event becomes enabled. Currently a shared page is being used that requires mmap(). Remove the shared page implementation and move to a user-registered address implementation.

In this new model, event registration from user programs specifies 2 new values. The first is the address to update when the event is either enabled or disabled. The second is the bit to set/clear to reflect the event being enabled. This allows a local 32-bit value in user programs to support both kernel and user tracers. As an example, setting bit 31 for kernel tracers when the event becomes enabled allows user tracers to use the other bits for ref counts or other flags. The kernel side updates the bit atomically; user programs need to update these values atomically as well.

User-provided addresses must be aligned on a 32-bit boundary; this allows for single-page checking and prevents odd behaviors such as a 32-bit value straddling 2 pages instead of a single page.

Currently page faults are only logged; future patches will handle these.

NOTE: User programs that wish to have the enable bit shared across forks either need to use a MAP_SHARED allocated address or register a new address and file descriptor.
If MAP_SHARED cannot be used or new registrations cannot be done, then it's allowable to use MAP_PRIVATE as long as the forked children never update the page themselves. Once the page has been updated, the page from the parent will be copied over to the child. This new copy-on-write page will not receive updates from the kernel until another registration has been performed with this new address. Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> --- include/linux/user_events.h | 10 +- kernel/trace/trace_events_user.c | 279 ++++++++++++++++--------------- 2 files changed, 153 insertions(+), 136 deletions(-) diff --git a/include/linux/user_events.h b/include/linux/user_events.h index 592a3fbed98e..4c3bd16395a9 100644 --- a/include/linux/user_events.h +++ b/include/linux/user_events.h @@ -33,12 +33,16 @@ struct user_reg { /* Input: Size of the user_reg structure being used */ __u32 size; + /* Input: Flags/common settings */ + __u32 enable_bit : 5, /* Bit in enable address to use (0-31) */ + __reserved : 27; + + /* Input: Address to update when enabled */ + __u64 enable_addr; + /* Input: Pointer to string with event name, description and flags */ __u64 name_args; - /* Output: Bitwise index of the event within the status page */ - __u32 status_bit; - /* Output: Index of the event to use when writing data */ __u32 write_index; } __attribute__((__packed__)); diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c index ae78c2d53c8a..633f24c2a1ac 100644 --- a/kernel/trace/trace_events_user.c +++ b/kernel/trace/trace_events_user.c @@ -19,6 +19,9 @@ #include <linux/tracefs.h> #include <linux/types.h> #include <linux/uaccess.h> +#include <linux/sched/mm.h> +#include <linux/mmap_lock.h> +#include <linux/highmem.h> /* Reminder to move to uapi when everything works */ #ifdef CONFIG_COMPILE_TEST #include <linux/user_events.h> @@ -34,34 +37,11 @@ #define FIELD_DEPTH_NAME 1 #define 
FIELD_DEPTH_SIZE 2 -/* - * Limits how many trace_event calls user processes can create: - * Must be a power of two of PAGE_SIZE. - */ -#define MAX_PAGE_ORDER 0 -#define MAX_PAGES (1 << MAX_PAGE_ORDER) -#define MAX_BYTES (MAX_PAGES * PAGE_SIZE) -#define MAX_EVENTS (MAX_BYTES * 8) - /* Limit how long of an event name plus args within the subsystem. */ #define MAX_EVENT_DESC 512 #define EVENT_NAME(user_event) ((user_event)->tracepoint.name) #define MAX_FIELD_ARRAY_SIZE 1024 -/* - * The MAP_STATUS_* macros are used for taking a index and determining the - * appropriate byte and the bit in the byte to set/reset for an event. - * - * The lower 3 bits of the index decide which bit to set. - * The remaining upper bits of the index decide which byte to use for the bit. - * - * This is used when an event has a probe attached/removed to reflect live - * status of the event wanting tracing or not to user-programs via shared - * memory maps. - */ -#define MAP_STATUS_BYTE(index) ((index) >> 3) -#define MAP_STATUS_MASK(index) BIT((index) & 7) - /* * Internal bits (kernel side only) to keep track of connected probes: * These are used when status is requested in text form about an event. These @@ -75,25 +55,37 @@ #define EVENT_STATUS_OTHER BIT(7) /* - * Stores the pages, tables, and locks for a group of events. - * Each logical grouping of events has its own group, with a - * matching page for status checks within user programs. This - * allows for isolation of events to user programs by various - * means. + * Stores the system name, tables, and locks for a group of events. This + * allows isolation for events by various means. 
*/ struct user_event_group { - struct page *pages; - char *register_page_data; char *system_name; struct hlist_node node; struct mutex reg_mutex; DECLARE_HASHTABLE(register_table, 8); - DECLARE_BITMAP(page_bitmap, MAX_EVENTS); }; /* Group for init_user_ns mapping, top-most group */ static struct user_event_group *init_group; +/* + * Describes where to change a bit when an event becomes + * enabled/disabled. These are chained together to enable + * many processes being notified when an event changes. These + * have a lifetime tied to the data files that are used to + * register them. When these go away the ref count to the mm_struct + * is decremented to ensure mm_struct lifetime last as long as + * required for the enable bit set/clear. + */ +struct user_event_enabler { + struct list_head link; + struct mm_struct *mm; + struct file *file; + unsigned long enable_addr; + unsigned int enable_bit: 5, + __reserved: 27; +}; + /* * Stores per-event properties, as users register events * within a file a user_event might be created if it does not @@ -110,8 +102,8 @@ struct user_event { struct hlist_node node; struct list_head fields; struct list_head validators; + struct list_head enablers; refcount_t refcnt; - int index; int flags; int min_size; char status; @@ -155,28 +147,8 @@ static u32 user_event_key(char *name) return jhash(name, strlen(name), 0); } -static void set_page_reservations(char *pages, bool set) -{ - int page; - - for (page = 0; page < MAX_PAGES; ++page) { - void *addr = pages + (PAGE_SIZE * page); - - if (set) - SetPageReserved(virt_to_page(addr)); - else - ClearPageReserved(virt_to_page(addr)); - } -} - static void user_event_group_destroy(struct user_event_group *group) { - if (group->register_page_data) - set_page_reservations(group->register_page_data, false); - - if (group->pages) - __free_pages(group->pages, MAX_PAGE_ORDER); - kfree(group->system_name); kfree(group); } @@ -247,19 +219,6 @@ static struct user_event_group if (!group->system_name) goto 
error; - group->pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, MAX_PAGE_ORDER); - - if (!group->pages) - goto error; - - group->register_page_data = page_address(group->pages); - - set_page_reservations(group->register_page_data, true); - - /* Zero all bits beside 0 (which is reserved for failures) */ - bitmap_zero(group->page_bitmap, MAX_EVENTS); - set_bit(0, group->page_bitmap); - mutex_init(&group->reg_mutex); hash_init(group->register_table); @@ -271,20 +230,107 @@ static struct user_event_group return NULL; }; -static __always_inline -void user_event_register_set(struct user_event *user) +static void user_event_enabler_destroy(struct user_event_enabler *enabler) { - int i = user->index; + mmdrop(enabler->mm); + kfree(enabler); +} + +static void user_event_enabler_remove(struct file *file, + struct user_event *user) +{ + struct user_event_enabler *enabler, *next; + struct list_head *head = &user->enablers; + + /* Prevent racing with status changes and new events */ + mutex_lock(&event_mutex); + + list_for_each_entry_safe(enabler, next, head, link) { + if (enabler->file != file) + continue; + + list_del(&enabler->link); + user_event_enabler_destroy(enabler); + } + + mutex_unlock(&event_mutex); +} + +static void user_event_enabler_write(struct user_event_enabler *enabler, + struct user_event *user) +{ + struct mm_struct *mm = enabler->mm; + unsigned long uaddr = enabler->enable_addr; + unsigned long *ptr; + struct page *page; + void *kaddr; + int ret; + + mmap_read_lock(mm); + + ret = pin_user_pages_remote(mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT, + &page, NULL, NULL); + + mmap_read_unlock(mm); + + if (ret <= 0) { + pr_warn("user_events: Enable write failed\n"); + return; + } + + kaddr = kmap_local_page(page); + ptr = kaddr + (uaddr & ~PAGE_MASK); + + if (user->status) + set_bit(enabler->enable_bit, ptr); + else + clear_bit(enabler->enable_bit, ptr); - user->group->register_page_data[MAP_STATUS_BYTE(i)] |= MAP_STATUS_MASK(i); + kunmap_local(kaddr); + 
unpin_user_pages_dirty_lock(&page, 1, true); } -static __always_inline -void user_event_register_clear(struct user_event *user) +static void user_event_enabler_update(struct user_event *user) { - int i = user->index; + struct list_head *head = &user->enablers; + struct user_event_enabler *enabler; - user->group->register_page_data[MAP_STATUS_BYTE(i)] &= ~MAP_STATUS_MASK(i); + list_for_each_entry(enabler, head, link) + user_event_enabler_write(enabler, user); +} + +static struct user_event_enabler +*user_event_enabler_create(struct file *file, struct user_reg *reg, + struct user_event *user) +{ + struct user_event_enabler *enabler; + + enabler = kzalloc(sizeof(*enabler), GFP_KERNEL); + + if (!enabler) + return NULL; + + /* + * This is grabbed for accounting purposes. This is to ensure if a + * process exits before the file is released a valid memory descriptor + * will exist for the enabler. + */ + mmgrab(current->mm); + + enabler->mm = current->mm; + enabler->file = file; + enabler->enable_addr = (unsigned long)reg->enable_addr; + enabler->enable_bit = reg->enable_bit; + + /* Prevents state changes from racing with new enablers */ + mutex_lock(&event_mutex); + + list_add(&enabler->link, &user->enablers); + user_event_enabler_write(enabler, user); + + mutex_unlock(&event_mutex); + + return enabler; } static __always_inline __must_check @@ -829,9 +875,6 @@ static int destroy_user_event(struct user_event *user) return ret; dyn_event_remove(&user->devent); - - user_event_register_clear(user); - clear_bit(user->index, user->group->page_bitmap); hash_del(&user->node); user_event_destroy_validators(user); @@ -977,9 +1020,9 @@ static void user_event_perf(struct user_event *user, struct iov_iter *i, #endif /* - * Update the register page that is shared between user processes. + * Update the enabled bit among all user processes. 
*/ -static void update_reg_page_for(struct user_event *user) +static void update_enable_bit_for(struct user_event *user) { struct tracepoint *tp = &user->tracepoint; char status = 0; @@ -1010,12 +1053,9 @@ static void update_reg_page_for(struct user_event *user) rcu_read_unlock_sched(); } - if (status) - user_event_register_set(user); - else - user_event_register_clear(user); - user->status = status; + + user_event_enabler_update(user); } /* @@ -1072,10 +1112,10 @@ static int user_event_reg(struct trace_event_call *call, return ret; inc: refcount_inc(&user->refcnt); - update_reg_page_for(user); + update_enable_bit_for(user); return 0; dec: - update_reg_page_for(user); + update_enable_bit_for(user); refcount_dec(&user->refcnt); return 0; } @@ -1269,7 +1309,6 @@ static int user_event_parse(struct user_event_group *group, char *name, struct user_event **newuser) { int ret; - int index; u32 key; struct user_event *user; @@ -1288,11 +1327,6 @@ static int user_event_parse(struct user_event_group *group, char *name, return 0; } - index = find_first_zero_bit(group->page_bitmap, MAX_EVENTS); - - if (index == MAX_EVENTS) - return -EMFILE; - user = kzalloc(sizeof(*user), GFP_KERNEL); if (!user) @@ -1301,6 +1335,7 @@ static int user_event_parse(struct user_event_group *group, char *name, INIT_LIST_HEAD(&user->class.fields); INIT_LIST_HEAD(&user->fields); INIT_LIST_HEAD(&user->validators); + INIT_LIST_HEAD(&user->enablers); user->group = group; user->tracepoint.name = name; @@ -1338,14 +1373,11 @@ static int user_event_parse(struct user_event_group *group, char *name, if (ret) goto put_user_lock; - user->index = index; - /* Ensure we track self ref and caller ref (2) */ refcount_set(&user->refcnt, 2); dyn_event_init(&user->devent, &user_event_dops); dyn_event_add(&user->devent, &user->call); - set_bit(user->index, group->page_bitmap); hash_add(group->register_table, &user->node, key); mutex_unlock(&event_mutex); @@ -1561,6 +1593,14 @@ static long user_reg_get(struct user_reg 
__user *ureg, struct user_reg *kreg) if (ret) return ret; + /* Ensure natural alignment and sanity check on max bit */ + if (kreg->enable_addr % sizeof(__u32) || kreg->enable_bit > 31) + return -EINVAL; + + /* Ensure accessible */ + if (!access_ok((const void __user *)kreg->enable_addr, sizeof(__u32))) + return -EFAULT; + kreg->size = size; return 0; @@ -1570,11 +1610,12 @@ static long user_reg_get(struct user_reg __user *ureg, struct user_reg *kreg) * Registers a user_event on behalf of a user process. */ static long user_events_ioctl_reg(struct user_event_file_info *info, - unsigned long uarg) + struct file *file, unsigned long uarg) { struct user_reg __user *ureg = (struct user_reg __user *)uarg; struct user_reg reg; struct user_event *user; + struct user_event_enabler *enabler; char *name; long ret; @@ -1607,8 +1648,12 @@ static long user_events_ioctl_reg(struct user_event_file_info *info, if (ret < 0) return ret; + enabler = user_event_enabler_create(file, ®, user); + + if (!enabler) + return -ENOMEM; + put_user((u32)ret, &ureg->write_index); - put_user(user->index, &ureg->status_bit); return 0; } @@ -1651,7 +1696,7 @@ static long user_events_ioctl(struct file *file, unsigned int cmd, switch (cmd) { case DIAG_IOCSREG: mutex_lock(&group->reg_mutex); - ret = user_events_ioctl_reg(info, uarg); + ret = user_events_ioctl_reg(info, file, uarg); mutex_unlock(&group->reg_mutex); break; @@ -1700,8 +1745,10 @@ static int user_events_release(struct inode *node, struct file *file) for (i = 0; i < refs->count; ++i) { user = refs->events[i]; - if (user) + if (user) { + user_event_enabler_remove(file, user); refcount_dec(&user->refcnt); + } } out: file->private_data = NULL; @@ -1722,38 +1769,6 @@ static const struct file_operations user_data_fops = { .release = user_events_release, }; -static struct user_event_group *user_status_group(struct file *file) -{ - struct seq_file *m = file->private_data; - - if (!m) - return NULL; - - return m->private; -} - -/* - * Maps the 
shared page into the user process for checking if event is enabled. - */ -static int user_status_mmap(struct file *file, struct vm_area_struct *vma) -{ - char *pages; - struct user_event_group *group = user_status_group(file); - unsigned long size = vma->vm_end - vma->vm_start; - - if (size != MAX_BYTES) - return -EINVAL; - - if (!group) - return -EINVAL; - - pages = group->register_page_data; - - return remap_pfn_range(vma, vma->vm_start, - virt_to_phys(pages) >> PAGE_SHIFT, - size, vm_get_page_prot(VM_READ)); -} - static void *user_seq_start(struct seq_file *m, loff_t *pos) { if (*pos) @@ -1788,7 +1803,7 @@ static int user_seq_show(struct seq_file *m, void *p) status = user->status; flags = user->flags; - seq_printf(m, "%d:%s", user->index, EVENT_NAME(user)); + seq_printf(m, "%s", EVENT_NAME(user)); if (flags != 0 || status != 0) seq_puts(m, " #"); @@ -1813,7 +1828,6 @@ static int user_seq_show(struct seq_file *m, void *p) seq_puts(m, "\n"); seq_printf(m, "Active: %d\n", active); seq_printf(m, "Busy: %d\n", busy); - seq_printf(m, "Max: %ld\n", MAX_EVENTS); return 0; } @@ -1849,7 +1863,6 @@ static int user_status_open(struct inode *node, struct file *file) static const struct file_operations user_status_fops = { .open = user_status_open, - .mmap = user_status_mmap, .read = seq_read, .llseek = seq_lseek, .release = seq_release, -- 2.25.1 ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [RFC PATCH 1/2] tracing/user_events: Use remote writes for event enablement 2022-10-27 22:40 ` [RFC PATCH 1/2] tracing/user_events: Use remote writes for event enablement Beau Belgrave @ 2022-10-29 14:44 ` Mathieu Desnoyers 2022-10-31 16:38 ` Beau Belgrave 0 siblings, 1 reply; 31+ messages in thread
From: Mathieu Desnoyers @ 2022-10-29 14:44 UTC (permalink / raw)
To: Beau Belgrave, rostedt, mhiramat, dcook, alanau
Cc: linux-trace-devel, linux-kernel

On 2022-10-27 18:40, Beau Belgrave wrote:
[...]
> diff --git a/include/linux/user_events.h b/include/linux/user_events.h
> index 592a3fbed98e..4c3bd16395a9 100644
> --- a/include/linux/user_events.h
> +++ b/include/linux/user_events.h
> @@ -33,12 +33,16 @@ struct user_reg {
> /* Input: Size of the user_reg structure being used */
> __u32 size;
>
> + /* Input: Flags/common settings */
> + __u32 enable_bit : 5, /* Bit in enable address to use (0-31) */
> + __reserved : 27;

I'm always worried about using C/C++ bitfields in uapi, because some compilers (e.g. MS Windows Compiler) have different implementations of bitfields. See gcc(1):

-Wno-packed-bitfield-compat
-mms-bitfields / -mno-ms-bitfields

Thanks,

Mathieu

> +
> + /* Input: Address to update when enabled */
> + __u64 enable_addr;
> +
> /* Input: Pointer to string with event name, description and flags */
> __u64 name_args;
>
> - /* Output: Bitwise index of the event within the status page */
> - __u32 status_bit;
> -
> /* Output: Index of the event to use when writing data */
> __u32 write_index;
> } __attribute__((__packed__));
> diff --git a/kernel/trace/trace_events_user.c
> b/kernel/trace/trace_events_user.c

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH 1/2] tracing/user_events: Use remote writes for event enablement 2022-10-29 14:44 ` Mathieu Desnoyers @ 2022-10-31 16:38 ` Beau Belgrave 0 siblings, 0 replies; 31+ messages in thread From: Beau Belgrave @ 2022-10-31 16:38 UTC (permalink / raw) To: Mathieu Desnoyers Cc: rostedt, mhiramat, dcook, alanau, linux-trace-devel, linux-kernel On Sat, Oct 29, 2022 at 10:44:32AM -0400, Mathieu Desnoyers wrote: > On 2022-10-27 18:40, Beau Belgrave wrote: > > [...] > > diff --git a/include/linux/user_events.h b/include/linux/user_events.h > > index 592a3fbed98e..4c3bd16395a9 100644 > > --- a/include/linux/user_events.h > > +++ b/include/linux/user_events.h > > @@ -33,12 +33,16 @@ struct user_reg { > > /* Input: Size of the user_reg structure being used */ > > __u32 size; > > + /* Input: Flags/common settings */ > > + __u32 enable_bit : 5, /* Bit in enable address to use (0-31) */ > > + __reserved : 27; > > I'm always worried about using C/C++ bitfields in uapi, because some > compilers (e.g. MS Windows Compiler) have different implementation of the > bitfields. See > > gcc(1) > > -Wnopacked-bitfield-compat > -mms-bitfields / -mno-ms-bitfields > Sure, I'll change this __u8. > Thanks, > > Mathieu > > > + > > + /* Input: Address to update when enabled */ > > + __u64 enable_addr; > > + > > /* Input: Pointer to string with event name, description and flags */ > > __u64 name_args; > > - /* Output: Bitwise index of the event within the status page */ > > - __u32 status_bit; > > - > > /* Output: Index of the event to use when writing data */ > > __u32 write_index; > > } __attribute__((__packed__)); > > diff --git a/kernel/trace/trace_events_user.c > > b/kernel/trace/trace_events_user.c-- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH 1/2] tracing/user_events: Use remote writes for event enablement 2022-10-27 22:40 ` [RFC PATCH 1/2] tracing/user_events: Use remote writes for event enablement Beau Belgrave 2022-10-29 14:44 ` Mathieu Desnoyers @ 2022-10-31 14:47 ` Masami Hiramatsu 2022-10-31 16:46 ` Beau Belgrave 1 sibling, 1 reply; 31+ messages in thread From: Masami Hiramatsu @ 2022-10-31 14:47 UTC (permalink / raw) To: Beau Belgrave Cc: rostedt, mathieu.desnoyers, dcook, alanau, linux-trace-devel, linux-kernel Hi, I have some comments. On Thu, 27 Oct 2022 15:40:10 -0700 Beau Belgrave <beaub@linux.microsoft.com> wrote: [...] > @@ -1570,11 +1610,12 @@ static long user_reg_get(struct user_reg __user *ureg, struct user_reg *kreg) > * Registers a user_event on behalf of a user process. > */ > static long user_events_ioctl_reg(struct user_event_file_info *info, > - unsigned long uarg) > + struct file *file, unsigned long uarg) > { > struct user_reg __user *ureg = (struct user_reg __user *)uarg; > struct user_reg reg; > struct user_event *user; > + struct user_event_enabler *enabler; > char *name; > long ret; > > @@ -1607,8 +1648,12 @@ static long user_events_ioctl_reg(struct user_event_file_info *info, > if (ret < 0) > return ret; > > + enabler = user_event_enabler_create(file, ®, user); > + > + if (!enabler) Shouldn't we free the user_event if needed here? (I found the similar memory leak pattern in the above failure case of the user_events_ref_add().) > + return -ENOMEM; > + > put_user((u32)ret, &ureg->write_index); > - put_user(user->index, &ureg->status_bit); > > return 0; > } [...] > @@ -1849,7 +1863,6 @@ static int user_status_open(struct inode *node, struct file *file) > > static const struct file_operations user_status_fops = { > .open = user_status_open, > - .mmap = user_status_mmap, So, if this drops the mmap operation, can we drop the writable flag from the status tracefs file? static int create_user_tracefs(void) { [...] 
/* mmap with MAP_SHARED requires writable fd */ emmap = tracefs_create_file("user_events_status", TRACE_MODE_WRITE, NULL, NULL, &user_status_fops); Thank you, -- Masami Hiramatsu (Google) <mhiramat@kernel.org> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH 1/2] tracing/user_events: Use remote writes for event enablement 2022-10-31 14:47 ` Masami Hiramatsu @ 2022-10-31 16:46 ` Beau Belgrave 2022-10-31 23:55 ` Masami Hiramatsu 0 siblings, 1 reply; 31+ messages in thread
From: Beau Belgrave @ 2022-10-31 16:46 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: rostedt, mathieu.desnoyers, dcook, alanau, linux-trace-devel, linux-kernel

On Mon, Oct 31, 2022 at 11:47:03PM +0900, Masami Hiramatsu wrote:
> Hi,
>
> I have some comments.
>
> On Thu, 27 Oct 2022 15:40:10 -0700
> Beau Belgrave <beaub@linux.microsoft.com> wrote:
>
> [...]
> > @@ -1570,11 +1610,12 @@ static long user_reg_get(struct user_reg __user *ureg, struct user_reg *kreg)
> > * Registers a user_event on behalf of a user process.
> > */
> > static long user_events_ioctl_reg(struct user_event_file_info *info,
> > - unsigned long uarg)
> > + struct file *file, unsigned long uarg)
> > {
> > struct user_reg __user *ureg = (struct user_reg __user *)uarg;
> > struct user_reg reg;
> > struct user_event *user;
> > + struct user_event_enabler *enabler;
> > char *name;
> > long ret;
> >
> > @@ -1607,8 +1648,12 @@ static long user_events_ioctl_reg(struct user_event_file_info *info,
> > if (ret < 0)
> > return ret;
> >
> > + enabler = user_event_enabler_create(file, &reg, user);
> > +
> > + if (!enabler)
>
> Shouldn't we free the user_event if needed here?
> (I found the similar memory leak pattern in the above failure case
> of the user_events_ref_add().)
>

user_events are shared across the entire group. They cannot be cleaned up until all references are gone. This is true both in this case and in the user_events_ref_add() case.

The pattern is to register events in the group's hashtable, then add them to the local file ref array that is RCU protected. If the file ref cannot be allocated, etc. the refcount on user is decremented. If we cannot create an enabler, the refcount is still held until file release.
If the event has already been added to the local file ref array, it is returned to prevent another reference. > > + return -ENOMEM; > > + > > put_user((u32)ret, &ureg->write_index); > > - put_user(user->index, &ureg->status_bit); > > > > return 0; > > } > [...] > > @@ -1849,7 +1863,6 @@ static int user_status_open(struct inode *node, struct file *file) > > > > static const struct file_operations user_status_fops = { > > .open = user_status_open, > > - .mmap = user_status_mmap, > > So, if this drops the mmap operation, can we drop the writable flag from > the status tracefs file? > Good catch, yes I'll remove this. > static int create_user_tracefs(void) > { > [...] > /* mmap with MAP_SHARED requires writable fd */ > emmap = tracefs_create_file("user_events_status", TRACE_MODE_WRITE, > NULL, NULL, &user_status_fops); > > Thank you, > > -- > Masami Hiramatsu (Google) <mhiramat@kernel.org> Thanks, -Beau ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH 1/2] tracing/user_events: Use remote writes for event enablement 2022-10-31 16:46 ` Beau Belgrave @ 2022-10-31 23:55 ` Masami Hiramatsu 2022-11-01 16:45 ` Beau Belgrave 0 siblings, 1 reply; 31+ messages in thread From: Masami Hiramatsu @ 2022-10-31 23:55 UTC (permalink / raw) To: Beau Belgrave Cc: rostedt, mathieu.desnoyers, dcook, alanau, linux-trace-devel, linux-kernel On Mon, 31 Oct 2022 09:46:03 -0700 Beau Belgrave <beaub@linux.microsoft.com> wrote: > On Mon, Oct 31, 2022 at 11:47:03PM +0900, Masami Hiramatsu wrote: > > Hi, > > > > I have some comments. > > > > On Thu, 27 Oct 2022 15:40:10 -0700 > > Beau Belgrave <beaub@linux.microsoft.com> wrote: > > > > [...] > > > @@ -1570,11 +1610,12 @@ static long user_reg_get(struct user_reg __user *ureg, struct user_reg *kreg) > > > * Registers a user_event on behalf of a user process. > > > */ > > > static long user_events_ioctl_reg(struct user_event_file_info *info, > > > - unsigned long uarg) > > > + struct file *file, unsigned long uarg) > > > { > > > struct user_reg __user *ureg = (struct user_reg __user *)uarg; > > > struct user_reg reg; > > > struct user_event *user; > > > + struct user_event_enabler *enabler; > > > char *name; > > > long ret; > > > > > > @@ -1607,8 +1648,12 @@ static long user_events_ioctl_reg(struct user_event_file_info *info, > > > if (ret < 0) > > > return ret; > > > > > > + enabler = user_event_enabler_create(file, ®, user); > > > + > > > + if (!enabler) > > > > Shouldn't we free the user_event if needed here? > > (I found the similar memory leak pattern in the above failure case > > of the user_events_ref_add().) > > > > user_events are shared across the entire group. They cannot be cleaned > up until all references are gone. This is true both in this case and the > in the user_events_ref_add() case. > > The pattern is to register events in the group's hashtable, then add > them to the local file ref array that is RCU protected. If the file ref > cannot be allocated, etc. 
the refcount on user is decremented. If we > cannot create an enabler, the refcount is still held until file release. OK, when the ioctl returns, there should be 3 cases; - Return success, a new(existing) user_event added. - Return error, no new user_event added. - Return error, a new user_event added but enabler is not initialized. And in the last case, the new user_event will be released when user closes the file. Could you comment it here? > > If the event has already been added to the local file ref array, it is > returned to prevent another reference. I'm not sure this point. Did you mean returning an error to prevent registering the same event again? > > > > + return -ENOMEM; > > > + > > > put_user((u32)ret, &ureg->write_index); > > > - put_user(user->index, &ureg->status_bit); > > > > > > return 0; > > > } > > [...] > > > @@ -1849,7 +1863,6 @@ static int user_status_open(struct inode *node, struct file *file) > > > > > > static const struct file_operations user_status_fops = { > > > .open = user_status_open, > > > - .mmap = user_status_mmap, > > > > So, if this drops the mmap operation, can we drop the writable flag from > > the status tracefs file? > > > > Good catch, yes I'll remove this. Thanks! > > > static int create_user_tracefs(void) > > { > > [...] > > /* mmap with MAP_SHARED requires writable fd */ > > emmap = tracefs_create_file("user_events_status", TRACE_MODE_WRITE, > > NULL, NULL, &user_status_fops); > > > > Thank you, > > > > -- > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > Thanks, > -Beau -- Masami Hiramatsu (Google) <mhiramat@kernel.org> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH 1/2] tracing/user_events: Use remote writes for event enablement 2022-10-31 23:55 ` Masami Hiramatsu @ 2022-11-01 16:45 ` Beau Belgrave 0 siblings, 0 replies; 31+ messages in thread From: Beau Belgrave @ 2022-11-01 16:45 UTC (permalink / raw) To: Masami Hiramatsu Cc: rostedt, mathieu.desnoyers, dcook, alanau, linux-trace-devel, linux-kernel On Tue, Nov 01, 2022 at 08:55:01AM +0900, Masami Hiramatsu wrote: > On Mon, 31 Oct 2022 09:46:03 -0700 > Beau Belgrave <beaub@linux.microsoft.com> wrote: > > > On Mon, Oct 31, 2022 at 11:47:03PM +0900, Masami Hiramatsu wrote: > > > Hi, > > > > > > I have some comments. > > > > > > On Thu, 27 Oct 2022 15:40:10 -0700 > > > Beau Belgrave <beaub@linux.microsoft.com> wrote: > > > > > > [...] > > > > @@ -1570,11 +1610,12 @@ static long user_reg_get(struct user_reg __user *ureg, struct user_reg *kreg) > > > > * Registers a user_event on behalf of a user process. > > > > */ > > > > static long user_events_ioctl_reg(struct user_event_file_info *info, > > > > - unsigned long uarg) > > > > + struct file *file, unsigned long uarg) > > > > { > > > > struct user_reg __user *ureg = (struct user_reg __user *)uarg; > > > > struct user_reg reg; > > > > struct user_event *user; > > > > + struct user_event_enabler *enabler; > > > > char *name; > > > > long ret; > > > > > > > > @@ -1607,8 +1648,12 @@ static long user_events_ioctl_reg(struct user_event_file_info *info, > > > > if (ret < 0) > > > > return ret; > > > > > > > > + enabler = user_event_enabler_create(file, ®, user); > > > > + > > > > + if (!enabler) > > > > > > Shouldn't we free the user_event if needed here? > > > (I found the similar memory leak pattern in the above failure case > > > of the user_events_ref_add().) > > > > > > > user_events are shared across the entire group. They cannot be cleaned > > up until all references are gone. This is true both in this case and the > > in the user_events_ref_add() case. 
> > > > The pattern is to register events in the group's hashtable, then add > > them to the local file ref array that is RCU protected. If the file ref > > cannot be allocated, etc. the refcount on user is decremented. If we > > cannot create an enabler, the refcount is still held until file release. > > OK, when the ioctl returns, there should be 3 cases; > > - Return success, a new(existing) user_event added. > - Return error, no new user_event added. > - Return error, a new user_event added but enabler is not initialized. > > And in the last case, the new user_event will be released when user > closes the file. Could you comment it here? > Sure thing, I'll add it. > > > > If the event has already been added to the local file ref array, it is > > returned to prevent another reference. > > I'm not sure this point. Did you mean returning an error to prevent > registering the same event again? > If a process uses the same fd and registers the same event multiple times, then only 1 index is returned to the caller. If someone either purposely or accidentally does this, the appropriate index will be returned and we won't just keep allocating user_event objects. However, in a threaded application, there may be situations where thread A registers event E, and thread B registers event E as well. However, they may pass different enable address locations. In this case you will get 2 enablers for the single event. Right now the code does this all the time, it does not do it only if the enable address differs. I'm not sure how common this scenario is, but I think I should likely check if the address is different or not before creating another enabler in this case. Thoughts? > > > > > > > + return -ENOMEM; > > > > + > > > > put_user((u32)ret, &ureg->write_index); > > > > - put_user(user->index, &ureg->status_bit); > > > > > > > > return 0; > > > > } > > > [...] 
> > > > @@ -1849,7 +1863,6 @@ static int user_status_open(struct inode *node, struct file *file) > > > > > > > > static const struct file_operations user_status_fops = { > > > > .open = user_status_open, > > > > - .mmap = user_status_mmap, > > > > > > So, if this drops the mmap operation, can we drop the writable flag from > > > the status tracefs file? > > > > > > > Good catch, yes I'll remove this. > > Thanks! > > > > > > static int create_user_tracefs(void) > > > { > > > [...] > > > /* mmap with MAP_SHARED requires writable fd */ > > > emmap = tracefs_create_file("user_events_status", TRACE_MODE_WRITE, > > > NULL, NULL, &user_status_fops); > > > > > > Thank you, > > > > > > -- > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > > Thanks, > > -Beau > > > -- > Masami Hiramatsu (Google) <mhiramat@kernel.org> Thanks, -Beau ^ permalink raw reply [flat|nested] 31+ messages in thread
* [RFC PATCH 2/2] tracing/user_events: Fixup enable faults asyncly 2022-10-27 22:40 [RFC PATCH 0/2] tracing/user_events: Remote write ABI Beau Belgrave 2022-10-27 22:40 ` [RFC PATCH 1/2] tracing/user_events: Use remote writes for event enablement Beau Belgrave @ 2022-10-27 22:40 ` Beau Belgrave 2022-10-28 22:07 ` Mathieu Desnoyers 2022-10-28 22:19 ` Mathieu Desnoyers 2022-10-28 21:50 ` [RFC PATCH 0/2] tracing/user_events: Remote write ABI Mathieu Desnoyers 2022-10-31 14:15 ` Masami Hiramatsu 3 siblings, 2 replies; 31+ messages in thread From: Beau Belgrave @ 2022-10-27 22:40 UTC (permalink / raw) To: rostedt, mhiramat, mathieu.desnoyers, dcook, alanau Cc: linux-trace-devel, linux-kernel When events are enabled within the various tracing facilities, such as ftrace/perf, the event_mutex is held. As events are enabled pages are accessed. We do not want page faults to occur under this lock. Instead queue the fault to a workqueue to be handled in a process context safe way without the lock. The enable address is disabled while the async fault-in occurs. This ensures that we don't attempt fault-in more than is necessary. Once the page has been faulted in, the address write is attempted again. If the page couldn't fault-in, then we wait until the next time the event is enabled to prevent any potential infinite loops. 
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> --- kernel/trace/trace_events_user.c | 125 ++++++++++++++++++++++++++++++- 1 file changed, 121 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
index 633f24c2a1ac..f1eb8101e053 100644
--- a/kernel/trace/trace_events_user.c
+++ b/kernel/trace/trace_events_user.c
@@ -81,11 +81,22 @@ struct user_event_enabler {
 	struct list_head link;
 	struct mm_struct *mm;
 	struct file *file;
+	refcount_t refcnt;
 	unsigned long enable_addr;
 	unsigned int enable_bit: 5,
-		     __reserved: 27;
+		     __reserved: 26,
+		     disabled: 1;
 };

+/* Used for asynchronous faulting in of pages */
+struct user_event_enabler_fault {
+	struct work_struct work;
+	struct user_event_enabler *enabler;
+	struct user_event *event;
+};
+
+static struct kmem_cache *fault_cache;
+
 /*
  * Stores per-event properties, as users register events
  * within a file a user_event might be created if it does not
@@ -236,6 +247,19 @@ static void user_event_enabler_destroy(struct user_event_enabler *enabler)
 	kfree(enabler);
 }

+static __always_inline struct user_event_enabler
+*user_event_enabler_get(struct user_event_enabler *enabler)
+{
+	refcount_inc(&enabler->refcnt);
+	return enabler;
+}
+
+static void user_event_enabler_put(struct user_event_enabler *enabler)
+{
+	if (refcount_dec_and_test(&enabler->refcnt))
+		user_event_enabler_destroy(enabler);
+}
+
 static void user_event_enabler_remove(struct file *file,
 				      struct user_event *user)
 {
@@ -249,13 +273,93 @@ static void user_event_enabler_remove(struct file *file,
 		if (enabler->file != file)
 			continue;

+		enabler->disabled = 0;
 		list_del(&enabler->link);
-		user_event_enabler_destroy(enabler);
+		user_event_enabler_put(enabler);
 	}

 	mutex_unlock(&event_mutex);
 }

+static void user_event_enabler_write(struct user_event_enabler *enabler,
+				     struct user_event *user);
+
+static void user_event_enabler_fault_fixup(struct work_struct *work)
+{
+	struct user_event_enabler_fault *fault = container_of(
+		work, struct user_event_enabler_fault, work);
+	struct user_event_enabler *enabler = fault->enabler;
+	struct user_event *user = fault->event;
+	struct mm_struct *mm = enabler->mm;
+	unsigned long uaddr = enabler->enable_addr;
+	bool unlocked = false;
+	int ret;
+
+	might_sleep();
+
+	mmap_read_lock(mm);
+
+	ret = fixup_user_fault(mm, uaddr, FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE,
+			       &unlocked);
+
+	mmap_read_unlock(mm);
+
+	if (ret)
+		pr_warn("user_events: Fixup fault failed with %d "
+			"for mm: 0x%pK offset: 0x%llx event: %s\n", ret, mm,
+			(unsigned long long)uaddr, EVENT_NAME(user));
+
+	/* Prevent state changes from racing */
+	mutex_lock(&event_mutex);
+
+	/*
+	 * If we managed to get the page, re-issue the write. We do not
+	 * want to get into a possible infinite loop, which is why we only
+	 * attempt again directly if the page came in. If we couldn't get
+	 * the page here, then we will try again the next time the event is
+	 * enabled/disabled.
+	 */
+	enabler->disabled = 0;
+
+	if (!ret)
+		user_event_enabler_write(enabler, user);
+
+	mutex_unlock(&event_mutex);
+
+	refcount_dec(&user->refcnt);
+	user_event_enabler_put(enabler);
+	kmem_cache_free(fault_cache, fault);
+}
+
+static bool user_event_enabler_queue_fault(struct user_event_enabler *enabler,
+					   struct user_event *user)
+{
+	struct user_event_enabler_fault *fault;
+
+	fault = kmem_cache_zalloc(fault_cache, GFP_NOWAIT | __GFP_NOWARN);
+
+	if (!fault)
+		return false;
+
+	INIT_WORK(&fault->work, user_event_enabler_fault_fixup);
+	fault->enabler = user_event_enabler_get(enabler);
+	fault->event = user;
+
+	refcount_inc(&user->refcnt);
+	enabler->disabled = 1;
+
+	if (!schedule_work(&fault->work)) {
+		enabler->disabled = 0;
+		refcount_dec(&user->refcnt);
+		user_event_enabler_put(enabler);
+		kmem_cache_free(fault_cache, fault);
+
+		return false;
+	}
+
+	return true;
+}
+
 static void user_event_enabler_write(struct user_event_enabler *enabler,
 				     struct user_event *user)
 {
@@ -266,6 +370,11 @@ static void user_event_enabler_write(struct user_event_enabler *enabler,
 	void *kaddr;
 	int ret;

+	lockdep_assert_held(&event_mutex);
+
+	if (unlikely(enabler->disabled))
+		return;
+
 	mmap_read_lock(mm);

 	ret = pin_user_pages_remote(mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT,
@@ -273,8 +382,10 @@ static void user_event_enabler_write(struct user_event_enabler *enabler,

 	mmap_read_unlock(mm);

-	if (ret <= 0) {
-		pr_warn("user_events: Enable write failed\n");
+	if (unlikely(ret <= 0)) {
+		if (!user_event_enabler_queue_fault(enabler, user))
+			pr_warn("user_events: Unable to queue fault handler\n");
+
 		return;
 	}

@@ -321,6 +432,7 @@ static struct user_event_enabler
 	enabler->file = file;
 	enabler->enable_addr = (unsigned long)reg->enable_addr;
 	enabler->enable_bit = reg->enable_bit;
+	refcount_set(&enabler->refcnt, 1);

 	/* Prevents state changes from racing with new enablers */
 	mutex_lock(&event_mutex);
@@ -1902,6 +2014,11 @@ static int __init trace_events_user_init(void)
 {
 	int ret;

+	fault_cache = KMEM_CACHE(user_event_enabler_fault, 0);
+
+	if (!fault_cache)
+		return -ENOMEM;
+
 	init_group = user_event_group_create(&init_user_ns);

 	if (!init_group)
--
2.25.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [RFC PATCH 2/2] tracing/user_events: Fixup enable faults asyncly 2022-10-27 22:40 ` [RFC PATCH 2/2] tracing/user_events: Fixup enable faults asyncly Beau Belgrave @ 2022-10-28 22:07 ` Mathieu Desnoyers 2022-10-28 22:35 ` Beau Belgrave 2022-10-28 22:19 ` Mathieu Desnoyers 1 sibling, 1 reply; 31+ messages in thread From: Mathieu Desnoyers @ 2022-10-28 22:07 UTC (permalink / raw) To: Beau Belgrave, rostedt, mhiramat, dcook, alanau Cc: linux-trace-devel, linux-kernel On 2022-10-27 18:40, Beau Belgrave wrote: > When events are enabled within the various tracing facilities, such as > ftrace/perf, the event_mutex is held. As events are enabled pages are > accessed. We do not want page faults to occur under this lock. Instead > queue the fault to a workqueue to be handled in a process context safe > way without the lock. Good stuff! On my end, I've progressed on the "side" userspace instrumentation library prototype. It implements the static instrumentation layer to which a simple userspace printf-based tracer connects (for testing purposes). All moving parts are wired up now, including the type system and RCU to protect callback iteration against concurrent userspace tracer registration/unregistration. The top bit of the "enable" words are reserved for user events. So you can see the "TODO" where the user events ioctl/writev would be expected: https://github.com/compudj/side I'm still slightly unsure about using "uint32_t" for enable check, or going for "unsigned long". The core question there is whether a 32-bit word test would cause partial register stalls on 64-bit architectures. Going for unsigned long would require that user events receives information about the bitness of the word as input from userspace. (bit=63 rather than 31). Perhaps this is something the user events ABI should accommodate by reserving more than 5 bits to express the target bit ? > > The enable address is disabled while the async fault-in occurs. 
This > ensures that we don't attempt fault-in more than is necessary. Once the > page has been faulted in, the address write is attempted again. If the > page couldn't fault-in, then we wait until the next time the event is > enabled to prevent any potential infinite loops. So if the page is removed from the page cache between the point where it is faulted in and the moment the write is re-attempted, that will not trigger another attempt at paging in the page, am I correct ? I would think this is unexpected. I would expect that failing to fault in the page would stop any further attempts, but simply failing to pin the page after faulting it in should simply try again. Thoughts ? Thanks, Mathieu > > Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> > --- > kernel/trace/trace_events_user.c | 125 ++++++++++++++++++++++++++++++- > 1 file changed, 121 insertions(+), 4 deletions(-) > > diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c > index 633f24c2a1ac..f1eb8101e053 100644 > --- a/kernel/trace/trace_events_user.c > +++ b/kernel/trace/trace_events_user.c > @@ -81,11 +81,22 @@ struct user_event_enabler { > struct list_head link; > struct mm_struct *mm; > struct file *file; > + refcount_t refcnt; > unsigned long enable_addr; > unsigned int enable_bit: 5, > - __reserved: 27; > + __reserved: 26, > + disabled: 1; > }; > > +/* Used for asynchronous faulting in of pages */ > +struct user_event_enabler_fault { > + struct work_struct work; > + struct user_event_enabler *enabler; > + struct user_event *event; > +}; > + > +static struct kmem_cache *fault_cache; > + > /* > * Stores per-event properties, as users register events > * within a file a user_event might be created if it does not > @@ -236,6 +247,19 @@ static void user_event_enabler_destroy(struct user_event_enabler *enabler) > kfree(enabler); > } > > +static __always_inline struct user_event_enabler > +*user_event_enabler_get(struct user_event_enabler *enabler) > +{ > + 
refcount_inc(&enabler->refcnt); > + return enabler; > +} > + > +static void user_event_enabler_put(struct user_event_enabler *enabler) > +{ > + if (refcount_dec_and_test(&enabler->refcnt)) > + user_event_enabler_destroy(enabler); > +} > + > static void user_event_enabler_remove(struct file *file, > struct user_event *user) > { > @@ -249,13 +273,93 @@ static void user_event_enabler_remove(struct file *file, > if (enabler->file != file) > continue; > > + enabler->disabled = 0; > list_del(&enabler->link); > - user_event_enabler_destroy(enabler); > + user_event_enabler_put(enabler); > } > > mutex_unlock(&event_mutex); > } > > +static void user_event_enabler_write(struct user_event_enabler *enabler, > + struct user_event *user); > + > +static void user_event_enabler_fault_fixup(struct work_struct *work) > +{ > + struct user_event_enabler_fault *fault = container_of( > + work, struct user_event_enabler_fault, work); > + struct user_event_enabler *enabler = fault->enabler; > + struct user_event *user = fault->event; > + struct mm_struct *mm = enabler->mm; > + unsigned long uaddr = enabler->enable_addr; > + bool unlocked = false; > + int ret; > + > + might_sleep(); > + > + mmap_read_lock(mm); > + > + ret = fixup_user_fault(mm, uaddr, FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE, > + &unlocked); > + > + mmap_read_unlock(mm); > + > + if (ret) > + pr_warn("user_events: Fixup fault failed with %d " > + "for mm: 0x%pK offset: 0x%llx event: %s\n", ret, mm, > + (unsigned long long)uaddr, EVENT_NAME(user)); > + > + /* Prevent state changes from racing */ > + mutex_lock(&event_mutex); > + > + /* > + * If we managed to get the page, re-issue the write. We do not > + * want to get into a possible infinite loop, which is why we only > + * attempt again directly if the page came in. If we couldn't get > + * the page here, then we will try again the next time the event is > + * enabled/disabled. 
> + */ > + enabler->disabled = 0; > + > + if (!ret) > + user_event_enabler_write(enabler, user); > + > + mutex_unlock(&event_mutex); > + > + refcount_dec(&user->refcnt); > + user_event_enabler_put(enabler); > + kmem_cache_free(fault_cache, fault); > +} > + > +static bool user_event_enabler_queue_fault(struct user_event_enabler *enabler, > + struct user_event *user) > +{ > + struct user_event_enabler_fault *fault; > + > + fault = kmem_cache_zalloc(fault_cache, GFP_NOWAIT | __GFP_NOWARN); > + > + if (!fault) > + return false; > + > + INIT_WORK(&fault->work, user_event_enabler_fault_fixup); > + fault->enabler = user_event_enabler_get(enabler); > + fault->event = user; > + > + refcount_inc(&user->refcnt); > + enabler->disabled = 1; > + > + if (!schedule_work(&fault->work)) { > + enabler->disabled = 0; > + refcount_dec(&user->refcnt); > + user_event_enabler_put(enabler); > + kmem_cache_free(fault_cache, fault); > + > + return false; > + } > + > + return true; > +} > + > static void user_event_enabler_write(struct user_event_enabler *enabler, > struct user_event *user) > { > @@ -266,6 +370,11 @@ static void user_event_enabler_write(struct user_event_enabler *enabler, > void *kaddr; > int ret; > > + lockdep_assert_held(&event_mutex); > + > + if (unlikely(enabler->disabled)) > + return; > + > mmap_read_lock(mm); > > ret = pin_user_pages_remote(mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT, > @@ -273,8 +382,10 @@ static void user_event_enabler_write(struct user_event_enabler *enabler, > > mmap_read_unlock(mm); > > - if (ret <= 0) { > - pr_warn("user_events: Enable write failed\n"); > + if (unlikely(ret <= 0)) { > + if (!user_event_enabler_queue_fault(enabler, user)) > + pr_warn("user_events: Unable to queue fault handler\n"); > + > return; > } > > @@ -321,6 +432,7 @@ static struct user_event_enabler > enabler->file = file; > enabler->enable_addr = (unsigned long)reg->enable_addr; > enabler->enable_bit = reg->enable_bit; > + refcount_set(&enabler->refcnt, 1); > > /* Prevents state 
changes from racing with new enablers */ > mutex_lock(&event_mutex); > @@ -1902,6 +2014,11 @@ static int __init trace_events_user_init(void) > { > int ret; > > + fault_cache = KMEM_CACHE(user_event_enabler_fault, 0); > + > + if (!fault_cache) > + return -ENOMEM; > + > init_group = user_event_group_create(&init_user_ns); > > if (!init_group) -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH 2/2] tracing/user_events: Fixup enable faults asyncly 2022-10-28 22:07 ` Mathieu Desnoyers @ 2022-10-28 22:35 ` Beau Belgrave 2022-10-29 14:23 ` Mathieu Desnoyers 0 siblings, 1 reply; 31+ messages in thread From: Beau Belgrave @ 2022-10-28 22:35 UTC (permalink / raw) To: Mathieu Desnoyers Cc: rostedt, mhiramat, dcook, alanau, linux-trace-devel, linux-kernel On Fri, Oct 28, 2022 at 06:07:34PM -0400, Mathieu Desnoyers wrote: > On 2022-10-27 18:40, Beau Belgrave wrote: > > When events are enabled within the various tracing facilities, such as > > ftrace/perf, the event_mutex is held. As events are enabled pages are > > accessed. We do not want page faults to occur under this lock. Instead > > queue the fault to a workqueue to be handled in a process context safe > > way without the lock. > > Good stuff! On my end, I've progressed on the "side" userspace > instrumentation library prototype. It implements the static instrumentation > layer to which a simple userspace printf-based tracer connects (for testing > purposes). All moving parts are wired up now, including the type system and > RCU to protect callback iteration against concurrent userspace tracer > registration/unregistration. > Cool! > The top bit of the "enable" words are reserved for user events. So you can > see the "TODO" where the user events ioctl/writev would be expected: > > https://github.com/compudj/side > I'll check this out! > I'm still slightly unsure about using "uint32_t" for enable check, or going > for "unsigned long". The core question there is whether a 32-bit word test > would cause partial register stalls on 64-bit architectures. Going for > unsigned long would require that user events receives information about the > bitness of the word as input from userspace. (bit=63 rather than 31). > Perhaps this is something the user events ABI should accommodate by > reserving more than 5 bits to express the target bit ? > Yeah, I thought about this. 
From the user side you can actually use any arbitrary bits you want by passing a 32-bit aligned value. So you could take the lower or top half of a 64-bit value. The reason I limit to 0-31 bits is to ensure no page straddling occurs. Futex also does this, check out get_futex_key() in kernel/futex/core.c. I would recommend trying out uint64 but give the upper half to user_events ABI and checking if that works for you if you want say 63 bits to user space. You'd tell the ABI bit 31, but in user space it would be bit 63. > > > > The enable address is disabled while the async fault-in occurs. This > > ensures that we don't attempt fault-in more than is necessary. Once the > > page has been faulted in, the address write is attempted again. If the > > page couldn't fault-in, then we wait until the next time the event is > > enabled to prevent any potential infinite loops. > > So if the page is removed from the page cache between the point where it is > faulted in and the moment the write is re-attempted, that will not trigger > another attempt at paging in the page, am I correct ? > If we fault and the fixup user fault fails still with retries, then we give up until the next enablement of that site. However, if we fault and the fixup user fault with retries works, and then we fault again trying to write, that will retry. > I would think this is unexpected. I would expect that failing to fault in > the page would stop any further attempts, but simply failing to pin the page > after faulting it in should simply try again. > The issue I repro is mmap() register the enable at that location, then unmap(). All the faulting errors just tell you -EFAULT in this case, even though it was maliciously removed. In this case you could get the kernel to infinite loop trying to page it in. > Thoughts ? > We need to have some sanity toward giving up faulting in data that never will make it. The code currently assigns that line to if fixup_user_fault with retries fails, we give up. 
pin_user_pages_remote will not stop it from being attempted again. > Thanks, > > Mathieu > Thanks, -Beau ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH 2/2] tracing/user_events: Fixup enable faults asyncly 2022-10-28 22:35 ` Beau Belgrave @ 2022-10-29 14:23 ` Mathieu Desnoyers 2022-10-31 16:58 ` Beau Belgrave 0 siblings, 1 reply; 31+ messages in thread From: Mathieu Desnoyers @ 2022-10-29 14:23 UTC (permalink / raw) To: Beau Belgrave Cc: rostedt, mhiramat, dcook, alanau, linux-trace-devel, linux-kernel On 2022-10-28 18:35, Beau Belgrave wrote: > On Fri, Oct 28, 2022 at 06:07:34PM -0400, Mathieu Desnoyers wrote: >> On 2022-10-27 18:40, Beau Belgrave wrote: [...] > >> I'm still slightly unsure about using "uint32_t" for enable check, or going >> for "unsigned long". The core question there is whether a 32-bit word test >> would cause partial register stalls on 64-bit architectures. Going for >> unsigned long would require that user events receives information about the >> bitness of the word as input from userspace. (bit=63 rather than 31). >> Perhaps this is something the user events ABI should accommodate by >> reserving more than 5 bits to express the target bit ? >> > > Yeah, I thought about this. From the user side you can actually use any > arbitrary bits you want by passing a 32-bit aligned value. So you could > take the lower or top half of a 64-bit value. The reason I limit to 0-31 > bits is to ensure no page straddling occurs. Futex also does this, check > out get_futex_key() in kernel/futex/core.c. We can ensure no page straddling by checking the address alignment compared to the size of the integer (4 or 8 bytes) also received as input. > > I would recommend trying out uint64 but give the upper half to > user_events ABI and checking if that works for you if you want say 63 > bits to user space. You'd tell the ABI bit 31, but in user space it > would be bit 63. That's tricky because Linux supports both big and little endian, which will make the ABI tricky to use if we try to be too clever about this. 
Also, just stating a "pointer" and "bits offset" is not enough if we want to support 32-bit and 64-bit integer types, because the bit offset does not consider endianness. I would recommend to get the following information through the user events registration: - pointer address, - size of integer type (4 or 8 bytes), - bit offset (currently 31 or 63). This way we have all the information we need to set the right bit in both little and big endian. We also have all the information we need to validate natural alignment, e.g.: /* Check alignment */ if (addr & (type_size - 1)) return -EINVAL; /* Check bit offset range */ if (bit_offset > (type_size * CHAR_BIT) - 1) return -EINVAL; switch (type_size) { case 4: /* Update 32-bit integer */ break; #if BITS_PER_LONG >= 64 case 8: /* Update 64-bit integer */ break; #endif default: return -EINVAL; } > >>> >>> The enable address is disabled while the async fault-in occurs. This >>> ensures that we don't attempt fault-in more than is necessary. Once the >>> page has been faulted in, the address write is attempted again. If the >>> page couldn't fault-in, then we wait until the next time the event is >>> enabled to prevent any potential infinite loops. >> >> So if the page is removed from the page cache between the point where it is >> faulted in and the moment the write is re-attempted, that will not trigger >> another attempt at paging in the page, am I correct ? >> > > If we fault and the fixup user fault fails still with retries, then we > give up until the next enablement of that site. > > However, if we fault and the fixup user fault with retries works, and > then we fault again trying to write, that will retry. > >> I would think this is unexpected. I would expect that failing to fault in >> the page would stop any further attempts, but simply failing to pin the page >> after faulting it in should simply try again. >> > > The issue I repro is mmap() register the enable at that location, then > unmap(). 
All the faulting errors just tell you -EFAULT in this case, > even though it was maliciously removed. In this case you could get the > kernel to infinite loop trying to page it in. > >> Thoughts ? >> > > We need to have some sanity toward giving up faulting in data that never > will make it. The code currently assigns that line to if > fixup_user_fault with retries fails, we give up. pin_user_pages_remote > will not stop it from being attempted again. What are the legitimate cases that can make fixup_user_fault fail ? If there are none, and the only case that can make this fail is userspace unmapping memory, then we should very well kill the offending process. An example here is futex fault_in_user_writeable(). When it fails, it appears that the futex_wake_op simply returns -EFAULT to the caller. Thanks, Mathieu > >> Thanks, >> >> Mathieu >> > > Thanks, > -Beau -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH 2/2] tracing/user_events: Fixup enable faults asyncly 2022-10-29 14:23 ` Mathieu Desnoyers @ 2022-10-31 16:58 ` Beau Belgrave 0 siblings, 0 replies; 31+ messages in thread From: Beau Belgrave @ 2022-10-31 16:58 UTC (permalink / raw) To: Mathieu Desnoyers Cc: rostedt, mhiramat, dcook, alanau, linux-trace-devel, linux-kernel On Sat, Oct 29, 2022 at 10:23:02AM -0400, Mathieu Desnoyers wrote: > On 2022-10-28 18:35, Beau Belgrave wrote: > > On Fri, Oct 28, 2022 at 06:07:34PM -0400, Mathieu Desnoyers wrote: > > > On 2022-10-27 18:40, Beau Belgrave wrote: > > [...] > > > > > > I'm still slightly unsure about using "uint32_t" for enable check, or going > > > for "unsigned long". The core question there is whether a 32-bit word test > > > would cause partial register stalls on 64-bit architectures. Going for > > > unsigned long would require that user events receives information about the > > > bitness of the word as input from userspace. (bit=63 rather than 31). > > > Perhaps this is something the user events ABI should accommodate by > > > reserving more than 5 bits to express the target bit ? > > > > > > > Yeah, I thought about this. From the user side you can actually use any > > arbitrary bits you want by passing a 32-bit aligned value. So you could > > take the lower or top half of a 64-bit value. The reason I limit to 0-31 > > bits is to ensure no page straddling occurs. Futex also does this, check > > out get_futex_key() in kernel/futex/core.c. > > We can ensure no page straddling by checking the address alignment compared > to the size of the integer (4 or 8 bytes) also received as input. > Ok. > > > > I would recommend trying out uint64 but give the upper half to > > user_events ABI and checking if that works for you if you want say 63 > > bits to user space. You'd tell the ABI bit 31, but in user space it > > would be bit 63. 
> > That's tricky because Linux supports both big and little endian, which will > make the ABI tricky to use if we try to be too clever about this. > > Also, just stating a "pointer" and "bits offset" is not enough if we want to > support 32-bit and 64-bit integer types, because the bit offset does not > consider endianness. > > I would recommend to get the following information through the user events > registration: > > - pointer address, > - size of integer type (4 or 8 bytes), > - bit offset (currently 31 or 63). > > This way we have all the information we need to set the right bit in both > little and big endian. We also have all the information we need to validate > natural alignment, e.g.: > > /* Check alignment */ > if (addr & (type_size - 1)) > return -EINVAL; > /* Check bit offset range */ > if (bit_offset > (type_size * CHAR_BIT) - 1) > return -EINVAL; > switch (type_size) { > case 4: > /* Update 32-bit integer */ > break; > #if BITS_PER_LONG >= 64 > case 8: > /* Update 64-bit integer */ > break; > #endif > default: > return -EINVAL; > } > Sounds good, I'll update this. > > > > > > > > > > > The enable address is disabled while the async fault-in occurs. This > > > > ensures that we don't attempt fault-in more than is necessary. Once the > > > > page has been faulted in, the address write is attempted again. If the > > > > page couldn't fault-in, then we wait until the next time the event is > > > > enabled to prevent any potential infinite loops. > > > > > > So if the page is removed from the page cache between the point where it is > > > faulted in and the moment the write is re-attempted, that will not trigger > > > another attempt at paging in the page, am I correct ? > > > > > > > If we fault and the fixup user fault fails still with retries, then we > > give up until the next enablement of that site. > > > > However, if we fault and the fixup user fault with retries works, and > > then we fault again trying to write, that will retry. 
> > > > > I would think this is unexpected. I would expect that failing to fault in > > > the page would stop any further attempts, but simply failing to pin the page > > > after faulting it in should simply try again. > > > > > > > The issue I repro is mmap() register the enable at that location, then > > unmap(). All the faulting errors just tell you -EFAULT in this case, > > even though it was maliciously removed. In this case you could get the > > kernel to infinite loop trying to page it in. > > > > > Thoughts ? > > > > > > > We need to have some sanity toward giving up faulting in data that never > > will make it. The code currently assigns that line to if > > fixup_user_fault with retries fails, we give up. pin_user_pages_remote > > will not stop it from being attempted again. > > What are the legitimate cases that can make fixup_user_fault fail ? If there > are none, and the only case that can make this fail is userspace unmapping > memory, then we should very well kill the offending process. > I believe only VM_FAULT_ERROR: #define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | \ VM_FAULT_SIGSEGV | VM_FAULT_HWPOISON | \ VM_FAULT_HWPOISON_LARGE | VM_FAULT_FALLBACK) I don't believe in any of the above cases we should retry indefinitely for. > An example here is futex fault_in_user_writeable(). When it fails, it > appears that the futex_wake_op simply returns -EFAULT to the caller. > fault_in_user_writeable() calls fixup_user_fault(), so it should be the same scenario / failures as above. > Thanks, > > Mathieu > > > > > > Thanks, > > > > > > Mathieu > > > > > > > Thanks, > > -Beau > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com Thanks, -Beau ^ permalink raw reply [flat|nested] 31+ messages in thread
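The registration-time validation Mathieu sketches earlier in this exchange (pointer address, integer size, bit offset) can be written as a small self-contained check. This is an illustrative user-space model only; the function name and the bare `-EINVAL` convention are placeholders, not the final user_events ABI:

```c
#include <errno.h>
#include <limits.h>

/*
 * Validate a user-provided enable address. Natural alignment against
 * the integer size both rejects misaligned words and guarantees the
 * value cannot straddle two pages; the bit offset must fall inside
 * the word (0..31 for 4 bytes, 0..63 for 8 bytes).
 */
static int enable_check_valid(unsigned long addr, int type_size, int bit_offset)
{
	/* Only 32-bit and 64-bit words are supported */
	if (type_size != 4 && type_size != 8)
		return -EINVAL;

	/* Check natural alignment */
	if (addr & (type_size - 1))
		return -EINVAL;

	/* Check bit offset range */
	if (bit_offset < 0 || bit_offset > type_size * CHAR_BIT - 1)
		return -EINVAL;

	return 0;
}
```

With both the size and the bit offset supplied explicitly, the kernel has enough information to set the correct bit on either endianness, which is the point of Mathieu's proposal.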
* Re: [RFC PATCH 2/2] tracing/user_events: Fixup enable faults asyncly 2022-10-27 22:40 ` [RFC PATCH 2/2] tracing/user_events: Fixup enable faults asyncly Beau Belgrave 2022-10-28 22:07 ` Mathieu Desnoyers @ 2022-10-28 22:19 ` Mathieu Desnoyers 2022-10-28 22:42 ` Beau Belgrave 1 sibling, 1 reply; 31+ messages in thread From: Mathieu Desnoyers @ 2022-10-28 22:19 UTC (permalink / raw) To: Beau Belgrave, rostedt, mhiramat, dcook, alanau Cc: linux-trace-devel, linux-kernel On 2022-10-27 18:40, Beau Belgrave wrote: > When events are enabled within the various tracing facilities, such as > ftrace/perf, the event_mutex is held. As events are enabled pages are > accessed. We do not want page faults to occur under this lock. Instead > queue the fault to a workqueue to be handled in a process context safe > way without the lock. > > The enable address is disabled while the async fault-in occurs. This > ensures that we don't attempt fault-in more than is necessary. Once the > page has been faulted in, the address write is attempted again. If the > page couldn't fault-in, then we wait until the next time the event is > enabled to prevent any potential infinite loops. I'm also unclear about how the system call initiating the enabled state change is delayed (or not) when a page fault is queued. I would expect that when a page fault is needed, after enqueuing work to the worker thread, the system call initiating the state change would somehow wait for a completion (after releasing the user events mutex). That completion would be signaled by the worker thread either if the page fault fails, or if the state change is done. Thoughts ? 
Thanks, Mathieu > > Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> > --- > kernel/trace/trace_events_user.c | 125 ++++++++++++++++++++++++++++++- > 1 file changed, 121 insertions(+), 4 deletions(-) > > diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c > index 633f24c2a1ac..f1eb8101e053 100644 > --- a/kernel/trace/trace_events_user.c > +++ b/kernel/trace/trace_events_user.c > @@ -81,11 +81,22 @@ struct user_event_enabler { > struct list_head link; > struct mm_struct *mm; > struct file *file; > + refcount_t refcnt; > unsigned long enable_addr; > unsigned int enable_bit: 5, > - __reserved: 27; > + __reserved: 26, > + disabled: 1; > }; > > +/* Used for asynchronous faulting in of pages */ > +struct user_event_enabler_fault { > + struct work_struct work; > + struct user_event_enabler *enabler; > + struct user_event *event; > +}; > + > +static struct kmem_cache *fault_cache; > + > /* > * Stores per-event properties, as users register events > * within a file a user_event might be created if it does not > @@ -236,6 +247,19 @@ static void user_event_enabler_destroy(struct user_event_enabler *enabler) > kfree(enabler); > } > > +static __always_inline struct user_event_enabler > +*user_event_enabler_get(struct user_event_enabler *enabler) > +{ > + refcount_inc(&enabler->refcnt); > + return enabler; > +} > + > +static void user_event_enabler_put(struct user_event_enabler *enabler) > +{ > + if (refcount_dec_and_test(&enabler->refcnt)) > + user_event_enabler_destroy(enabler); > +} > + > static void user_event_enabler_remove(struct file *file, > struct user_event *user) > { > @@ -249,13 +273,93 @@ static void user_event_enabler_remove(struct file *file, > if (enabler->file != file) > continue; > > + enabler->disabled = 0; > list_del(&enabler->link); > - user_event_enabler_destroy(enabler); > + user_event_enabler_put(enabler); > } > > mutex_unlock(&event_mutex); > } > > +static void user_event_enabler_write(struct user_event_enabler 
*enabler, > + struct user_event *user); > + > +static void user_event_enabler_fault_fixup(struct work_struct *work) > +{ > + struct user_event_enabler_fault *fault = container_of( > + work, struct user_event_enabler_fault, work); > + struct user_event_enabler *enabler = fault->enabler; > + struct user_event *user = fault->event; > + struct mm_struct *mm = enabler->mm; > + unsigned long uaddr = enabler->enable_addr; > + bool unlocked = false; > + int ret; > + > + might_sleep(); > + > + mmap_read_lock(mm); > + > + ret = fixup_user_fault(mm, uaddr, FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE, > + &unlocked); > + > + mmap_read_unlock(mm); > + > + if (ret) > + pr_warn("user_events: Fixup fault failed with %d " > + "for mm: 0x%pK offset: 0x%llx event: %s\n", ret, mm, > + (unsigned long long)uaddr, EVENT_NAME(user)); > + > + /* Prevent state changes from racing */ > + mutex_lock(&event_mutex); > + > + /* > + * If we managed to get the page, re-issue the write. We do not > + * want to get into a possible infinite loop, which is why we only > + * attempt again directly if the page came in. If we couldn't get > + * the page here, then we will try again the next time the event is > + * enabled/disabled. 
> + */ > + enabler->disabled = 0; > + > + if (!ret) > + user_event_enabler_write(enabler, user); > + > + mutex_unlock(&event_mutex); > + > + refcount_dec(&user->refcnt); > + user_event_enabler_put(enabler); > + kmem_cache_free(fault_cache, fault); > +} > + > +static bool user_event_enabler_queue_fault(struct user_event_enabler *enabler, > + struct user_event *user) > +{ > + struct user_event_enabler_fault *fault; > + > + fault = kmem_cache_zalloc(fault_cache, GFP_NOWAIT | __GFP_NOWARN); > + > + if (!fault) > + return false; > + > + INIT_WORK(&fault->work, user_event_enabler_fault_fixup); > + fault->enabler = user_event_enabler_get(enabler); > + fault->event = user; > + > + refcount_inc(&user->refcnt); > + enabler->disabled = 1; > + > + if (!schedule_work(&fault->work)) { > + enabler->disabled = 0; > + refcount_dec(&user->refcnt); > + user_event_enabler_put(enabler); > + kmem_cache_free(fault_cache, fault); > + > + return false; > + } > + > + return true; > +} > + > static void user_event_enabler_write(struct user_event_enabler *enabler, > struct user_event *user) > { > @@ -266,6 +370,11 @@ static void user_event_enabler_write(struct user_event_enabler *enabler, > void *kaddr; > int ret; > > + lockdep_assert_held(&event_mutex); > + > + if (unlikely(enabler->disabled)) > + return; > + > mmap_read_lock(mm); > > ret = pin_user_pages_remote(mm, uaddr, 1, FOLL_WRITE | FOLL_NOFAULT, > @@ -273,8 +382,10 @@ static void user_event_enabler_write(struct user_event_enabler *enabler, > > mmap_read_unlock(mm); > > - if (ret <= 0) { > - pr_warn("user_events: Enable write failed\n"); > + if (unlikely(ret <= 0)) { > + if (!user_event_enabler_queue_fault(enabler, user)) > + pr_warn("user_events: Unable to queue fault handler\n"); > + > return; > } > > @@ -321,6 +432,7 @@ static struct user_event_enabler > enabler->file = file; > enabler->enable_addr = (unsigned long)reg->enable_addr; > enabler->enable_bit = reg->enable_bit; > + refcount_set(&enabler->refcnt, 1); > > /* Prevents state 
changes from racing with new enablers */ > mutex_lock(&event_mutex); > @@ -1902,6 +2014,11 @@ static int __init trace_events_user_init(void) > { > int ret; > > + fault_cache = KMEM_CACHE(user_event_enabler_fault, 0); > + > + if (!fault_cache) > + return -ENOMEM; > + > init_group = user_event_group_create(&init_user_ns); > > if (!init_group) -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 31+ messages in thread
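The enabler lifetime pattern in the patch above (refcount of 1 at creation, a get before handing the enabler to the fault worker, destruction only on the final put) can be modeled in user space. This is a sketch: kernel `refcount_t` is approximated with C11 atomics, and the `destroyed` flag stands in for `kfree()` so the behavior is observable; saturation protection and the real locking are omitted.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct enabler {
	atomic_int refcnt;
	bool destroyed;	/* stand-in for the object actually being freed */
};

static void enabler_init(struct enabler *e)
{
	/* Mirrors refcount_set(&enabler->refcnt, 1) at creation */
	atomic_init(&e->refcnt, 1);
	e->destroyed = false;
}

static struct enabler *enabler_get(struct enabler *e)
{
	/* Mirrors user_event_enabler_get(): take a reference for the worker */
	atomic_fetch_add(&e->refcnt, 1);
	return e;
}

static void enabler_put(struct enabler *e)
{
	/* Mirrors refcount_dec_and_test(): last put tears the object down */
	if (atomic_fetch_sub(&e->refcnt, 1) == 1)
		e->destroyed = true;
}
```

The key property is that the queued fault work holds its own reference, so an enabler removed via `user_event_enabler_remove()` stays alive until the worker's put runs.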
* Re: [RFC PATCH 2/2] tracing/user_events: Fixup enable faults asyncly 2022-10-28 22:19 ` Mathieu Desnoyers @ 2022-10-28 22:42 ` Beau Belgrave 2022-10-29 14:40 ` Mathieu Desnoyers 0 siblings, 1 reply; 31+ messages in thread From: Beau Belgrave @ 2022-10-28 22:42 UTC (permalink / raw) To: Mathieu Desnoyers Cc: rostedt, mhiramat, dcook, alanau, linux-trace-devel, linux-kernel On Fri, Oct 28, 2022 at 06:19:10PM -0400, Mathieu Desnoyers wrote: > On 2022-10-27 18:40, Beau Belgrave wrote: > > When events are enabled within the various tracing facilities, such as > > ftrace/perf, the event_mutex is held. As events are enabled pages are > > accessed. We do not want page faults to occur under this lock. Instead > > queue the fault to a workqueue to be handled in a process context safe > > way without the lock. > > > > The enable address is disabled while the async fault-in occurs. This > > ensures that we don't attempt fault-in more than is necessary. Once the > > page has been faulted in, the address write is attempted again. If the > > page couldn't fault-in, then we wait until the next time the event is > > enabled to prevent any potential infinite loops. > > I'm also unclear about how the system call initiating the enabled state > change is delayed (or not) when a page fault is queued. > It's not, if needed we could call schedule_delayed_work(). However, I don't think we need it. When pin_user_pages_remote is invoked, it's with FOLL_NOFAULT. This will tell us if we need to fault pages in, we then call fixup_user_fault with unlocked value passed. This will cause the fixup to retry and get the page in. It's called out in the comments for this exact purpose (lucked out here): mm/gup.c * This is meant to be called in the specific scenario where for locking reasons * we try to access user memory in atomic context (within a pagefault_disable() * section), this returns -EFAULT, and we want to resolve the user fault before * trying again. 
The fault in happens in a workqueue, this is the same way KVM does it's async page fault logic, so it's not a new pattern. As soon as the fault-in is done, we have a page we should be able to use, so we re-attempt the write immediately. If the write fails, another queue happens and we could loop, but not like the unmap() case I replied with earlier. > I would expect that when a page fault is needed, after enqueuing work to the > worker thread, the system call initiating the state change would somehow > wait for a completion (after releasing the user events mutex). That > completion would be signaled by the worker thread either if the page fault > fails, or if the state change is done. > I didn't see a generic fault-in + notify anywhere, do you know of one I could use? Otherwise, it seems the pattern used is check fault, fault-in via workqueue, re-attempt. > Thoughts ? > > Thanks, > > Mathieu > Thanks, -Beau ^ permalink raw reply [flat|nested] 31+ messages in thread
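The check-fault / fault-in-via-workqueue / re-attempt flow Beau describes can be sketched as a toy user-space model. Everything here is simulated: `page_present` stands in for `pin_user_pages_remote()` with `FOLL_NOFAULT` succeeding, the worker thread stands in for the workqueue item calling `fixup_user_fault()`, and all names are illustrative:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool page_present;
static _Atomic unsigned int enable_word;

/* Attempt the write without faulting, as FOLL_NOFAULT pinning would */
static bool try_write_bit(int bit)
{
	if (!atomic_load(&page_present))
		return false;	/* would have faulted; caller queues work */
	atomic_fetch_or(&enable_word, 1u << bit);
	return true;
}

/* Queued work: fault the page in, then re-issue the write immediately */
static void *fault_worker(void *arg)
{
	int bit = *(int *)arg;

	atomic_store(&page_present, true);	/* fixup_user_fault() succeeded */
	try_write_bit(bit);			/* re-attempt the write */
	return NULL;
}

static int enabler_write(int bit)
{
	static int bit_arg;
	pthread_t worker;

	if (try_write_bit(bit))
		return 0;

	/* Page not resident: hand the fault off to "process context" */
	bit_arg = bit;
	if (pthread_create(&worker, NULL, fault_worker, &bit_arg))
		return -1;
	/* Joined here only to keep the demo deterministic; the real
	 * workqueue item runs asynchronously. */
	return pthread_join(worker, NULL);
}
```

This mirrors the shape of the patch: the fast path never faults under `event_mutex`, and the slow path retries the write once the page is resident.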
* Re: [RFC PATCH 2/2] tracing/user_events: Fixup enable faults asyncly 2022-10-28 22:42 ` Beau Belgrave @ 2022-10-29 14:40 ` Mathieu Desnoyers 2022-10-30 11:45 ` Mathieu Desnoyers 2022-10-31 17:12 ` Beau Belgrave 0 siblings, 2 replies; 31+ messages in thread From: Mathieu Desnoyers @ 2022-10-29 14:40 UTC (permalink / raw) To: Beau Belgrave Cc: rostedt, mhiramat, dcook, alanau, linux-trace-devel, linux-kernel On 2022-10-28 18:42, Beau Belgrave wrote: > On Fri, Oct 28, 2022 at 06:19:10PM -0400, Mathieu Desnoyers wrote: >> On 2022-10-27 18:40, Beau Belgrave wrote: >>> When events are enabled within the various tracing facilities, such as >>> ftrace/perf, the event_mutex is held. As events are enabled pages are >>> accessed. We do not want page faults to occur under this lock. Instead >>> queue the fault to a workqueue to be handled in a process context safe >>> way without the lock. >>> >>> The enable address is disabled while the async fault-in occurs. This >>> ensures that we don't attempt fault-in more than is necessary. Once the >>> page has been faulted in, the address write is attempted again. If the >>> page couldn't fault-in, then we wait until the next time the event is >>> enabled to prevent any potential infinite loops. >> >> I'm also unclear about how the system call initiating the enabled state >> change is delayed (or not) when a page fault is queued. >> > > It's not, if needed we could call schedule_delayed_work(). However, I > don't think we need it. When pin_user_pages_remote is invoked, it's with > FOLL_NOFAULT. This will tell us if we need to fault pages in, we then > call fixup_user_fault with unlocked value passed. This will cause the > fixup to retry and get the page in. 
> > It's called out in the comments for this exact purpose (lucked out > here): > mm/gup.c > * This is meant to be called in the specific scenario where for locking reasons > * we try to access user memory in atomic context (within a pagefault_disable() > * section), this returns -EFAULT, and we want to resolve the user fault before > * trying again. > > The fault in happens in a workqueue, this is the same way KVM does it's > async page fault logic, so it's not a new pattern. As soon as the > fault-in is done, we have a page we should be able to use, so we > re-attempt the write immediately. If the write fails, another queue > happens and we could loop, but not like the unmap() case I replied with > earlier. > >> I would expect that when a page fault is needed, after enqueuing work to the >> worker thread, the system call initiating the state change would somehow >> wait for a completion (after releasing the user events mutex). That >> completion would be signaled by the worker thread either if the page fault >> fails, or if the state change is done. >> > > I didn't see a generic fault-in + notify anywhere, do you know of one I > could use? Otherwise, it seems the pattern used is check fault, fault-in > via workqueue, re-attempt. I don't think we're talking about the same thing. I'll try stating my concern in a different way. user_event_enabler_write() calls user_event_enabler_queue_fault() when it cannot perform the enabled state update immediately (because a page fault is needed). But then AFAIU it returns immediately to the caller. The caller could very well expect that the event has been enabled, as requested, immediately when the enabler write returns. The fact that enabling the event can be delayed for an arbitrary amount of time due to page faults means that we have no hard guarantee that the event is enabled as requested upon return to the caller. 
I'd like to add a completion there, to be waited for after user_event_enabler_queue_fault(), but before returning from user_event_enabler_create(). Waiting for the completion should be done without any mutexes held, so after releasing event_mutex. The completion would be placed on the stack of user_event_enabler_create(), and a reference to the completion would be placed in the queued fault request. The completion notification would be emitted by the worker thread either when enabling is done, or if a page fault fails. See include/linux/completion.h Thoughts ? Thanks, Mathieu > >> Thoughts ? >> >> Thanks, >> >> Mathieu >> > > Thanks, > -Beau -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 31+ messages in thread
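The hand-off Mathieu proposes can be modeled in user space with a pthread mutex and condition variable standing in for `struct completion`'s `wait_for_completion()`/`complete()`. This sketch also threads a status through the completion so the waiter learns whether the page fault failed; the kernel's `complete()` takes no such argument, so that detail is an assumption of this model:

```c
#include <pthread.h>

struct completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int done;
	int status;	/* 0 on success, an -EFAULT-style error otherwise */
};

static void init_completion(struct completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = 0;
	c->status = 0;
}

/* Called by the worker when the state change is done or the fault failed */
static void complete(struct completion *c, int status)
{
	pthread_mutex_lock(&c->lock);
	c->status = status;
	c->done = 1;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/* Called by the registering thread, after releasing event_mutex */
static int wait_for_completion(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
	return c->status;
}

/* Worker thread: would perform the faulting write, then signal the waiter */
static void *fault_worker(void *arg)
{
	complete(arg, 0);	/* report success to the registering thread */
	return NULL;
}
```

As Mathieu notes, the completion lives on the registering caller's stack and must be waited on with no mutexes held, so the worker can take `event_mutex` itself without deadlocking.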
* Re: [RFC PATCH 2/2] tracing/user_events: Fixup enable faults asyncly 2022-10-29 14:40 ` Mathieu Desnoyers @ 2022-10-30 11:45 ` Mathieu Desnoyers 2022-10-31 17:18 ` Beau Belgrave 2022-10-31 17:12 ` Beau Belgrave 1 sibling, 1 reply; 31+ messages in thread From: Mathieu Desnoyers @ 2022-10-30 11:45 UTC (permalink / raw) To: Beau Belgrave Cc: rostedt, mhiramat, dcook, alanau, linux-trace-devel, linux-kernel On 2022-10-29 10:40, Mathieu Desnoyers wrote: > On 2022-10-28 18:42, Beau Belgrave wrote: >> On Fri, Oct 28, 2022 at 06:19:10PM -0400, Mathieu Desnoyers wrote: >>> On 2022-10-27 18:40, Beau Belgrave wrote: >>>> When events are enabled within the various tracing facilities, such as >>>> ftrace/perf, the event_mutex is held. As events are enabled pages are >>>> accessed. We do not want page faults to occur under this lock. Instead >>>> queue the fault to a workqueue to be handled in a process context safe >>>> way without the lock. >>>> >>>> The enable address is disabled while the async fault-in occurs. This >>>> ensures that we don't attempt fault-in more than is necessary. Once the >>>> page has been faulted in, the address write is attempted again. If the >>>> page couldn't fault-in, then we wait until the next time the event is >>>> enabled to prevent any potential infinite loops. >>> >>> I'm also unclear about how the system call initiating the enabled state >>> change is delayed (or not) when a page fault is queued. >>> >> >> It's not, if needed we could call schedule_delayed_work(). However, I >> don't think we need it. When pin_user_pages_remote is invoked, it's with >> FOLL_NOFAULT. This will tell us if we need to fault pages in, we then >> call fixup_user_fault with unlocked value passed. This will cause the >> fixup to retry and get the page in. 
>> >> It's called out in the comments for this exact purpose (lucked out >> here): >> mm/gup.c >> * This is meant to be called in the specific scenario where for >> locking reasons >> * we try to access user memory in atomic context (within a >> pagefault_disable() >> * section), this returns -EFAULT, and we want to resolve the user >> fault before >> * trying again. >> >> The fault in happens in a workqueue, this is the same way KVM does it's >> async page fault logic, so it's not a new pattern. As soon as the >> fault-in is done, we have a page we should be able to use, so we >> re-attempt the write immediately. If the write fails, another queue >> happens and we could loop, but not like the unmap() case I replied with >> earlier. >> >>> I would expect that when a page fault is needed, after enqueuing work >>> to the >>> worker thread, the system call initiating the state change would somehow >>> wait for a completion (after releasing the user events mutex). That >>> completion would be signaled by the worker thread either if the page >>> fault >>> fails, or if the state change is done. >>> >> >> I didn't see a generic fault-in + notify anywhere, do you know of one I >> could use? Otherwise, it seems the pattern used is check fault, fault-in >> via workqueue, re-attempt. > > I don't think we're talking about the same thing. I'll try stating my > concern in a different way. > > user_event_enabler_write() calls user_event_enabler_queue_fault() when > it cannot perform the enabled state update immediately (because a page > fault is needed). > > But then AFAIU it returns immediately to the caller. The caller could > very well expect that the event has been enabled, as requested, > immediately when the enabler write returns. The fact that enabling the > event can be delayed for an arbitrary amount of time due to page faults > means that we have no hard guarantee that the event is enabled as > requested upon return to the caller. 
> > I'd like to add a completion there, to be waited for after > user_event_enabler_queue_fault(), but before returning from > user_event_enabler_create(). Waiting for the completion should be done > without any mutexes held, so after releasing event_mutex. > > The completion would be placed on the stack of > user_event_enabler_create(), and a reference to the completion would be > placed in the queued fault request. The completion notification would be > emitted by the worker thread either when enabling is done, or if a page > fault fails. > > See include/linux/completion.h > > Thoughts ? Actually, after further thinking, I wonder if we need a worker thread at all to perform the page faults. Could we simply handle the page faults from user_event_enabler_create() by releasing the mutex and re-trying ? From what contexts is user_event_enabler_create() called ? (any locks taken ? system call context ? irqs and preemption enabled or disabled ?) Thanks, Mathieu > > Thanks, > > Mathieu > > >> >>> Thoughts ? >>> >>> Thanks, >>> >>> Mathieu >>> >> >> Thanks, >> -Beau > -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 31+ messages in thread
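The synchronous alternative Mathieu raises (no worker thread at all: drop the mutex, fault the page in, retry) can be sketched as a bounded retry loop. The locking and fault primitives are simulated here, and the retry cap is an assumption added to avoid the malicious-unmap infinite loop discussed earlier in the thread:

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t event_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool page_present;
static unsigned int enable_word;

/* Non-faulting write attempt, done under event_mutex */
static bool try_write_locked(int bit)
{
	if (!page_present)
		return false;	/* pinning with FOLL_NOFAULT would fail */
	enable_word |= 1u << bit;
	return true;
}

static int enabler_create_sync(int bit)
{
	int tries;

	for (tries = 0; tries < 3; tries++) {
		bool ok;

		pthread_mutex_lock(&event_mutex);
		ok = try_write_locked(bit);
		pthread_mutex_unlock(&event_mutex);

		if (ok)
			return 0;

		/* Fault the page in with no mutex held; this line stands
		 * in for fixup_user_fault() on the registering thread. */
		page_present = true;
	}
	return -1;	/* give up rather than loop forever */
}
```

This works for the registration IOCTL, which runs in process context; the workqueue remains necessary for the perf/ftrace enablement callbacks, which run with `event_mutex` held and must not fault.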
* Re: [RFC PATCH 2/2] tracing/user_events: Fixup enable faults asyncly 2022-10-30 11:45 ` Mathieu Desnoyers @ 2022-10-31 17:18 ` Beau Belgrave 0 siblings, 0 replies; 31+ messages in thread From: Beau Belgrave @ 2022-10-31 17:18 UTC (permalink / raw) To: Mathieu Desnoyers Cc: rostedt, mhiramat, dcook, alanau, linux-trace-devel, linux-kernel On Sun, Oct 30, 2022 at 07:45:33AM -0400, Mathieu Desnoyers wrote: > On 2022-10-29 10:40, Mathieu Desnoyers wrote: > > On 2022-10-28 18:42, Beau Belgrave wrote: > > > On Fri, Oct 28, 2022 at 06:19:10PM -0400, Mathieu Desnoyers wrote: > > > > On 2022-10-27 18:40, Beau Belgrave wrote: > > > > > When events are enabled within the various tracing facilities, such as > > > > > ftrace/perf, the event_mutex is held. As events are enabled pages are > > > > > accessed. We do not want page faults to occur under this lock. Instead > > > > > queue the fault to a workqueue to be handled in a process context safe > > > > > way without the lock. > > > > > > > > > > The enable address is disabled while the async fault-in occurs. This > > > > > ensures that we don't attempt fault-in more than is necessary. Once the > > > > > page has been faulted in, the address write is attempted again. If the > > > > > page couldn't fault-in, then we wait until the next time the event is > > > > > enabled to prevent any potential infinite loops. > > > > > > > > I'm also unclear about how the system call initiating the enabled state > > > > change is delayed (or not) when a page fault is queued. > > > > > > > > > > It's not, if needed we could call schedule_delayed_work(). However, I > > > don't think we need it. When pin_user_pages_remote is invoked, it's with > > > FOLL_NOFAULT. This will tell us if we need to fault pages in, we then > > > call fixup_user_fault with unlocked value passed. This will cause the > > > fixup to retry and get the page in. 
> > > > > > It's called out in the comments for this exact purpose (lucked out > > > here): > > > mm/gup.c > > > * This is meant to be called in the specific scenario where for > > > locking reasons > > > * we try to access user memory in atomic context (within a > > > pagefault_disable() > > > * section), this returns -EFAULT, and we want to resolve the user > > > fault before > > > * trying again. > > > > > > The fault in happens in a workqueue, this is the same way KVM does it's > > > async page fault logic, so it's not a new pattern. As soon as the > > > fault-in is done, we have a page we should be able to use, so we > > > re-attempt the write immediately. If the write fails, another queue > > > happens and we could loop, but not like the unmap() case I replied with > > > earlier. > > > > > > > I would expect that when a page fault is needed, after enqueuing > > > > work to the > > > > worker thread, the system call initiating the state change would somehow > > > > wait for a completion (after releasing the user events mutex). That > > > > completion would be signaled by the worker thread either if the > > > > page fault > > > > fails, or if the state change is done. > > > > > > > > > > I didn't see a generic fault-in + notify anywhere, do you know of one I > > > could use? Otherwise, it seems the pattern used is check fault, fault-in > > > via workqueue, re-attempt. > > > > I don't think we're talking about the same thing. I'll try stating my > > concern in a different way. > > > > user_event_enabler_write() calls user_event_enabler_queue_fault() when > > it cannot perform the enabled state update immediately (because a page > > fault is needed). > > > > But then AFAIU it returns immediately to the caller. The caller could > > very well expect that the event has been enabled, as requested, > > immediately when the enabler write returns. 
The fact that enabling the > > event can be delayed for an arbitrary amount of time due to page faults > > means that we have no hard guarantee that the event is enabled as > > requested upon return to the caller. > > > > I'd like to add a completion there, to be waited for after > > user_event_enabler_queue_fault(), but before returning from > > user_event_enabler_create(). Waiting for the completion should be done > > without any mutexes held, so after releasing event_mutex. > > > > The completion would be placed on the stack of > > user_event_enabler_create(), and a reference to the completion would be > > placed in the queued fault request. The completion notification would be > > emitted by the worker thread either when enabling is done, or if a page > > fault fails. > > > > See include/linux/completion.h > > > > Thoughts ? > > Actually, after further thinking, I wonder if we need a worker thread at all > to perform the page faults. > > Could we simply handle the page faults from user_event_enabler_create() by > releasing the mutex and re-trying ? From what contexts is > user_event_enabler_create() called ? (any locks taken ? system call context > ? irqs and preemption enabled or disabled ?) > The create case is in process context, via the reg IOCTL on the user_events_data file. The only lock is the register lock, the event_mutex is taken within the user_event_enabler_create() call to ensure the enabler can safely be added to the shared user_event within the group. Shouldn't be a problem running it synchronously again or reporting back a fault happened and letting the user call decide what to do. > Thanks, > > Mathieu > [..] > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com Thanks, -Beau ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH 2/2] tracing/user_events: Fixup enable faults asyncly 2022-10-29 14:40 ` Mathieu Desnoyers 2022-10-30 11:45 ` Mathieu Desnoyers @ 2022-10-31 17:12 ` Beau Belgrave 1 sibling, 0 replies; 31+ messages in thread From: Beau Belgrave @ 2022-10-31 17:12 UTC (permalink / raw) To: Mathieu Desnoyers Cc: rostedt, mhiramat, dcook, alanau, linux-trace-devel, linux-kernel On Sat, Oct 29, 2022 at 10:40:02AM -0400, Mathieu Desnoyers wrote: > On 2022-10-28 18:42, Beau Belgrave wrote: > > On Fri, Oct 28, 2022 at 06:19:10PM -0400, Mathieu Desnoyers wrote: > > > On 2022-10-27 18:40, Beau Belgrave wrote: > > > > When events are enabled within the various tracing facilities, such as > > > > ftrace/perf, the event_mutex is held. As events are enabled pages are > > > > accessed. We do not want page faults to occur under this lock. Instead > > > > queue the fault to a workqueue to be handled in a process context safe > > > > way without the lock. > > > > > > > > The enable address is disabled while the async fault-in occurs. This > > > > ensures that we don't attempt fault-in more than is necessary. Once the > > > > page has been faulted in, the address write is attempted again. If the > > > > page couldn't fault-in, then we wait until the next time the event is > > > > enabled to prevent any potential infinite loops. > > > > > > I'm also unclear about how the system call initiating the enabled state > > > change is delayed (or not) when a page fault is queued. > > > > > > > It's not, if needed we could call schedule_delayed_work(). However, I > > don't think we need it. When pin_user_pages_remote is invoked, it's with > > FOLL_NOFAULT. This will tell us if we need to fault pages in, we then > > call fixup_user_fault with unlocked value passed. This will cause the > > fixup to retry and get the page in. 
> > > > It's called out in the comments for this exact purpose (lucked out > > here): > > mm/gup.c > > * This is meant to be called in the specific scenario where for locking reasons > > * we try to access user memory in atomic context (within a pagefault_disable() > > * section), this returns -EFAULT, and we want to resolve the user fault before > > * trying again. > > > > The fault in happens in a workqueue, this is the same way KVM does it's > > async page fault logic, so it's not a new pattern. As soon as the > > fault-in is done, we have a page we should be able to use, so we > > re-attempt the write immediately. If the write fails, another queue > > happens and we could loop, but not like the unmap() case I replied with > > earlier. > > > > > I would expect that when a page fault is needed, after enqueuing work to the > > > worker thread, the system call initiating the state change would somehow > > > wait for a completion (after releasing the user events mutex). That > > > completion would be signaled by the worker thread either if the page fault > > > fails, or if the state change is done. > > > > > > > I didn't see a generic fault-in + notify anywhere, do you know of one I > > could use? Otherwise, it seems the pattern used is check fault, fault-in > > via workqueue, re-attempt. > > I don't think we're talking about the same thing. I'll try stating my > concern in a different way. > > user_event_enabler_write() calls user_event_enabler_queue_fault() when it > cannot perform the enabled state update immediately (because a page fault is > needed). > > But then AFAIU it returns immediately to the caller. The caller could very > well expect that the event has been enabled, as requested, immediately when > the enabler write returns. The fact that enabling the event can be delayed > for an arbitrary amount of time due to page faults means that we have no > hard guarantee that the event is enabled as requested upon return to the > caller. 
> Yeah, sorry, I misread your statement. If the user registers an event and user_event_enabler_write() is invoked at that point the enabler is registered and will update the event as it changes, even though the initial write fails (it will try again repeatedly until a terminal state of faulting is reached). I agree though, if the initial data is faulted out at that point, and it has an error faulting in, the user doesn't know that (although it appears the only time this would fail is if someone did something silly, malicous, or the device is out of memory). They likely want to know if it's the silly/forgetful case. > I'd like to add a completion there, to be waited for after > user_event_enabler_queue_fault(), but before returning from > user_event_enabler_create(). Waiting for the completion should be done > without any mutexes held, so after releasing event_mutex. > > The completion would be placed on the stack of user_event_enabler_create(), > and a reference to the completion would be placed in the queued fault > request. The completion notification would be emitted by the worker thread > either when enabling is done, or if a page fault fails. > > See include/linux/completion.h > > Thoughts ? > If the case you are worried about is just the initial register, then I can do this synchronously outside of the lock. I believe you had the same thought later in the day. The main scenario I am worried about is that we have say 20 processes that share the same event. They have already been registered and they are running. Then a few of them have page faults when perf/ftrace enable the event and the register callback is invoked (which triggers a user write to the enablement address/bit). If most of these processes don't page fault, I don't want them to wait on the account of 1 or 2 bad apples. If they all page fault, I don't want them to hold up the other parts the system that require event_mutex (since it is held by the register callback caller). 
So these should be as fast as possible while the event_mutex is being held. > Thanks, > > Mathieu > > > > > > > Thoughts ? > > > > > > Thanks, > > > > > > Mathieu > > > > > > > Thanks, > > -Beau > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com Thanks, -Beau ^ permalink raw reply [flat|nested] 31+ messages in thread
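For reference, the kernel primitive Mathieu is proposing above lives in include/linux/completion.h (DECLARE_COMPLETION_ONSTACK(), complete(), wait_for_completion()). A rough userspace sketch of the proposed shape, using pthreads to stand in for the kernel API (all names here are hypothetical, not from the patches): the register path queues the fault work, drops the big lock, then waits on a stack-allocated completion that the worker signals with the final status.

```c
#include <pthread.h>
#include <stdbool.h>

/* Userspace stand-in for a kernel completion declared on the stack. */
struct completion {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            done;
    int             result;  /* 0 on success, -EFAULT-style code on failure */
};

#define COMPLETION_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false, 0 }

static void complete(struct completion *c, int result)
{
    pthread_mutex_lock(&c->lock);
    c->result = result;
    c->done = true;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

static int wait_for_completion(struct completion *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->done)
        pthread_cond_wait(&c->cond, &c->lock);
    pthread_mutex_unlock(&c->lock);
    return c->result;
}

/* Stand-in for the page-fault worker: it would fault the page in, retry
 * the enable-bit write, then signal whoever queued the request. */
static void *fault_worker(void *arg)
{
    complete(arg, 0);  /* pretend the fault-in and retried write succeeded */
    return NULL;
}

/* The register path: queue the fault work, then wait outside the lock. */
static int queue_fault_and_wait(void)
{
    struct completion done = COMPLETION_INIT;
    pthread_t worker;

    if (pthread_create(&worker, NULL, fault_worker, &done) != 0)
        return -1;
    int ret = wait_for_completion(&done);
    pthread_join(&worker, NULL);
    return ret;
}
```

The trade-off Beau raises still applies to this shape: any caller that waits like this is serialized behind the slowest faulting process unless the wait happens after event_mutex is released.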
* Re: [RFC PATCH 0/2] tracing/user_events: Remote write ABI 2022-10-27 22:40 [RFC PATCH 0/2] tracing/user_events: Remote write ABI Beau Belgrave 2022-10-27 22:40 ` [RFC PATCH 1/2] tracing/user_events: Use remote writes for event enablement Beau Belgrave 2022-10-27 22:40 ` [RFC PATCH 2/2] tracing/user_events: Fixup enable faults asyncly Beau Belgrave @ 2022-10-28 21:50 ` Mathieu Desnoyers 2022-10-28 22:17 ` Beau Belgrave 2022-10-31 14:15 ` Masami Hiramatsu 3 siblings, 1 reply; 31+ messages in thread From: Mathieu Desnoyers @ 2022-10-28 21:50 UTC (permalink / raw) To: Beau Belgrave, rostedt, mhiramat, dcook, alanau Cc: linux-trace-devel, linux-kernel On 2022-10-27 18:40, Beau Belgrave wrote: > As part of the discussions for user_events aligned with user space > tracers, it was determined that user programs should register a 32-bit > value to set or clear a bit when an event becomes enabled. Currently a > shared page is being used that requires mmap(). > > In this new model during the event registration from user programs 2 new > values are specified. The first is the address to update when the event > is either enabled or disabled. The second is the bit to set/clear to > reflect the event being enabled. This allows for a local 32-bit value in > user programs to support both kernel and user tracers. As an example, > setting bit 31 for kernel tracers when the event becomes enabled allows > for user tracers to use the other bits for ref counts or other flags. > The kernel side updates the bit atomically, user programs need to also > update these values atomically. Nice! > > User provided addresses must be aligned on a 32-bit boundary, this > allows for single page checking and prevents odd behaviors such as a > 32-bit value straddling 2 pages instead of a single page. > > When page faults are encountered they are done asyncly via a workqueue. > If the page faults back in, the write update is attempted again. 
If the > page cannot fault-in, then we log and wait until the next time the event > is enabled/disabled. This is to prevent possible infinite loops resulting > from bad user processes unmapping or changing protection values after > registering the address. I'll have a close look at this workqueue page fault scheme, probably next week. > > NOTE: > User programs that wish to have the enable bit shared across forks > either need to use a MAP_SHARED allocated address or register a new > address and file descriptor. If MAP_SHARED cannot be used or new > registrations cannot be done, then it's allowable to use MAP_PRIVATE > as long as the forked children never update the page themselves. Once > the page has been updated, the page from the parent will be copied over > to the child. This new copy-on-write page will not receive updates from > the kernel until another registration has been performed with this new > address. This seems rather odd. I would expect that if a parent process registers some instrumentation using private mappings for enabled state through the user events ioctl, and then forks, the child process would seamlessly be traced by the user events ABI while being able to also change the enabled state from the userspace tracer libraries (which would trigger COW). Requiring the child to re-register to user events is rather odd. What is preventing us from tracing the child without re-registration in this scenario ? Thanks, Mathieu > > Beau Belgrave (2): > tracing/user_events: Use remote writes for event enablement > tracing/user_events: Fixup enable faults asyncly > > include/linux/user_events.h | 10 +- > kernel/trace/trace_events_user.c | 396 ++++++++++++++++++++----------- > 2 files changed, 270 insertions(+), 136 deletions(-) > > > base-commit: 23758867219c8d84c8363316e6dd2f9fd7ae3049 -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 31+ messages in thread
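The bit split described in the cover letter above (the kernel atomically sets/clears one bit, e.g. bit 31, while user tracers use the remaining bits for ref counts or flags) can be sketched with C11 atomics. This is an illustrative sketch, not code from the patches; the function names are made up:

```c
#include <stdatomic.h>
#include <stdint.h>

#define KERNEL_ENABLE_BIT  31
#define KERNEL_ENABLE_MASK (UINT32_C(1) << KERNEL_ENABLE_BIT)

/* What the kernel side would do, conceptually: touch only bit 31. */
static void kernel_set_enabled(_Atomic uint32_t *word, int on)
{
    if (on)
        atomic_fetch_or(word, KERNEL_ENABLE_MASK);
    else
        atomic_fetch_and(word, ~KERNEL_ENABLE_MASK);
}

/* User tracers use the remaining bits, e.g. as a reference count. */
static void user_tracer_get(_Atomic uint32_t *word)
{
    atomic_fetch_add(word, 1);
}

static void user_tracer_put(_Atomic uint32_t *word)
{
    atomic_fetch_sub(word, 1);
}

/* Fast path in the instrumented program: emit if any tracer wants it. */
static int event_enabled(_Atomic uint32_t *word)
{
    return atomic_load(word) != 0;
}
```

The instrumented fast path then needs only a single atomic load and a non-zero test, no matter how many kernel or user tracers are attached.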
* Re: [RFC PATCH 0/2] tracing/user_events: Remote write ABI 2022-10-28 21:50 ` [RFC PATCH 0/2] tracing/user_events: Remote write ABI Mathieu Desnoyers @ 2022-10-28 22:17 ` Beau Belgrave 2022-10-29 13:58 ` Mathieu Desnoyers 0 siblings, 1 reply; 31+ messages in thread From: Beau Belgrave @ 2022-10-28 22:17 UTC (permalink / raw) To: Mathieu Desnoyers Cc: rostedt, mhiramat, dcook, alanau, linux-trace-devel, linux-kernel On Fri, Oct 28, 2022 at 05:50:04PM -0400, Mathieu Desnoyers wrote: > On 2022-10-27 18:40, Beau Belgrave wrote: > > As part of the discussions for user_events aligned with user space > > tracers, it was determined that user programs should register a 32-bit > > value to set or clear a bit when an event becomes enabled. Currently a > > shared page is being used that requires mmap(). > > > > In this new model during the event registration from user programs 2 new > > values are specified. The first is the address to update when the event > > is either enabled or disabled. The second is the bit to set/clear to > > reflect the event being enabled. This allows for a local 32-bit value in > > user programs to support both kernel and user tracers. As an example, > > setting bit 31 for kernel tracers when the event becomes enabled allows > > for user tracers to use the other bits for ref counts or other flags. > > The kernel side updates the bit atomically, user programs need to also > > update these values atomically. > > Nice! > > > > > User provided addresses must be aligned on a 32-bit boundary, this > > allows for single page checking and prevents odd behaviors such as a > > 32-bit value straddling 2 pages instead of a single page. > > > > When page faults are encountered they are done asyncly via a workqueue. > > If the page faults back in, the write update is attempted again. If the > > page cannot fault-in, then we log and wait until the next time the event > > is enabled/disabled. 
This is to prevent possible infinite loops resulting > > from bad user processes unmapping or changing protection values after > > registering the address. > > I'll have a close look at this workqueue page fault scheme, probably next > week. > Excellent. > > > > NOTE: > > User programs that wish to have the enable bit shared across forks > > either need to use a MAP_SHARED allocated address or register a new > > address and file descriptor. If MAP_SHARED cannot be used or new > > registrations cannot be done, then it's allowable to use MAP_PRIVATE > > as long as the forked children never update the page themselves. Once > > the page has been updated, the page from the parent will be copied over > > to the child. This new copy-on-write page will not receive updates from > > the kernel until another registration has been performed with this new > > address. > > This seems rather odd. I would expect that if a parent process registers > some instrumentation using private mappings for enabled state through the > user events ioctl, and then forks, the child process would seamlessly be > traced by the user events ABI while being able to also change the enabled > state from the userspace tracer libraries (which would trigger COW). > Requiring the child to re-register to user events is rather odd. > It's the COW that is the problem, see below. > What is preventing us from tracing the child without re-registration in this > scenario ? > Largely knowing when the COW occurs on a specific page. We don't make the mappings, so I'm unsure if we can ask to be notified easily during these times or not. If we could, that would solve this. I'm glad you are thinking about this. The note here was exactly to trigger this discussion :) I believe this is the same as a Futex, I'll take another look at that code to see if they've come up with anything regarding this. Any ideas? Thanks, -Beau ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH 0/2] tracing/user_events: Remote write ABI 2022-10-28 22:17 ` Beau Belgrave @ 2022-10-29 13:58 ` Mathieu Desnoyers 2022-10-31 16:53 ` Beau Belgrave 0 siblings, 1 reply; 31+ messages in thread From: Mathieu Desnoyers @ 2022-10-29 13:58 UTC (permalink / raw) To: Beau Belgrave Cc: rostedt, mhiramat, dcook, alanau, linux-trace-devel, linux-kernel On 2022-10-28 18:17, Beau Belgrave wrote: > On Fri, Oct 28, 2022 at 05:50:04PM -0400, Mathieu Desnoyers wrote: >> On 2022-10-27 18:40, Beau Belgrave wrote: [...] > >>> >>> NOTE: >>> User programs that wish to have the enable bit shared across forks >>> either need to use a MAP_SHARED allocated address or register a new >>> address and file descriptor. If MAP_SHARED cannot be used or new >>> registrations cannot be done, then it's allowable to use MAP_PRIVATE >>> as long as the forked children never update the page themselves. Once >>> the page has been updated, the page from the parent will be copied over >>> to the child. This new copy-on-write page will not receive updates from >>> the kernel until another registration has been performed with this new >>> address. >> >> This seems rather odd. I would expect that if a parent process registers >> some instrumentation using private mappings for enabled state through the >> user events ioctl, and then forks, the child process would seamlessly be >> traced by the user events ABI while being able to also change the enabled >> state from the userspace tracer libraries (which would trigger COW). >> Requiring the child to re-register to user events is rather odd. >> > > It's the COW that is the problem, see below. > >> What is preventing us from tracing the child without re-registration in this >> scenario ? >> > > Largely knowing when the COW occurs on a specific page. We don't make > the mappings, so I'm unsure if we can ask to be notified easily during > these times or not. If we could, that would solve this. I'm glad you are > thinking about this. 
The note here was exactly to trigger this > discussion :) > > I believe this is the same as a Futex, I'll take another look at that > code to see if they've come up with anything regarding this. > > Any ideas? Based on your description of the symptoms, AFAIU, upon registration of a given user event associated with a mm_struct, the user events ioctl appears to translate the virtual address into a page pointer immediately, and keeps track of that page afterwards. This means it loses track of the page when COW occurs. Why not keep track of the registered virtual address and mm_struct associated with the event rather than the page ? Whenever a state change is needed, the virtual-address-to-page translation will be performed again. If it follows a COW, it will get the new copied page. If it happens that no COW was done, it should map to the original page. If the mapping is shared, the kernel would update that shared page. If the mapping is private, then the kernel would COW the page before updating it. Thoughts ? Thanks, Mathieu > > Thanks, > -Beau -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH 0/2] tracing/user_events: Remote write ABI 2022-10-29 13:58 ` Mathieu Desnoyers @ 2022-10-31 16:53 ` Beau Belgrave 2022-11-02 13:46 ` Mathieu Desnoyers 0 siblings, 1 reply; 31+ messages in thread From: Beau Belgrave @ 2022-10-31 16:53 UTC (permalink / raw) To: Mathieu Desnoyers Cc: rostedt, mhiramat, dcook, alanau, linux-trace-devel, linux-kernel On Sat, Oct 29, 2022 at 09:58:26AM -0400, Mathieu Desnoyers wrote: > On 2022-10-28 18:17, Beau Belgrave wrote: > > On Fri, Oct 28, 2022 at 05:50:04PM -0400, Mathieu Desnoyers wrote: > > > On 2022-10-27 18:40, Beau Belgrave wrote: > > [...] > > > > > > > > > > NOTE: > > > > User programs that wish to have the enable bit shared across forks > > > > either need to use a MAP_SHARED allocated address or register a new > > > > address and file descriptor. If MAP_SHARED cannot be used or new > > > > registrations cannot be done, then it's allowable to use MAP_PRIVATE > > > > as long as the forked children never update the page themselves. Once > > > > the page has been updated, the page from the parent will be copied over > > > > to the child. This new copy-on-write page will not receive updates from > > > > the kernel until another registration has been performed with this new > > > > address. > > > > > > This seems rather odd. I would expect that if a parent process registers > > > some instrumentation using private mappings for enabled state through the > > > user events ioctl, and then forks, the child process would seamlessly be > > > traced by the user events ABI while being able to also change the enabled > > > state from the userspace tracer libraries (which would trigger COW). > > > Requiring the child to re-register to user events is rather odd. > > > > > > > It's the COW that is the problem, see below. > > > > > What is preventing us from tracing the child without re-registration in this > > > scenario ? > > > > > > > Largely knowing when the COW occurs on a specific page. 
We don't make > > the mappings, so I'm unsure if we can ask to be notified easily during > > these times or not. If we could, that would solve this. I'm glad you are > > thinking about this. The note here was exactly to trigger this > > discussion :) > > > > I believe this is the same as a Futex, I'll take another look at that > > code to see if they've come up with anything regarding this. > > > > Any ideas? > > Based on your description of the symptoms, AFAIU, upon registration of a > given user event associated with a mm_struct, the user events ioctl appears > to translates the virtual address into a page pointer immediately, and keeps > track of that page afterwards. This means it loses track of the page when > COW occurs. > No, we keep the memory descriptor and virtual address so we can properly resolve to page per-process. > Why not keep track of the registered virtual address and struct_mm > associated with the event rather than the page ? Whenever a state change is > needed, the virtual-address-to-page translation will be performed again. If > it follows a COW, it will get the new copied page. If it happens that no COW > was done, it should map to the original page. If the mapping is shared, the > kernel would update that shared page. If the mapping is private, then the > kernel would COW the page before updating it. > > Thoughts ? > I think you are forgetting about page table entries. My understanding is the process will have the VMAs copied on fork, but the page table entries will be marked read-only. Then when the write access occurs, the COW is created (since the PTE says readonly, but the VMA says writable). However, that COW page is now only mapped within that forked process page table. This requires tracking the child memory descriptors in addition to the parent. The most straightforward way I see this happening is requiring user side to mmap the user_event_data fd that is used for write. 
This way when fork occurs in dup_mm() / dup_mmap() that mmap'd user_event_data will get open() / close() called per-fork. I could then copy the enablers from the parent but with the child's memory descriptor to allow proper lookup. This is like fork before COW, it's a bummer I cannot see a way to do this per-page. Doing the above would work, but it requires copying all the enablers, not just the one that changed after the fork. > Thanks, > > Mathieu > > > > > Thanks, > > -Beau > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com Thanks, -Beau ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH 0/2] tracing/user_events: Remote write ABI 2022-10-31 16:53 ` Beau Belgrave @ 2022-11-02 13:46 ` Mathieu Desnoyers 2022-11-02 17:18 ` Beau Belgrave 0 siblings, 1 reply; 31+ messages in thread From: Mathieu Desnoyers @ 2022-11-02 13:46 UTC (permalink / raw) To: Beau Belgrave Cc: rostedt, mhiramat, dcook, alanau, linux-trace-devel, linux-kernel On 2022-10-31 12:53, Beau Belgrave wrote: > On Sat, Oct 29, 2022 at 09:58:26AM -0400, Mathieu Desnoyers wrote: >> On 2022-10-28 18:17, Beau Belgrave wrote: >>> On Fri, Oct 28, 2022 at 05:50:04PM -0400, Mathieu Desnoyers wrote: >>>> On 2022-10-27 18:40, Beau Belgrave wrote: >> >> [...] >>> >>>>> >>>>> NOTE: >>>>> User programs that wish to have the enable bit shared across forks >>>>> either need to use a MAP_SHARED allocated address or register a new >>>>> address and file descriptor. If MAP_SHARED cannot be used or new >>>>> registrations cannot be done, then it's allowable to use MAP_PRIVATE >>>>> as long as the forked children never update the page themselves. Once >>>>> the page has been updated, the page from the parent will be copied over >>>>> to the child. This new copy-on-write page will not receive updates from >>>>> the kernel until another registration has been performed with this new >>>>> address. >>>> >>>> This seems rather odd. I would expect that if a parent process registers >>>> some instrumentation using private mappings for enabled state through the >>>> user events ioctl, and then forks, the child process would seamlessly be >>>> traced by the user events ABI while being able to also change the enabled >>>> state from the userspace tracer libraries (which would trigger COW). >>>> Requiring the child to re-register to user events is rather odd. >>>> >>> >>> It's the COW that is the problem, see below. >>> >>>> What is preventing us from tracing the child without re-registration in this >>>> scenario ? >>>> >>> >>> Largely knowing when the COW occurs on a specific page. 
We don't make >>> the mappings, so I'm unsure if we can ask to be notified easily during >>> these times or not. If we could, that would solve this. I'm glad you are >>> thinking about this. The note here was exactly to trigger this >>> discussion :) >>> >>> I believe this is the same as a Futex, I'll take another look at that >>> code to see if they've come up with anything regarding this. >>> >>> Any ideas? >> >> Based on your description of the symptoms, AFAIU, upon registration of a >> given user event associated with a mm_struct, the user events ioctl appears >> to translates the virtual address into a page pointer immediately, and keeps >> track of that page afterwards. This means it loses track of the page when >> COW occurs. >> > > No, we keep the memory descriptor and virtual address so we can properly > resolve to page per-process. > >> Why not keep track of the registered virtual address and struct_mm >> associated with the event rather than the page ? Whenever a state change is >> needed, the virtual-address-to-page translation will be performed again. If >> it follows a COW, it will get the new copied page. If it happens that no COW >> was done, it should map to the original page. If the mapping is shared, the >> kernel would update that shared page. If the mapping is private, then the >> kernel would COW the page before updating it. >> >> Thoughts ? >> > > I think you are forgetting about page table entries. My understanding is > the process will have the VMAs copied on fork, but the page table > entries will be marked read-only. Then when the write access occurs, the > COW is created (since the PTE says readonly, but the VMA says writable). > However, that COW page is now only mapped within that forked process > page table. > > This requires tracking the child memory descriptors in addition to the > parent. The most straightforward way I see this happening is requiring > user side to mmap the user_event_data fd that is used for write. 
This > > way when fork occurs in dup_mm() / dup_mmap() that mmap'd > > user_event_data will get open() / close() called per-fork. I could then > > copy the enablers from the parent but with the child's memory descriptor > > to allow proper lookup. > > > > This is like fork before COW, it's a bummer I cannot see a way to do > > this per-page. Doing the above would work, but it requires copying all > > the enablers, not just the one that changed after the fork. This brings an overall design concern I have with user-events: AFAIU, the lifetime of the user event registration appears to be linked to the lifetime of a file descriptor. What happens when that file descriptor is duplicated and sent over to another process through a unix socket ? Does it mean that the kernel has a handle on the wrong process to update the "enabled" state? Also, what happens on an execve system call if the file descriptor representing the user event is not marked as close-on-exec ? Does it mean the kernel can corrupt user-space memory of the after-exec loaded binary when it attempts to update the "enabled" state ? If I get this right, I suspect we might want to move the lifetime of the user event registration to the memory space (mm_struct). Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH 0/2] tracing/user_events: Remote write ABI 2022-11-02 13:46 ` Mathieu Desnoyers @ 2022-11-02 17:18 ` Beau Belgrave 0 siblings, 0 replies; 31+ messages in thread From: Beau Belgrave @ 2022-11-02 17:18 UTC (permalink / raw) To: Mathieu Desnoyers Cc: rostedt, mhiramat, dcook, alanau, linux-trace-devel, linux-kernel On Wed, Nov 02, 2022 at 09:46:31AM -0400, Mathieu Desnoyers wrote: > On 2022-10-31 12:53, Beau Belgrave wrote: > > On Sat, Oct 29, 2022 at 09:58:26AM -0400, Mathieu Desnoyers wrote: > > > On 2022-10-28 18:17, Beau Belgrave wrote: > > > > On Fri, Oct 28, 2022 at 05:50:04PM -0400, Mathieu Desnoyers wrote: > > > > > On 2022-10-27 18:40, Beau Belgrave wrote: > > > > > > [...] > > > > > > > > > > > > > > > > NOTE: > > > > > > User programs that wish to have the enable bit shared across forks > > > > > > either need to use a MAP_SHARED allocated address or register a new > > > > > > address and file descriptor. If MAP_SHARED cannot be used or new > > > > > > registrations cannot be done, then it's allowable to use MAP_PRIVATE > > > > > > as long as the forked children never update the page themselves. Once > > > > > > the page has been updated, the page from the parent will be copied over > > > > > > to the child. This new copy-on-write page will not receive updates from > > > > > > the kernel until another registration has been performed with this new > > > > > > address. > > > > > > > > > > This seems rather odd. I would expect that if a parent process registers > > > > > some instrumentation using private mappings for enabled state through the > > > > > user events ioctl, and then forks, the child process would seamlessly be > > > > > traced by the user events ABI while being able to also change the enabled > > > > > state from the userspace tracer libraries (which would trigger COW). > > > > > Requiring the child to re-register to user events is rather odd. > > > > > > > > > > > > > It's the COW that is the problem, see below. 
> > > > > > > > > What is preventing us from tracing the child without re-registration in this > > > > > scenario ? > > > > > > > > > > > > > Largely knowing when the COW occurs on a specific page. We don't make > > > > the mappings, so I'm unsure if we can ask to be notified easily during > > > > these times or not. If we could, that would solve this. I'm glad you are > > > > thinking about this. The note here was exactly to trigger this > > > > discussion :) > > > > > > > > I believe this is the same as a Futex, I'll take another look at that > > > > code to see if they've come up with anything regarding this. > > > > > > > > Any ideas? > > > > > > Based on your description of the symptoms, AFAIU, upon registration of a > > > given user event associated with a mm_struct, the user events ioctl appears > > > to translates the virtual address into a page pointer immediately, and keeps > > > track of that page afterwards. This means it loses track of the page when > > > COW occurs. > > > > > > > No, we keep the memory descriptor and virtual address so we can properly > > resolve to page per-process. > > > > > Why not keep track of the registered virtual address and struct_mm > > > associated with the event rather than the page ? Whenever a state change is > > > needed, the virtual-address-to-page translation will be performed again. If > > > it follows a COW, it will get the new copied page. If it happens that no COW > > > was done, it should map to the original page. If the mapping is shared, the > > > kernel would update that shared page. If the mapping is private, then the > > > kernel would COW the page before updating it. > > > > > > Thoughts ? > > > > > > > I think you are forgetting about page table entries. My understanding is > > the process will have the VMAs copied on fork, but the page table > > entries will be marked read-only. Then when the write access occurs, the > > COW is created (since the PTE says readonly, but the VMA says writable). 
> > However, that COW page is now only mapped within that forked process > > page table. > > > > This requires tracking the child memory descriptors in addition to the > > parent. The most straightforward way I see this happening is requiring > > user side to mmap the user_event_data fd that is used for write. This > > way when fork occurs in dup_mm() / dup_mmap() that mmap'd > > user_event_data will get open() / close() called per-fork. I could then > > copy the enablers from the parent but with the child's memory descriptor > > to allow proper lookup. > > > > This is like fork before COW, it's a bummer I cannot see a way to do > > this per-page. Doing the above would work, but it requires copying all > > the enablers, not just the one that changed after the fork. > > This brings an overall design concern I have with user-events: AFAIU, the > lifetime of the user event registration appears to be linked to the lifetime > of a file descriptor. > The lifetime of the user_event is linked to the lifetime of the tracepoint. The tracepoint stays alive until someone explicitly tries to delete it via the del IOCTL. If the delete is attempted and there are references out to that user_event, either via perf/ftrace or any open files or mm_stucts it will not be allowed to go away. The user_event does not go away automatically upon file release (last close). However, when that file goes away, obviously the caller no longer can write to it. This is why there are user_events within the group, and then there are per-file user_event_refs. It allows for tracking these lifetimes and writes in isolation. > What happens when that file descriptor is duplicated and send over to > another process through unix sockets credentials ? Does it mean that the > kernel have a handle on the wrong process to update the "enabled" state? 
> You'll have to expand upon this more: if the FD is duplicated and installed into another process, then the "enabled" state is still at whatever mm_struct registered it. If that new process wants to have an enabled state, it must register in its own process if it wasn't from a fork. The mm_struct can only change pages in that process, it cannot jump across to another process. This is why the fork case needs an enabler clone with a new mm_struct, and why if it gets duplicated in an odd way into another process, that process must register its own enabler states. > Also, what happens on execve system call if the file descriptor representing > the user event is not marked as close-on-exec ? Does it mean the kernel can > corrupt user-space memory of the after-exec loaded binary when it attempts > to update the "enabled" state ? > I believe it could if the memory descriptor remains; callers should mark it close-on-exec to prevent this. None of this was a problem with the old ABI :) For clarity, since I cannot tell: are you advocating for a different approach here, or just calling out that we need to add guidance so user space programs do the right thing? > If I get this right, I suspect we might want to move the lifetime of the > user event registration to the memory space (mm_struct). > If that file or mmap I proposed stays open, the enablers will stay open. The enabler keeps the mm_struct alive, not the other way around. I'm not sure I follow under what condition you'd have an enabler / user_event around with a mm_struct that has gone away. The enabler keeps the mm_struct alive until the enabler goes away to prevent any funny business. That appears to fit more in line with what others have done in the kernel than trying to tie the enabler to the mm_struct lifetime. If a memory descriptor goes away, then its FDs should close (Is there ever a case this is not true?).
If the FDs go away, the enablers close down; if the enablers close down, the user_event refs drop to 0. The user_event can then be deleted via an explicit IOCTL, and the delete will only succeed if at that point in time the ref count is still 0. > Thanks, > > Mathieu > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com Thanks, -Beau ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH 0/2] tracing/user_events: Remote write ABI 2022-10-27 22:40 [RFC PATCH 0/2] tracing/user_events: Remote write ABI Beau Belgrave ` (2 preceding siblings ...) 2022-10-28 21:50 ` [RFC PATCH 0/2] tracing/user_events: Remote write ABI Mathieu Desnoyers @ 2022-10-31 14:15 ` Masami Hiramatsu 2022-10-31 15:27 ` Mathieu Desnoyers 2022-10-31 17:27 ` Beau Belgrave 3 siblings, 2 replies; 31+ messages in thread From: Masami Hiramatsu @ 2022-10-31 14:15 UTC (permalink / raw) To: Beau Belgrave Cc: rostedt, mathieu.desnoyers, dcook, alanau, linux-trace-devel, linux-kernel Hi Beau, On Thu, 27 Oct 2022 15:40:09 -0700 Beau Belgrave <beaub@linux.microsoft.com> wrote: > As part of the discussions for user_events aligned with user space > tracers, it was determined that user programs should register a 32-bit > value to set or clear a bit when an event becomes enabled. Currently a > shared page is being used that requires mmap(). > > In this new model during the event registration from user programs 2 new > values are specified. The first is the address to update when the event > is either enabled or disabled. The second is the bit to set/clear to > reflect the event being enabled. This allows for a local 32-bit value in > user programs to support both kernel and user tracers. As an example, > setting bit 31 for kernel tracers when the event becomes enabled allows > for user tracers to use the other bits for ref counts or other flags. > The kernel side updates the bit atomically, user programs need to also > update these values atomically. I think you mean the kernel tracer (ftrace/perf) and user tracers (e.g. LTTng) use the same 32-bit data, so that the traced user application only checks that data to see whether an event is enabled, right? If so, which of the user tracer threads updates the data bit? Is it thread-safe for both the kernel tracer and user tracers to update it at the same time? And what is the actual advantage of this change? Are there any issues with using the mmap'd page?
I would like to know more of the background of this change. Could you also provide a sample program which I can play with? :) > User provided addresses must be aligned on a 32-bit boundary, this > allows for single page checking and prevents odd behaviors such as a > 32-bit value straddling 2 pages instead of a single page. > > When page faults are encountered they are done asyncly via a workqueue. > If the page faults back in, the write update is attempted again. If the > page cannot fault-in, then we log and wait until the next time the event > is enabled/disabled. This is to prevent possible infinite loops resulting > from bad user processes unmapping or changing protection values after > registering the address. > > NOTE: > User programs that wish to have the enable bit shared across forks > either need to use a MAP_SHARED allocated address or register a new > address and file descriptor. If MAP_SHARED cannot be used or new > registrations cannot be done, then it's allowable to use MAP_PRIVATE > as long as the forked children never update the page themselves. Once > the page has been updated, the page from the parent will be copied over > to the child. This new copy-on-write page will not receive updates from > the kernel until another registration has been performed with this new > address. > > Beau Belgrave (2): > tracing/user_events: Use remote writes for event enablement > tracing/user_events: Fixup enable faults asyncly > > include/linux/user_events.h | 10 +- > kernel/trace/trace_events_user.c | 396 ++++++++++++++++++++----------- > 2 files changed, 270 insertions(+), 136 deletions(-) > > > base-commit: 23758867219c8d84c8363316e6dd2f9fd7ae3049 > -- > 2.25.1 > -- Masami Hiramatsu (Google) <mhiramat@kernel.org> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH 0/2] tracing/user_events: Remote write ABI 2022-10-31 14:15 ` Masami Hiramatsu @ 2022-10-31 15:27 ` Mathieu Desnoyers 2022-10-31 17:27 ` Beau Belgrave 1 sibling, 0 replies; 31+ messages in thread From: Mathieu Desnoyers @ 2022-10-31 15:27 UTC (permalink / raw) To: Masami Hiramatsu (Google), Beau Belgrave Cc: rostedt, dcook, alanau, linux-trace-devel, linux-kernel On 2022-10-31 10:15, Masami Hiramatsu (Google) wrote: > Hi Beau, > > On Thu, 27 Oct 2022 15:40:09 -0700 > Beau Belgrave <beaub@linux.microsoft.com> wrote: > >> As part of the discussions for user_events aligned with user space >> tracers, it was determined that user programs should register a 32-bit >> value to set or clear a bit when an event becomes enabled. Currently a >> shared page is being used that requires mmap(). >> >> In this new model during the event registration from user programs 2 new >> values are specified. The first is the address to update when the event >> is either enabled or disabled. The second is the bit to set/clear to >> reflect the event being enabled. This allows for a local 32-bit value in >> user programs to support both kernel and user tracers. As an example, >> setting bit 31 for kernel tracers when the event becomes enabled allows >> for user tracers to use the other bits for ref counts or other flags. >> The kernel side updates the bit atomically, user programs need to also >> update these values atomically. > > I think you means the kernel tracer (ftrace/perf) and user tracers (e.g. > LTTng) use the same 32bit data so that traced user-application only checks > that data for checking an event is enabled, right? > > If so, who the user tracer threads updates the data bit? Is that thread > safe to update both kernel tracer and user tracers at the same time? Yes, my plan is to have userspace tracer agent threads use atomic increments/decrements to update the "enabled" state, and the kernel use atomic bit set/clear to update the top bit. 
This should allow the state to be updated concurrently without issues. > > And what is the actual advantage of this change? Are there any issue > to use mmaped page? I would like to know more background of this > change. With this change we can allow a user-space process to manage userspace tracing on its own, without any kernel support. Registering to user events becomes entirely optional. So if a kernel does not provide user events (either an old kernel or a kernel with CONFIG_USER_EVENTS=n), userspace tracing still works. This also allows user-space tracers to co-exist with the user events ABI. > > Could you also provide any sample program which I can play it? :) I've been working on a user-space static instrumentation library in the recent weeks. I've left "TODO" items for integration with user events ioctl/writev in the userspace code. See https://github.com/compudj/side There is now a build dependency on librseq to provide fast RCU read-side to iterate on the array of userspace tracer callbacks: https://github.com/compudj/librseq (this dependency could be made optional in the future) I know Doug is working on his own private repository for userspace instrumentation, and we share a lot of common goals. Thanks, Mathieu > >> User provided addresses must be aligned on a 32-bit boundary, this >> allows for single page checking and prevents odd behaviors such as a >> 32-bit value straddling 2 pages instead of a single page. >> >> When page faults are encountered they are done asyncly via a workqueue. >> If the page faults back in, the write update is attempted again. If the >> page cannot fault-in, then we log and wait until the next time the event >> is enabled/disabled. This is to prevent possible infinite loops resulting >> from bad user processes unmapping or changing protection values after >> registering the address. 
>> >> NOTE: >> User programs that wish to have the enable bit shared across forks >> either need to use a MAP_SHARED allocated address or register a new >> address and file descriptor. If MAP_SHARED cannot be used or new >> registrations cannot be done, then it's allowable to use MAP_PRIVATE >> as long as the forked children never update the page themselves. Once >> the page has been updated, the page from the parent will be copied over >> to the child. This new copy-on-write page will not receive updates from >> the kernel until another registration has been performed with this new >> address. >> >> Beau Belgrave (2): >> tracing/user_events: Use remote writes for event enablement >> tracing/user_events: Fixup enable faults asyncly >> >> include/linux/user_events.h | 10 +- >> kernel/trace/trace_events_user.c | 396 ++++++++++++++++++++----------- >> 2 files changed, 270 insertions(+), 136 deletions(-) >> >> >> base-commit: 23758867219c8d84c8363316e6dd2f9fd7ae3049 >> -- >> 2.25.1 >> > > -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 31+ messages in thread
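[Editorial note: Mathieu's description of the shared word -- the kernel atomically setting/clearing bit 31 while user tracer agents atomically increment/decrement an enable refcount in the low bits -- can be sketched with C11 atomics. This is only an illustration of the scheme under discussion; in the proposed ABI the kernel performs the bit 31 update remotely into the registered address, and all names here are made up.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* One 32-bit word per event: bit 31 reserved for the kernel tracer,
 * bits 0-30 used by user-space tracers as an enable refcount. */
#define KERNEL_ENABLED_BIT (1u << 31)

static _Atomic uint32_t event_enabled;

/* Modeled here in user space; in the proposed ABI the kernel performs
 * this write remotely into the registered address. */
static void kernel_set_enabled(void)
{
	atomic_fetch_or(&event_enabled, KERNEL_ENABLED_BIT);
}

static void kernel_clear_enabled(void)
{
	atomic_fetch_and(&event_enabled, ~KERNEL_ENABLED_BIT);
}

/* User tracer agent threads increment/decrement the low-bit refcount. */
static void user_tracer_attach(void)
{
	atomic_fetch_add(&event_enabled, 1);
}

static void user_tracer_detach(void)
{
	atomic_fetch_sub(&event_enabled, 1);
}

/* The instrumented fast path needs only a single load: nonzero means
 * at least one tracer (kernel or user) is attached. */
static int event_is_enabled(void)
{
	return atomic_load_explicit(&event_enabled, memory_order_relaxed) != 0;
}
```

With this layout, enabling any combination of kernel and user tracers keeps the word nonzero, which is the single-check property being discussed.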
* Re: [RFC PATCH 0/2] tracing/user_events: Remote write ABI 2022-10-31 14:15 ` Masami Hiramatsu 2022-10-31 15:27 ` Mathieu Desnoyers @ 2022-10-31 17:27 ` Beau Belgrave 2022-10-31 18:25 ` Mathieu Desnoyers 2022-11-01 13:52 ` Masami Hiramatsu 1 sibling, 2 replies; 31+ messages in thread From: Beau Belgrave @ 2022-10-31 17:27 UTC (permalink / raw) To: Masami Hiramatsu Cc: rostedt, mathieu.desnoyers, dcook, alanau, linux-trace-devel, linux-kernel On Mon, Oct 31, 2022 at 11:15:56PM +0900, Masami Hiramatsu wrote: > Hi Beau, > > On Thu, 27 Oct 2022 15:40:09 -0700 > Beau Belgrave <beaub@linux.microsoft.com> wrote: > > > As part of the discussions for user_events aligned with user space > > tracers, it was determined that user programs should register a 32-bit > > value to set or clear a bit when an event becomes enabled. Currently a > > shared page is being used that requires mmap(). > > > > In this new model during the event registration from user programs 2 new > > values are specified. The first is the address to update when the event > > is either enabled or disabled. The second is the bit to set/clear to > > reflect the event being enabled. This allows for a local 32-bit value in > > user programs to support both kernel and user tracers. As an example, > > setting bit 31 for kernel tracers when the event becomes enabled allows > > for user tracers to use the other bits for ref counts or other flags. > > The kernel side updates the bit atomically, user programs need to also > > update these values atomically. > > I think you means the kernel tracer (ftrace/perf) and user tracers (e.g. > LTTng) use the same 32bit data so that traced user-application only checks > that data for checking an event is enabled, right? > Yes, exactly, user code can just check a single uint32 or uint64 to tell if anything is enabled (kernel or user tracer). > If so, who the user tracer threads updates the data bit? 
Is that thread > safe to update both kernel tracer and user tracers at the same time? > This is why atomics are used to set the bit on the kernel side. The user side should do the same. This is like the futex code. Do you see a problem with atomics being used between user and kernel space on a shared 32/64-bit address? > And what is the actual advantage of this change? Are there any issue > to use mmaped page? I would like to know more background of this > change. > Without this change user tracers like LTTng will have to check 2 values instead of 1 to tell if the kernel tracer is enabled or not. Mathieu is working on a user side tracing library in an effort to align writing tracing code in user processes that works well for both kernel and user tracers without much effort. See here: https://github.com/compudj/side Are you proposing we keep the bitmap approach and have side library just hook another branch? Mathieu had issues with that approach during our talks. > Could you also provide any sample program which I can play it? :) > When I make the next patch version, I will update the user_events sample so you'll have something to try out. > > User provided addresses must be aligned on a 32-bit boundary, this > > allows for single page checking and prevents odd behaviors such as a > > 32-bit value straddling 2 pages instead of a single page. > > > > When page faults are encountered they are done asyncly via a workqueue. > > If the page faults back in, the write update is attempted again. If the > > page cannot fault-in, then we log and wait until the next time the event > > is enabled/disabled. This is to prevent possible infinite loops resulting > > from bad user processes unmapping or changing protection values after > > registering the address. > > > > NOTE: > > User programs that wish to have the enable bit shared across forks > > either need to use a MAP_SHARED allocated address or register a new > > address and file descriptor. 
If MAP_SHARED cannot be used or new > > registrations cannot be done, then it's allowable to use MAP_PRIVATE > > as long as the forked children never update the page themselves. Once > > the page has been updated, the page from the parent will be copied over > > to the child. This new copy-on-write page will not receive updates from > > the kernel until another registration has been performed with this new > > address. > > > > Beau Belgrave (2): > > tracing/user_events: Use remote writes for event enablement > > tracing/user_events: Fixup enable faults asyncly > > > > include/linux/user_events.h | 10 +- > > kernel/trace/trace_events_user.c | 396 ++++++++++++++++++++----------- > > 2 files changed, 270 insertions(+), 136 deletions(-) > > > > > > base-commit: 23758867219c8d84c8363316e6dd2f9fd7ae3049 > > -- > > 2.25.1 > > > > > -- > Masami Hiramatsu (Google) <mhiramat@kernel.org> Thanks, -Beau ^ permalink raw reply [flat|nested] 31+ messages in thread
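[Editorial note: Beau's point that concurrent kernel/user updates are safe when both sides use atomics can be exercised with a small two-thread stress sketch. All names are illustrative; the kernel side is simulated with C11 atomic OR/AND standing in for set_bit()/clear_bit() on the remote write path.]

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define KERNEL_BIT (1u << 31)
#define ITERS 100000

static _Atomic uint32_t enable_word;

/* Stand-in for the kernel atomically toggling its reserved bit. */
static void *kernel_side(void *unused)
{
	(void)unused;
	for (int i = 0; i < ITERS; i++) {
		atomic_fetch_or(&enable_word, KERNEL_BIT);
		atomic_fetch_and(&enable_word, ~KERNEL_BIT);
	}
	return NULL;
}

/* Stand-in for a user tracer agent bumping its refcount in the low bits. */
static void *user_side(void *unused)
{
	(void)unused;
	for (int i = 0; i < ITERS; i++)
		atomic_fetch_add(&enable_word, 1);
	for (int i = 0; i < ITERS; i++)
		atomic_fetch_sub(&enable_word, 1);
	return NULL;
}

/* Returns 1 when no update was lost on either side. */
static int concurrent_updates_are_safe(void)
{
	pthread_t k, u;

	pthread_create(&k, NULL, kernel_side, NULL);
	pthread_create(&u, NULL, user_side, NULL);
	pthread_join(k, NULL);
	pthread_join(u, NULL);
	/* Each side nets zero, so any nonzero result means a lost update. */
	return atomic_load(&enable_word) == 0;
}
```

Plain `|=` / `+=` read-modify-write sequences on the same word would lose updates here; the atomic ops are what make the futex-like sharing safe.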
* Re: [RFC PATCH 0/2] tracing/user_events: Remote write ABI 2022-10-31 17:27 ` Beau Belgrave @ 2022-10-31 18:25 ` Mathieu Desnoyers 2022-11-01 13:52 ` Masami Hiramatsu 1 sibling, 0 replies; 31+ messages in thread From: Mathieu Desnoyers @ 2022-10-31 18:25 UTC (permalink / raw) To: Beau Belgrave, Masami Hiramatsu Cc: rostedt, dcook, alanau, linux-trace-devel, linux-kernel On 2022-10-31 13:27, Beau Belgrave wrote: > On Mon, Oct 31, 2022 at 11:15:56PM +0900, Masami Hiramatsu wrote: [...] >> And what is the actual advantage of this change? Are there any issue >> to use mmaped page? I would like to know more background of this >> change. >> > > Without this change user tracers like LTTng will have to check 2 values > instead of 1 to tell if the kernel tracer is enabled or not. Mathieu is > working on a user side tracing library in an effort to align writing > tracing code in user processes that works well for both kernel and user > tracers without much effort. > > See here: > https://github.com/compudj/side > > Are you proposing we keep the bitmap approach and have side library just > hook another branch? Mathieu had issues with that approach during our > talks. As overhead of the disabled tracepoints was a key factor in having the Linux kernel adopt tracepoints when I created those back in 2008, I expect that having minimal overhead in the disabled case will also prove to be a key factor for adoption by user-space applications. Another aspect that seems to be very important for wide adoption by user-space is that the instrumentation library needs to have a license that is very convenient for inclusion into statically linked software without additional license requirements. This therefore excludes GPL and LGPL. I've used the MIT license for the "side" project for that purpose. Indeed, my ideal scenario is to use asm goto and implement something similar to jump labels in user-space so the instrumentation only costs a no-op or a jump when instrumentation is disabled. 
That can only be used in contexts where code patching is allowed though (not for Runtime Integrity Checked (RIC) processes). My next-to-best scenario is to have a single load (from fixed offset), test and conditional branch in the userspace fast-path instead. This approach will need to be made available as a fall-back for processes which are flagged as RIC-protected. I currently focus my efforts on the load+test+conditional branch scheme, which is somewhat simpler than the code patching approach in terms of needed infrastructure. If we go for the current user events bitmap approach, then anything we do from userspace will have more overhead (additional pointer chasing, loads, and masks to apply). And it pretty much rules out code patching. In terms of missing pieces to allow code patching to be done in userspace, here is what I think we'd need: - Extend the "side" (MIT-licensed) library to implement gadgets which support code patching, but fall-back to load+test+conditional branch if code patching is not available. Roughly, those would look like (this is really just pseudo-code):

.pushsection side_jmp
/*
 * &side_enabled_value is the key used to change the enabled/disabled state.
 * 1f is the address of the code to patch.
 * 3f is the address of the branch target when disabled.
 * 4f is the address of the branch target when enabled.
 */
.quad &side_enabled_value, 1f, 3f, 4f
.popsection

/*
 * Place all jump instructions that will be modified by code patching into a
 * single section. Therefore, this will minimize the amount of COW required when
 * patching code from executables and shared libraries that have instances in
 * many processes.
 */
.pushsection side_jmp_modify_code (executable section)
1:  jump to 2f
.popsection
    jump to 1b
2:  load side_enabled_value
    test
    cond. branch to 4
3:  -> disabled
4:  -> enabled

When loading the .so or the executable, the initial state uses the load, test, conditional branch.
Then in a constructor, if code patching is available, the jump at label (1) can be updated to target (3) instead. Then when enabled, it can be updated to target (4) instead. - Implement a code patching system call in the kernel which takes care of all the details associated with code patching that supports concurrent execution (breakpoint bypass, or stopping target processes if required by the architecture). This system call could check whether the target process has Runtime Integrity Check enforced, and refuse code patching as needed. As a nice side-effect, this could allow us to implement things like "alternative" assembler instruction selection in user-space. - Figure out a way to let a user-space process let the kernel know that it needs to enforce Runtime Integrity Check. It could be either a prctl(), or perhaps a clone flag if this needs to be known very early in the process lifetime. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 31+ messages in thread
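[Editorial note: the "single load (from fixed offset), test and conditional branch" fall-back Mathieu describes might look roughly like the macro below in C. This is a hypothetical sketch, not the actual side library API; the branch hint via __builtin_expect (GCC/Clang) keeps the disabled path to a load, a test, and a not-taken branch.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t my_event_enabled;	/* hypothetical per-event word */
static int my_event_hits;			/* counts slow-path entries */

/* Slow path: a real tracer would record the payload here. */
static void emit_my_event(int arg)
{
	(void)arg;
	my_event_hits++;
}

/* Disabled cost: one load from a fixed address, one test, and one
 * branch hinted as not taken. */
#define TRACE_MY_EVENT(arg)						\
	do {								\
		if (__builtin_expect(atomic_load_explicit(		\
		    &my_event_enabled, memory_order_relaxed) != 0, 0))	\
			emit_my_event(arg);				\
	} while (0)
```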
* Re: [RFC PATCH 0/2] tracing/user_events: Remote write ABI 2022-10-31 17:27 ` Beau Belgrave 2022-10-31 18:25 ` Mathieu Desnoyers @ 2022-11-01 13:52 ` Masami Hiramatsu 2022-11-01 16:55 ` Beau Belgrave 1 sibling, 1 reply; 31+ messages in thread From: Masami Hiramatsu @ 2022-11-01 13:52 UTC (permalink / raw) To: Beau Belgrave Cc: rostedt, mathieu.desnoyers, dcook, alanau, linux-trace-devel, linux-kernel On Mon, 31 Oct 2022 10:27:06 -0700 Beau Belgrave <beaub@linux.microsoft.com> wrote: > On Mon, Oct 31, 2022 at 11:15:56PM +0900, Masami Hiramatsu wrote: > > Hi Beau, > > > > On Thu, 27 Oct 2022 15:40:09 -0700 > > Beau Belgrave <beaub@linux.microsoft.com> wrote: > > > > > As part of the discussions for user_events aligned with user space > > > tracers, it was determined that user programs should register a 32-bit > > > value to set or clear a bit when an event becomes enabled. Currently a > > > shared page is being used that requires mmap(). > > > > > > In this new model during the event registration from user programs 2 new > > > values are specified. The first is the address to update when the event > > > is either enabled or disabled. The second is the bit to set/clear to > > > reflect the event being enabled. This allows for a local 32-bit value in > > > user programs to support both kernel and user tracers. As an example, > > > setting bit 31 for kernel tracers when the event becomes enabled allows > > > for user tracers to use the other bits for ref counts or other flags. > > > The kernel side updates the bit atomically, user programs need to also > > > update these values atomically. > > > > I think you means the kernel tracer (ftrace/perf) and user tracers (e.g. > > LTTng) use the same 32bit data so that traced user-application only checks > > that data for checking an event is enabled, right? > > > > Yes, exactly, user code can just check a single uint32 or uint64 to tell > if anything is enabled (kernel or user tracer). 
> > > If so, who the user tracer threads updates the data bit? Is that thread > > safe to update both kernel tracer and user tracers at the same time? > > > > This is why atomics are used to set the bit on the kernel side. The user > side should do the same. This is like the futex code. Do you see a > problem with atomics being used between user and kernel space on a > shared 32/64-bit address? Ah, OK. set_bit()/clear_bit() are atomic ops. So the user tracer must use per-arch atomic ops implementation too. Hmm, can you comment it there? > > > And what is the actual advantage of this change? Are there any issue > > to use mmaped page? I would like to know more background of this > > change. > > > > Without this change user tracers like LTTng will have to check 2 values > instead of 1 to tell if the kernel tracer is enabled or not. Mathieu is > working on a user side tracing library in an effort to align writing > tracing code in user processes that works well for both kernel and user > tracers without much effort. > > See here: > https://github.com/compudj/side Thanks for pointing! > > Are you proposing we keep the bitmap approach and have side library just > hook another branch? Mathieu had issues with that approach during our > talks. No, that makes things more complicated. We should choose one. > > > Could you also provide any sample program which I can play it? :) > > > > When I make the next patch version, I will update the user_events sample > so you'll have something to try out. That's helpful for me. We can have the code under tools/tracing/user_events/. Thank you, > > > > User provided addresses must be aligned on a 32-bit boundary, this > > > allows for single page checking and prevents odd behaviors such as a > > > 32-bit value straddling 2 pages instead of a single page. > > > > > > When page faults are encountered they are done asyncly via a workqueue. > > > If the page faults back in, the write update is attempted again. 
If the > > > page cannot fault-in, then we log and wait until the next time the event > > > is enabled/disabled. This is to prevent possible infinite loops resulting > > > from bad user processes unmapping or changing protection values after > > > registering the address. > > > > > > NOTE: > > > User programs that wish to have the enable bit shared across forks > > > either need to use a MAP_SHARED allocated address or register a new > > > address and file descriptor. If MAP_SHARED cannot be used or new > > > registrations cannot be done, then it's allowable to use MAP_PRIVATE > > > as long as the forked children never update the page themselves. Once > > > the page has been updated, the page from the parent will be copied over > > > to the child. This new copy-on-write page will not receive updates from > > > the kernel until another registration has been performed with this new > > > address. > > > > > > Beau Belgrave (2): > > > tracing/user_events: Use remote writes for event enablement > > > tracing/user_events: Fixup enable faults asyncly > > > > > > include/linux/user_events.h | 10 +- > > > kernel/trace/trace_events_user.c | 396 ++++++++++++++++++++----------- > > > 2 files changed, 270 insertions(+), 136 deletions(-) > > > > > > > > > base-commit: 23758867219c8d84c8363316e6dd2f9fd7ae3049 > > > -- > > > 2.25.1 > > > > > > > > > -- > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > Thanks, > -Beau -- Masami Hiramatsu (Google) <mhiramat@kernel.org> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC PATCH 0/2] tracing/user_events: Remote write ABI 2022-11-01 13:52 ` Masami Hiramatsu @ 2022-11-01 16:55 ` Beau Belgrave 0 siblings, 0 replies; 31+ messages in thread From: Beau Belgrave @ 2022-11-01 16:55 UTC (permalink / raw) To: Masami Hiramatsu Cc: rostedt, mathieu.desnoyers, dcook, alanau, linux-trace-devel, linux-kernel On Tue, Nov 01, 2022 at 10:52:20PM +0900, Masami Hiramatsu wrote: > On Mon, 31 Oct 2022 10:27:06 -0700 > Beau Belgrave <beaub@linux.microsoft.com> wrote: > > > On Mon, Oct 31, 2022 at 11:15:56PM +0900, Masami Hiramatsu wrote: > > > Hi Beau, > > > > > > On Thu, 27 Oct 2022 15:40:09 -0700 > > > Beau Belgrave <beaub@linux.microsoft.com> wrote: > > > > > > > As part of the discussions for user_events aligned with user space > > > > tracers, it was determined that user programs should register a 32-bit > > > > value to set or clear a bit when an event becomes enabled. Currently a > > > > shared page is being used that requires mmap(). > > > > > > > > In this new model during the event registration from user programs 2 new > > > > values are specified. The first is the address to update when the event > > > > is either enabled or disabled. The second is the bit to set/clear to > > > > reflect the event being enabled. This allows for a local 32-bit value in > > > > user programs to support both kernel and user tracers. As an example, > > > > setting bit 31 for kernel tracers when the event becomes enabled allows > > > > for user tracers to use the other bits for ref counts or other flags. > > > > The kernel side updates the bit atomically, user programs need to also > > > > update these values atomically. > > > > > > I think you means the kernel tracer (ftrace/perf) and user tracers (e.g. > > > LTTng) use the same 32bit data so that traced user-application only checks > > > that data for checking an event is enabled, right? 
> > > > > > > Yes, exactly, user code can just check a single uint32 or uint64 to tell > > if anything is enabled (kernel or user tracer). > > > > > If so, who the user tracer threads updates the data bit? Is that thread > > > safe to update both kernel tracer and user tracers at the same time? > > > > > > > This is why atomics are used to set the bit on the kernel side. The user > > side should do the same. This is like the futex code. Do you see a > > problem with atomics being used between user and kernel space on a > > shared 32/64-bit address? > > Ah, OK. set_bit()/clear_bit() are atomic ops. So the user tracer must > use per-arch atomic ops implementation too. Hmm, can you comment it there? > I can add a comment here, I also plan to update our documentation. I really want to get good feedback on this first, so I avoid updating the documentation several times as we progress this conversation. Expect documentation updates when I flip from RFC to normal patchset. > > > > > And what is the actual advantage of this change? Are there any issue > > > to use mmaped page? I would like to know more background of this > > > change. > > > > > > > Without this change user tracers like LTTng will have to check 2 values > > instead of 1 to tell if the kernel tracer is enabled or not. Mathieu is > > working on a user side tracing library in an effort to align writing > > tracing code in user processes that works well for both kernel and user > > tracers without much effort. > > > > See here: > > https://github.com/compudj/side > > Thanks for pointing! > > > > > Are you proposing we keep the bitmap approach and have side library just > > hook another branch? Mathieu had issues with that approach during our > > talks. > > No, that makes things more complicated. We should choose one. > Agree, it seems we are settling behind the user provided address approach, as long as we can work through fork() and other scenarios. 
During the bi-weekly tracefs meetings we've been going back and forth on which approach to take. I promised a RFC patch to see how far I could get on this to see what edge cases exist that we need to work through. Currently fork() seems the hardest to do with private mappings, but I believe I have a path forward that I'll put in the next version of this patchset. > > > > > Could you also provide any sample program which I can play it? :) > > > > > > > When I make the next patch version, I will update the user_events sample > > so you'll have something to try out. > > That's helpful for me. We can have the code under tools/tracing/user_events/. > I was planning to update the existing sample at samples/user_events/. Any reason that location can't be used? > Thank you, > > > > > > > User provided addresses must be aligned on a 32-bit boundary, this > > > > allows for single page checking and prevents odd behaviors such as a > > > > 32-bit value straddling 2 pages instead of a single page. > > > > > > > > When page faults are encountered they are done asyncly via a workqueue. > > > > If the page faults back in, the write update is attempted again. If the > > > > page cannot fault-in, then we log and wait until the next time the event > > > > is enabled/disabled. This is to prevent possible infinite loops resulting > > > > from bad user processes unmapping or changing protection values after > > > > registering the address. > > > > > > > > NOTE: > > > > User programs that wish to have the enable bit shared across forks > > > > either need to use a MAP_SHARED allocated address or register a new > > > > address and file descriptor. If MAP_SHARED cannot be used or new > > > > registrations cannot be done, then it's allowable to use MAP_PRIVATE > > > > as long as the forked children never update the page themselves. Once > > > > the page has been updated, the page from the parent will be copied over > > > > to the child. 
This new copy-on-write page will not receive updates from > > > > the kernel until another registration has been performed with this new > > > > address. > > > > > > > > Beau Belgrave (2): > > > > tracing/user_events: Use remote writes for event enablement > > > > tracing/user_events: Fixup enable faults asyncly > > > > > > > > include/linux/user_events.h | 10 +- > > > > kernel/trace/trace_events_user.c | 396 ++++++++++++++++++++----------- > > > > 2 files changed, 270 insertions(+), 136 deletions(-) > > > > > > > > > > > > base-commit: 23758867219c8d84c8363316e6dd2f9fd7ae3049 > > > > -- > > > > 2.25.1 > > > > > > > > > > > > > -- > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > > Thanks, > > -Beau > > > -- > Masami Hiramatsu (Google) <mhiramat@kernel.org> Thanks, -Beau ^ permalink raw reply [flat|nested] 31+ messages in thread
Thread overview: 31+ messages
2022-10-27 22:40 [RFC PATCH 0/2] tracing/user_events: Remote write ABI Beau Belgrave
2022-10-27 22:40 ` [RFC PATCH 1/2] tracing/user_events: Use remote writes for event enablement Beau Belgrave
2022-10-29 14:44   ` Mathieu Desnoyers
2022-10-31 16:38     ` Beau Belgrave
2022-10-31 14:47   ` Masami Hiramatsu
2022-10-31 16:46     ` Beau Belgrave
2022-10-31 23:55       ` Masami Hiramatsu
2022-11-01 16:45         ` Beau Belgrave
2022-10-27 22:40 ` [RFC PATCH 2/2] tracing/user_events: Fixup enable faults asyncly Beau Belgrave
2022-10-28 22:07   ` Mathieu Desnoyers
2022-10-28 22:35     ` Beau Belgrave
2022-10-29 14:23       ` Mathieu Desnoyers
2022-10-31 16:58         ` Beau Belgrave
2022-10-28 22:19   ` Mathieu Desnoyers
2022-10-28 22:42     ` Beau Belgrave
2022-10-29 14:40       ` Mathieu Desnoyers
2022-10-30 11:45         ` Mathieu Desnoyers
2022-10-31 17:18           ` Beau Belgrave
2022-10-31 17:12         ` Beau Belgrave
2022-10-28 21:50 ` [RFC PATCH 0/2] tracing/user_events: Remote write ABI Mathieu Desnoyers
2022-10-28 22:17   ` Beau Belgrave
2022-10-29 13:58     ` Mathieu Desnoyers
2022-10-31 16:53       ` Beau Belgrave
2022-11-02 13:46         ` Mathieu Desnoyers
2022-11-02 17:18           ` Beau Belgrave
2022-10-31 14:15 ` Masami Hiramatsu
2022-10-31 15:27   ` Mathieu Desnoyers
2022-10-31 17:27   ` Beau Belgrave
2022-10-31 18:25     ` Mathieu Desnoyers
2022-11-01 13:52     ` Masami Hiramatsu
2022-11-01 16:55       ` Beau Belgrave