* Comments on new user events ABI
@ 2022-03-28 20:24 Mathieu Desnoyers
  2022-03-28 20:35 ` Mathieu Desnoyers
  2022-03-29  0:29 ` Beau Belgrave
  0 siblings, 2 replies; 11+ messages in thread
From: Mathieu Desnoyers @ 2022-03-28 20:24 UTC (permalink / raw)
  To: Beau Belgrave, rostedt; +Cc: linux-kernel, linux-trace-devel, linux-arch

Hi Beau, Hi Steven,

I've done a review of the trace events ABI, and I have a few comments.
Sorry for being late to the party, but I only noticed this new ABI recently.
Hopefully we can improve this ABI before the 5.18 release.

A bit of context: as you know, I maintain LTTng-UST, a user-space Linux
tracer. It does not implement much in the kernel because its goal is to
be purely user-space, mainly for performance reasons. However, when there
are relevant kernel facilities it can use, or if I need to extend the
kernel to expose a generally useful ABI to user-space, I do it. For
instance, I've contributed the rseq() system call for the sake of speeding
up the LTTng-UST ring buffer and sched_getcpu(). The work I am currently
doing on virtual-cpu-ids is also with a primary goal of improving the
LTTng-UST ring buffers (and all memory allocators as a side-effect) ;) .
Hopefully we can work together and make sure the new ABIs are useful to
everyone.


* user_events_status memory mapping

As I understand it, one part of the user events ABI is a memory mapping
which contains "flags" indicating whether a given event is enabled.
It is indexed by byte, and each byte has this bitwise meaning:

/* Bits 0-6 are for known probe types, Bit 7 is for unknown probes */
#define EVENT_BIT_FTRACE 0
#define EVENT_BIT_PERF 1
#define EVENT_BIT_OTHER 7

There are a few things I find odd here. First, to improve use of CPU cache,
I would have expected this memory mapping to expose enable flags as a
bitmap rather than an array of bytes, indexed bit-wise rather than byte-wise.
I also don't get what user-space is expected to do differently if FTRACE vs
PERF is enabled, considering that it gates a writev() to a file descriptor
associated with /sys/kernel/debug/tracing/user_events_data.
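
For the record, what I have in mind on the user-space side is a plain bit test
against the mmap'd page. Untested user-space sketch, which assumes the
registration ioctl would hand back a bit index rather than the current byte
status_index (the names below are hypothetical):

#include <limits.h>

#define BITS_PER_LONG  (sizeof(unsigned long) * CHAR_BIT)

/* status_map points to the mmap'd status page, bit_index comes from registration. */
static inline int user_event_enabled(const unsigned long *status_map,
                                     unsigned int bit_index)
{
        return (status_map[bit_index / BITS_PER_LONG] >>
                (bit_index % BITS_PER_LONG)) & 1UL;
}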

I would have rather thought that tracers implemented in user-space could register
themselves, and then there could be one /sys/kernel/debug/tracing/user_events_status
per tracer. Considering that all kernel tracers use the same ABI to write an event,
and then dispatch this event internally within the kernel to each registered
tracer, I would expect to have a single memory mapping for all those (e.g. a
/sys/kernel/debug/tracing/user_events_status/kernel_tracers file).

Then eventually if we have other user-space tracers such as lttng-ust with
their own user-space code performing tracing in a shared memory ring buffer, it
would make sense to allow it to register its own
/sys/kernel/debug/tracing/user_events_status/lttng_ust file, with its own indexes.

If this facility is ever used by lttng-ust to enable user-space tracing, I would not
want to take the overhead of calling writev for the sake of kernel tracers if
those are disabled.

So perhaps in the short-term there is no need to implement the user-space tracer
registration ABI, but I would have expected a simple bitmap for
/sys/kernel/debug/tracing/user_events_data/kernel_tracers rather than the
bytewise index, because as far as the kernel tracers are concerned, providing
bits that tell userspace instrumentation exactly which tracers are internally
enabled within the kernel does not appear to be of any use other than increasing
the footprint of the actively used CPU cache lines.


* user_events_data page faults

If my understanding is correct, when the user-space program's memory containing
the payload passed to writev() to a user_events_data file descriptor is kicked
out from the page cache between fault_in_iov_iter_readable and its use by the
tracers due to high memory pressure, the writev() will fail with -EFAULT and
the data will be discarded unless user-space somehow handles this error (which
is not handled in the samples/user_events/sample.c example program). It is good
that the memory is faulted in immediately before calling the tracers, but
considering that it is not mlock'd, should we make more effort to ensure the
tracers are able to handle page faults ?

Integration of the work done by Michael Jeanson and myself on faultable tracepoints
would allow the tracepoint probes to take page faults. Then, further modifications
in the kernel tracers would be needed to handle those page faults.


* user_reg name_args and write_index vs purely user-space tracers

That part of the user event registration (event layout and ID allocation) appears
to be intrinsically tied to the kernel tracers and the expected event layout. This
seems fine as long as the only users we consider are the kernel tracers, but it
appears to be less relevant for purely user-space tracers. Actually, tying the
mmap'd event enable mechanism with the event ID and description makes me wonder
whether it might be better to have LTTng-UST implement its own shared-memory based
"fast-event-enabling" mechanism rather than use this user-event ABI. The other
advantage of doing all of this in user-space would be to allow many instances
of this bitmap to exist on a given system, e.g. one per container in a multi-container
system, rather than requiring this to be a global kernel-wide singleton, and to use
it as a non-privileged user.


Some comments about the implementation:

kernel/trace/trace_events_user.c:
static ssize_t user_events_write(struct file *file, const char __user *ubuf,
                                 size_t count, loff_t *ppos)
{
        struct iovec iov;
        struct iov_iter i;

        if (unlikely(*ppos != 0))
                return -EFAULT;

        if (unlikely(import_single_range(READ, (char *)ubuf, count, &iov, &i)))
                return -EFAULT;
                                         ^ shouldn't this be "WRITE" ? This takes data from
                                           user-space and copies it into the kernel, similarly
                                           to fs/read_write.c:new_sync_write().

        return user_events_write_core(file, &i);
}
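
In other words, something like this (untested, assuming import_single_range()
keeps its current (int rw, void __user *buf, size_t len, ...) signature):

        if (unlikely(import_single_range(WRITE, (char *)ubuf, count, &iov, &i)))
                return -EFAULT;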

include/uapi/linux/user_events.h:

struct user_reg {

        /* Input: Size of the user_reg structure being used */
        __u32 size;

        /* Input: Pointer to string with event name, description and flags */
        __u64 name_args;

        /* Output: Byte index of the event within the status page */
        __u32 status_index;

        /* Output: Index of the event to use when writing data */
        __u32 write_index;
};

As this structure is expected to grow, and the user-space sample program uses "sizeof()"
to figure out its size (which includes padding), I would be more comfortable if this were
a packed structure rather than non-packed, because as fields are added, it's tricky from
the kernel perspective to figure out whether the bytes covered by the received size are
fields that user-space is aware of, or just padding.
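
Concretely, what I have in mind is simply (illustration only, not a tested
patch):

struct user_reg {

        /* Input: Size of the user_reg structure being used */
        __u32 size;

        /* Input: Pointer to string with event name, description and flags */
        __u64 name_args;

        /* Output: Byte index of the event within the status page */
        __u32 status_index;

        /* Output: Index of the event to use when writing data */
        __u32 write_index;
} __attribute__((__packed__));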

include/uapi/linux/user_events.h:

struct user_bpf_iter {

        /* Offset of the data within the first iovec */
        __u32 iov_offset;

        /* Number of iovec structures */
        __u32 nr_segs;

        /* Pointer to iovec structures */
        const struct iovec *iov;

                           ^ a pointer in a uapi header is usually a no-go. This should be a u64.
};

include/uapi/linux/user_events.h:

struct user_bpf_context {

        /* Data type being passed (see union below) */
        __u32 data_type;

        /* Length of the data */
        __u32 data_len;

        /* Pointer to data, varies by data type */
        union {
                /* Kernel data (data_type == USER_BPF_DATA_KERNEL) */
                void *kdata;

                /* User data (data_type == USER_BPF_DATA_USER) */
                void *udata;

                /* Direct iovec (data_type == USER_BPF_DATA_ITER) */
                struct user_bpf_iter *iter;

                               ^ likewise for the 3 pointers above. Should be u64 in uapi headers.
        };
};
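
A pointer-free layout could look as follows (sketch, only the types change;
consumers would cast the __u64 back to a pointer, e.g. with u64_to_user_ptr()
for user addresses or an explicit cast for kernel addresses):

struct user_bpf_iter {
        __u32 iov_offset;
        __u32 nr_segs;
        __u64 iov;              /* Address of the iovec array */
};

struct user_bpf_context {
        __u32 data_type;
        __u32 data_len;
        union {
                __u64 kdata;    /* USER_BPF_DATA_KERNEL */
                __u64 udata;    /* USER_BPF_DATA_USER */
                __u64 iter;     /* USER_BPF_DATA_ITER */
        };
};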

kernel/trace/trace_events_user.c:

static long user_reg_get(struct user_reg __user *ureg, struct user_reg *kreg)
{
        u32 size;
        long ret;

        ret = get_user(size, &ureg->size);

        if (ret)
                return ret;


        if (size > PAGE_SIZE)
                return -E2BIG;

        ^ here I would be tempted to validate that the structure size at least provides room
          for the "v0" ABI, e.g.:

             if (size < offsetofend(struct user_reg, write_index))
                  return -EINVAL;

        return copy_struct_from_user(kreg, sizeof(*kreg), ureg, size);

              ^ I find it odd that the kernel copy of struct user_reg may contain a
                size field whose contents differ from the size fetched by get_user().
                This can happen if a buggy or plainly hostile user-space attempts to
                confuse the kernel about the size of this structure. Fortunately, the
                size field does not seem to be used afterwards, but I would think it
                safer to copy the "size" fetched by get_user() back into kreg->size
                after copy_struct_from_user() in case future changes in the code end up
                relying on a consistent size field.
}
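
Putting both points together, I am thinking of something along these lines
(untested sketch):

static long user_reg_get(struct user_reg __user *ureg, struct user_reg *kreg)
{
        u32 size;
        long ret;

        ret = get_user(size, &ureg->size);

        if (ret)
                return ret;

        /* Reject sizes that cannot even hold the "v0" fields. */
        if (size < offsetofend(struct user_reg, write_index))
                return -EINVAL;

        if (size > PAGE_SIZE)
                return -E2BIG;

        ret = copy_struct_from_user(kreg, sizeof(*kreg), ureg, size);

        if (ret)
                return ret;

        /* Keep the kernel copy consistent with the size actually validated. */
        kreg->size = size;

        return 0;
}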

kernel/trace/trace_events_user.c:

static struct user_event *find_user_event(char *name, u32 *outkey)
{
        struct user_event *user;
        u32 key = user_event_key(name);

        *outkey = key;

        hash_for_each_possible(register_table, user, node, key)
                if (!strcmp(EVENT_NAME(user), name)) {
                        atomic_inc(&user->refcnt);

                        ^ what happens if an ill-intended user-space populates enough references
                          to overflow refcnt (atomic_t)? I suspect it can make the kernel free
                          memory that is still in use, and trigger a use-after-free scenario.
                          Usually reference counters should use include/linux/refcount.h, which
                          handles reference counter saturation. user_event_parse() also has a use
                          of atomic_inc() on that same refcnt which userspace can overflow.

                        return user;
                }

        return NULL;
}
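
With refcount_t (include/linux/refcount.h) this would become something like the
following sketch, assuming user->refcnt is converted to refcount_t and
initialized with refcount_set(&user->refcnt, 1) when the event is created:

static struct user_event *find_user_event(char *name, u32 *outkey)
{
        struct user_event *user;
        u32 key = user_event_key(name);

        *outkey = key;

        hash_for_each_possible(register_table, user, node, key)
                if (!strcmp(EVENT_NAME(user), name)) {
                        /* Saturates instead of wrapping around on overflow. */
                        refcount_inc(&user->refcnt);

                        return user;
                }

        return NULL;
}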

kernel/trace/trace_events_user.c:

static int user_events_release(struct inode *node, struct file *file)
{
[...]
        /*
         * Ensure refs cannot change under any situation by taking the
         * register mutex during the final freeing of the references.
         */
        mutex_lock(&reg_mutex);
[...]
        mutex_unlock(&reg_mutex);

        kfree(refs);

        ^ AFAIU, the user_events_write() does not rely on reg_mutex to ensure mutual exclusion.
          Doing so would be prohibitive performance-wise. But I suspect that freeing "refs" here
          without waiting for a RCU grace period can be an issue if user_events_write_core is using
          refs concurrently with file descriptor close.
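
          If the race I have in mind is real, what I would expect here is along
          the lines of the following sketch (assuming the refs lookup on the
          write path is, or becomes, an RCU read-side critical section):

        mutex_lock(&reg_mutex);
[...]
        mutex_unlock(&reg_mutex);

        /*
         * Wait for any concurrent user_events_write_core() still
         * dereferencing the old refs under rcu_read_lock() before freeing
         * it (kfree_rcu() would also work if the structure gains an
         * rcu_head).
         */
        synchronize_rcu();

        kfree(refs);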

kernel/trace/trace_events_user.c:

static bool user_field_match(struct ftrace_event_field *field, int argc,
                             const char **argv, int *iout)
[...]

        for (; i < argc; ++i) {
[...]
                pos += snprintf(arg_name + pos, len - pos, argv[i]);

        ^ what happens if strlen(argv[i]) > (len - pos) ? Based on lib/vsprintf.c:

 * The return value is the number of characters which would be
 * generated for the given input, excluding the trailing null,
 * as per ISO C99.  If the return is greater than or equal to
 * @size, the resulting string is truncated.

        So the "pos" returned by the first call to sprintf would be greater than MAX_FIELD_ARG_NAME.
        Then the second call to snprintf passes a @size argument of "len - pos" using the pos value
        which is larger than len... which is a negative integer passed as argument to a size_t (unsigned).
        So it expects a very long string. And the @buf argument is out-of-bound (field_name + pos).
        Is this pattern for using snprintf() used elsewhere ? From a quick grep, I find this pattern in
        a few places where AFAIU the input is not user-controlled (as it seems to be the case here), but
        still it might be worth looking into:

             kernel/cgroup/cgroup.c:show_delegatable_files()
             kernel/time/clocksource.c:available_clocksource_show()

Also, passing a copy of a userspace string (argv[i]) as format string argument to
snprintf can be misused to leak kernel data to user-space.

The same function also appears to have similar issues with its use of the field->name userspace input
string.
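
As a first step, I would expect the formatting of each argument to go through
an explicit "%s" format and to check for truncation, e.g. (sketch of the inner
loop body, with an additional local "int ret"; whether bailing out on overflow
is the right behaviour is a separate question):

                ret = snprintf(arg_name + pos, len - pos, "%s", argv[i]);

                if (ret < 0 || ret >= len - pos)
                        return false;   /* Combined name would overflow arg_name. */

                pos += ret;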

Unfortunately this is all the time I have for review right now, but it is at least a good starting
point for discussion.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: Comments on new user events ABI
  2022-03-28 20:24 Comments on new user events ABI Mathieu Desnoyers
@ 2022-03-28 20:35 ` Mathieu Desnoyers
  2022-03-28 23:30   ` Beau Belgrave
  2022-03-29  0:29 ` Beau Belgrave
  1 sibling, 1 reply; 11+ messages in thread
From: Mathieu Desnoyers @ 2022-03-28 20:35 UTC (permalink / raw)
  To: Beau Belgrave, rostedt; +Cc: linux-kernel, linux-trace-devel, linux-arch

----- On Mar 28, 2022, at 4:24 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> Hi Beau, Hi Steven,
> 
> I've done a review of the trace events ABI, and I have a few comments.
> Sorry for being late to the party, but I only noticed this new ABI recently.
> Hopefully we can improve this ABI before the 5.18 release.
> 

Also a bit of testing shows that dyn_event_add() is called without holding the event_mutex.
Has this been tested with lockdep ?

[  144.192299] ------------[ cut here ]------------
[  144.194026] WARNING: CPU: 10 PID: 2689 at kernel/trace/trace_dynevent.h:82 user_event_parse_cmd+0x972/0xa00
[  144.196850] Modules linked in:
[  144.197836] CPU: 10 PID: 2689 Comm: example Not tainted 5.17.0+ #269
[  144.199805] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[  144.202303] RIP: 0010:user_event_parse_cmd+0x972/0xa00
[  144.203899] Code: 48 00 00 00 00 e9 cf f8 ff ff 41 bf f4 ff ff ff e9 3a f7 ff ff be ff ff ff ff 48 c7 c7 08 bb f7 8a e8 b2 05 de 00 85 c0 75 02 <0f> 0b 48 83 bb 30 01 00 00 00 0f 84 54 fa ff ff e9 25 fa ff ff 48
[  144.209398] RSP: 0018:ffffb6264b87be10 EFLAGS: 00010246
[  144.211098] RAX: 0000000000000000 RBX: ffff9c3045cb7c00 RCX: 0000000000000001
[  144.213314] RDX: 0000000000000000 RSI: ffffffff8aa2d11e RDI: ffffffff8aac2f16
[  144.215577] RBP: ffff9c3045cb7d20 R08: 0000000000000001 R09: 0000000000000001
[  144.217723] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000011
[  144.221511] R13: ffffb6264b87bea8 R14: 000000000000000c R15: 0000000000000000
[  144.223760] FS:  00007ff6d10e54c0(0000) GS:ffff9c3627a80000(0000) knlGS:0000000000000000
[  144.226364] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  144.228203] CR2: 00007ff6d0b16a80 CR3: 00000006530ae004 CR4: 00000000001706e0
[  144.230349] Call Trace:
[  144.231307]  <TASK>
[  144.232140]  ? _copy_from_user+0x68/0xa0
[  144.233534]  user_events_ioctl+0xfe/0x4d0
[  144.234980]  __x64_sys_ioctl+0x8e/0xd0
[  144.236268]  ? lockdep_hardirqs_on+0x7d/0x100
[  144.237771]  do_syscall_64+0x3a/0x80
[  144.239036]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  144.240696] RIP: 0033:0x7ff6d0b16217
[  144.241938] Code: b3 66 90 48 8b 05 71 4c 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4c 2d 00 f7 d8 64 89 01 48
[  144.247797] RSP: 002b:00007ffce19eb3b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  144.252578] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff6d0b16217
[  144.254897] RDX: 00007ffce19eb3e0 RSI: 00000000c0082a00 RDI: 0000000000000003
[  144.257162] RBP: 00007ffce19eb400 R08: 0000000000000003 R09: 0000000000000000
[  144.259487] R10: 0000000000000000 R11: 0000000000000246 R12: 000056095fe00820
[  144.261817] R13: 00007ffce19eb550 R14: 0000000000000000 R15: 0000000000000000
[  144.264135]  </TASK>
[  144.264980] irq event stamp: 4515
[  144.266162] hardirqs last  enabled at (4523): [<ffffffff8916ab2e>] __up_console_sem+0x5e/0x70
[  144.268987] hardirqs last disabled at (4532): [<ffffffff8916ab13>] __up_console_sem+0x43/0x70
[  144.271739] softirqs last  enabled at (4390): [<ffffffff8a400361>] __do_softirq+0x361/0x4a8
[  144.274480] softirqs last disabled at (4385): [<ffffffff890e7554>] irq_exit_rcu+0x104/0x120
[  144.277220] ---[ end trace 0000000000000000 ]---


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: Comments on new user events ABI
  2022-03-28 20:35 ` Mathieu Desnoyers
@ 2022-03-28 23:30   ` Beau Belgrave
  0 siblings, 0 replies; 11+ messages in thread
From: Beau Belgrave @ 2022-03-28 23:30 UTC (permalink / raw)
  To: mathieu.desnoyers
  Cc: linux-arch, linux-kernel, linux-trace-devel, rostedt, beaub

> ----- On Mar 28, 2022, at 4:24 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:
> 
> > Hi Beau, Hi Steven,
> > 
> > I've done a review of the trace events ABI, and I have a few comments.
> > Sorry for being late to the party, but I only noticed this new ABI recently.
> > Hopefully we can improve this ABI before the 5.18 release.
> > 
> 
> Also a bit of testing shows that dyn_event_add() is called without holding the event_mutex.
> Has this been tested with lockdep ?

Sorry, looks like I've been running with only LOCKDEP_SUPPORT, without lockdep itself on :(
I now have both CONFIG_LOCKDEP and CONFIG_LOCK_DEBUGGING_SUPPORT.

You are right about this, thanks! I've sent a patch.

Link: https://lore.kernel.org/all/20220328223225.1992-1-beaub@linux.microsoft.com/

Thanks,
-Beau

* Re: Comments on new user events ABI
  2022-03-28 20:24 Comments on new user events ABI Mathieu Desnoyers
  2022-03-28 20:35 ` Mathieu Desnoyers
@ 2022-03-29  0:29 ` Beau Belgrave
  2022-03-29 16:17   ` Mathieu Desnoyers
  1 sibling, 1 reply; 11+ messages in thread
From: Beau Belgrave @ 2022-03-29  0:29 UTC (permalink / raw)
  To: mathieu.desnoyers
  Cc: beaub, linux-arch, linux-kernel, linux-trace-devel, rostedt, mhiramat

> ----- On Mar 28, 2022, at 4:24 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:
> Hi Beau, Hi Steven,
> 
> I've done a review of the trace events ABI, and I have a few comments.
> Sorry for being late to the party, but I only noticed this new ABI recently.
> Hopefully we can improve this ABI before the 5.18 release.
> 

Welcome to the party :)

> * user_events_status memory mapping
> 
> As I understand it, one part of the user events ABI is a memory mapping
> which contains "flags" which indicates whether a given event is enabled.
> It is indexed by byte, and each byte has this bitwise meaning:
> 
> /* Bits 0-6 are for known probe types, Bit 7 is for unknown probes */
> #define EVENT_BIT_FTRACE 0
> #define EVENT_BIT_PERF 1
> #define EVENT_BIT_OTHER 7
> 
> There are a few things I find odd here. First, to improve use of CPU cache,
> I would have expected this memory mapping to expose enable flags as a
> bitmap rather than an array of bytes, indexed bit-wise rather than byte-wise.
> I also don't get what user-space is expected to do differently if FTRACE vs
> PERF is enabled, considering that it gates a writev() to a file descriptor
> associated with /sys/kernel/debug/tracing/user_events_data.
> 

The intention wasn't for user-space to check the byte for anything other than
non-zero. User processes could do that, but it's more so that administration
tools can see where the events are registered (if they cannot be closed) and
the state of the machine.

Maybe you have a slicker way to do this, but it seems that checking a bit in
the page would involve at least a byte read followed by a mask or shift? That
seems more expensive than checking a byte?

> I would have rather thought that tracers implemented in user-space could register
> themselves, and then there could be one /sys/kernel/debug/tracing/user_events_status
> per tracer. Considering that all kernel tracers use the same ABI to write an event,
> and then dispatch this event internally within the kernel to each registered
> tracer, I would expect to have a single memory mapping for all those (e.g. a
> /sys/kernel/debug/tracing/user_events_status/kernel_tracers file).
> 
> Then eventually if we have other user-space tracers such as lttng-ust with
> their own user-space code performing tracing in a shared memory ring buffer, it
> would make sense to allow it to register its own
> /sys/kernel/debug/tracing/user_events_status/lttng_ust file, with its own indexes.
> 

I don't follow that. The intention is to get user processes to participate with
trace_events and the built-in tooling. When would a user-space tracer be used
instead of perf/ftrace?

It seems like a feature request?

> If this facility is ever used by lttng-ust to enable user-space tracing, I would not
> want to take the overhead of calling writev for the sake of kernel tracers if
> those are disabled.
> 

If they were disabled the byte wouldn't be set, right? So no writev overhead.

Seems I'm missing something.

> So perhaps in the short-term there is no need to implement the user-space tracer
> registration ABI, but I would have expected a simple bitmap for
> /sys/kernel/debug/tracing/user_events_data/kernel_tracers rather than the
> bytewise index, because as far as the kernel tracers are concerned, providing
> the bit to tell userspace instrumentation exactly which tracers are internally
> enabled within the kernel does not appear to be of any use other than increasing
> the footprint on the actively used cpu cache lines.
> 
> 
> * user_events_data page faults
> 
> If my understanding is correct, when the user-space program's memory containing
> the payload passed to writev() to a user_events_data file descriptor is kicked
> out from the page cache between fault_in_iov_iter_readable and its use by the
> tracers due to high memory pressure, the writev() will fail with -EFAULT and
> the data will be discarded unless user-space somehow handles this error (which
> is not handled in the samples/user_events/sample.c example program). It is good
> that the memory is faulted in immediately before calling the tracers, but
> considering that it is not mlock'd, should we make more effort to ensure the
> tracers are able to handle page faults ?
> 
> Integration of the work done by Michael Jeanson and myself on faultable tracepoint
> would allow the tracepoint probes to take page faults. Then, further modifications
> in the kernel tracers would be needed to handle those page faults.
> 

Is this something that can be done later or does it require ABI changes?

I would love to never miss data due to page faults.

> 
> * user_reg name_args and write_index vs purely user-space tracers
> 
> That part of the user event registration (event layout and ID allocation) appears
> to be intrinsically tied to the kernel tracers and the expected event layout. This
> seems fine as long as the only users we consider are the kernel tracers, but it
> appears to be less relevant for purely user-space tracers. Actually, tying the
> mmap'd event enable mechanism with the event ID and description makes me wonder
> whether it might be better to have LTTng-UST implement its own shared-memory based
> "fast-event-enabling" mechanism rather than use this user-event ABI. The other
> advantage of doing all of this in user-space would be to allow many instances
> of this bitmap to exist on a given system, e.g. one per container in a multi-container
> system, rather than requiring this to be a global kernel-wide singleton, and to use
> it from a non-privileged user.
> 

We have some conversation going about using namespaces/cgroups to isolate
containers with bitmaps/status pages. The main thing I personally want to be
able to do is from the root namespace see all the events in the descendents
easily via perf, eBPF or ftrace.

Link: https://lore.kernel.org/linux-trace-devel/20220316232009.7952988633787ef1003f13b0@kernel.org/

> 
> Some comments about the implementation:
> 
> kernel/trace/trace_events_user.c:
> static ssize_t user_events_write(struct file *file, const char __user *ubuf,
>                                  size_t count, loff_t *ppos)
> {
>         struct iovec iov;
>         struct iov_iter i;
> 
>         if (unlikely(*ppos != 0))
>                 return -EFAULT;
> 
>         if (unlikely(import_single_range(READ, (char *)ubuf, count, &iov, &i)))
>                 return -EFAULT;
>                                          ^ shouldn't this be "WRITE" ? This takes data from
>                                            user-space and copies it into the kernel, similarly
>                                            to fs/read_write.c:new_sync_write().

I think so; I mistook the direction/rw flag for what the intention/protection
was. It appears the direction in this case is the direction from the user's
perspective.

> 
>         return user_events_write_core(file, &i);
> }
> 
> include/uapi/linux/user_events.h:
> 
> struct user_reg {
> 
>         /* Input: Size of the user_reg structure being used */
>         __u32 size;
> 
>         /* Input: Pointer to string with event name, description and flags */
>         __u64 name_args;
> 
>         /* Output: Byte index of the event within the status page */
>         __u32 status_index;
> 
>         /* Output: Index of the event to use when writing data */
>         __u32 write_index;
> };
> 
> As this structure is expected to grow, and the user-space sample program uses "sizeof()"
> to figure out its size (which includes padding), I would be more comfortable if this was
> a packed structure rather than non-packed, because as fields are added, it's tricky to
> figure out from the kernel perspective whether the size received are fields that user-space
> is aware of, or if this is just padding.
> 

I think that would be a good idea, Steven?

> include/uapi/linux/user_events.h:
> 
> struct user_bpf_iter {
> 
>         /* Offset of the data within the first iovec */
>         __u32 iov_offset;
> 
>         /* Number of iovec structures */
>         __u32 nr_segs;
> 
>         /* Pointer to iovec structures */
>         const struct iovec *iov;
> 
>                            ^ a pointer in a uapi header is usually a no-go. This should be a u64.
> };
> 
> include/uapi/linux/user_events.h:
> 
> struct user_bpf_context {
> 
>         /* Data type being passed (see union below) */
>         __u32 data_type;
> 
>         /* Length of the data */
>         __u32 data_len;
> 
>         /* Pointer to data, varies by data type */
>         union {
>                 /* Kernel data (data_type == USER_BPF_DATA_KERNEL) */
>                 void *kdata;
> 
>                 /* User data (data_type == USER_BPF_DATA_USER) */
>                 void *udata;
> 
>                 /* Direct iovec (data_type == USER_BPF_DATA_ITER) */
>                 struct user_bpf_iter *iter;
> 
>                                ^ likewise for the 3 pointers above. Should be u64 in uapi headers.
>         };
> };
> 

The bpf structs are only used within a BPF program. At that point the pointer
sizes should all align, right?

I wanted to ensure libbpf type scenarios have easy access to the BPF argument
structures.

I guess I could move them to the kernel side, but then users would have to get
access to them for BPF programs.

> kernel/trace/trace_events_user.c:
> 
> static long user_reg_get(struct user_reg __user *ureg, struct user_reg *kreg)
> {
>         u32 size;
>         long ret;
> 
>         ret = get_user(size, &ureg->size);
> 
>         if (ret)
>                 return ret;
> 
> 
>         if (size > PAGE_SIZE)
>                 return -E2BIG;
> 
>         ^ here I would be tempted to validate that the structure size at least provides room
>           for the "v0" ABI, e.g.:
> 
>              if (size < offsetofend(struct user_reg, write_index))
>                   return -EINVAL;
> 
>         return copy_struct_from_user(kreg, sizeof(*kreg), ureg, size);
> 
>               ^ I find it odd that the kernel copy of struct user_reg may contain a
>                 size field which contents differs from the size fetched by get_user().
>                 This can happen if a buggy or plainly hostile user-space attempts to
>                 confuse the kernel about the size of this structure. Fortunately, the
>                 size field does not seem to be used afterwards, but I would think it
>                 safer to copy back the "size" fetched by get_user into the reg->size
>                 after copy_struct_from_user in case future changes in the code end up
>                 relying on a consistent size field.
> }
> 

Any part of the structure that the user doesn't fill in will be zeroed out; if
they pass in a larger size then we error out. I tried to get the data out from
the user once and use
it consistently from the kernel side. I didn't want users changing their values
at key places trying to get the kernel to do bad things.

This is why the size isn't used beyond the first copy (and why the data in the
write paths is only validated after paging it in and copying the data to a buffer
that the user process no longer can modify mid-write).

> kernel/trace/trace_events_user.c:
> 
> static struct user_event *find_user_event(char *name, u32 *outkey)
> {
>         struct user_event *user;
>         u32 key = user_event_key(name);
> 
>         *outkey = key;
> 
>         hash_for_each_possible(register_table, user, node, key)
>                 if (!strcmp(EVENT_NAME(user), name)) {
>                         atomic_inc(&user->refcnt);
> 
>                         ^ what happens if an ill-intended user-space populates enough references
>                           to overflow refcnt (atomic_t). I suspect it can make the kernel free
>                           memory that is still in use, and trigger a use-after-free scenario.
>                           Usually reference counters should use include/linux/refcount.h which
>                           handles reference counter saturation. user_event_parse() has also a use
>                           of atomic_inc() on that same refcnt which userspace can overflow.
> 

This is a good catch. I'll look into refcount.h, thanks!

>                         return user;
>                 }
> 
>         return NULL;
> }
> 
> kernel/trace/trace_events_user.c:
> 
> static int user_events_release(struct inode *node, struct file *file)
> {
> [...]
>         /*
>          * Ensure refs cannot change under any situation by taking the
>          * register mutex during the final freeing of the references.
>          */
>         mutex_lock(&reg_mutex);
> [...]
>         mutex_unlock(&reg_mutex);
> 
>         kfree(refs);
> 
>         ^ AFAIU, the user_events_write() does not rely on reg_mutex to ensure mutual exclusion.
>           Doing so would be prohibitive performance-wise. But I suspect that freeing "refs" here
>           without waiting for a RCU grace period can be an issue if user_events_write_core is using
>           refs concurrently with file descriptor close.
> 

refs is per-file, so if user_events_write() is running, release cannot be
called for that file instance, since it's still in use. Did I miss something?

> kernel/trace/trace_events_user.c:
> 
> static bool user_field_match(struct ftrace_event_field *field, int argc,
>                              const char **argv, int *iout)
> [...]
> 
>         for (; i < argc; ++i) {
> [...]
>                 pos += snprintf(arg_name + pos, len - pos, argv[i]);
> 
>         ^ what happens if strlen(argv[i]) > (len - pos) ? Based on lib/vsprintf.c:
> 
>  * The return value is the number of characters which would be
>  * generated for the given input, excluding the trailing null,
>  * as per ISO C99.  If the return is greater than or equal to
>  * @size, the resulting string is truncated.
> 
>         So the "pos" returned by the first call to sprintf would be greater than MAX_FIELD_ARG_NAME.
>         Then the second call to snprintf passes a @size argument of "len - pos" using the pos value
>         which is larger than len... which is a negative integer passed as argument to a size_t (unsigned).
>         So it expects a very long string. And the @buf argument is out-of-bound (field_name + pos).
>         Is this pattern for using snprintf() used elsewhere ? From a quick grep, I find this pattern in
>         a few places where AFAIU the input is not user-controlled (as it seems to be the case here), but
>         still it might be worth looking into:
> 
>              kernel/cgroup/cgroup.c:show_delegatable_files()
>              kernel/time/clocksource.c:available_clocksource_show()
> 
> Also, passing a copy of a userspace string (argv[i]) as format string argument to
> snprintf can be misused to leak kernel data to user-space.
> 
> The same function also appear to have similar issues with its use of the field->name userspace input
> string.
> 

The strings are copied first before coming down these paths, so they cannot
be forced to change mid-read, etc.

I think simply checking if any arg is beyond the MAX_FIELD_ARG_NAME would
prevent these types of issues, right?

> Unfortunately this is all the time I have for review right now, but it is at least a good starting
> point for discussion.
> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

Thanks,
-Beau

* Re: Comments on new user events ABI
  2022-03-29  0:29 ` Beau Belgrave
@ 2022-03-29 16:17   ` Mathieu Desnoyers
  2022-03-29 16:25     ` Alexei Starovoitov
  2022-03-29 21:54     ` Beau Belgrave
  0 siblings, 2 replies; 11+ messages in thread
From: Mathieu Desnoyers @ 2022-03-29 16:17 UTC (permalink / raw)
  To: Beau Belgrave
  Cc: Beau Belgrave, linux-arch, linux-kernel, linux-trace-devel,
	rostedt, Masami Hiramatsu, Alexei Starovoitov

----- On Mar 28, 2022, at 8:29 PM, Beau Belgrave beaub@linux.microsoft.com wrote:

>> ----- On Mar 28, 2022, at 4:24 PM, Mathieu Desnoyers
>> mathieu.desnoyers@efficios.com wrote:
>> Hi Beau, Hi Steven,
>> 
>> I've done a review of the trace events ABI, and I have a few comments.
>> Sorry for being late to the party, but I only noticed this new ABI recently.
>> Hopefully we can improve this ABI before the 5.18 release.
>> 
> 
> Welcome to the party :)
> 
>> * user_events_status memory mapping
>> 
>> As I understand it, one part of the user events ABI is a memory mapping
>> which contains "flags" which indicates whether a given event is enabled.
>> It is indexed by byte, and each byte has this bitwise meaning:
>> 
>> /* Bits 0-6 are for known probe types, Bit 7 is for unknown probes */
>> #define EVENT_BIT_FTRACE 0
>> #define EVENT_BIT_PERF 1
>> #define EVENT_BIT_OTHER 7
>> 
>> There are a few things I find odd here. First, to improve use of CPU cache,
>> I would have expected this memory mapping to expose enable flags as a
>> bitmap rather than an array of bytes, indexed bit-wise rather than byte-wise.
>> I also don't get what user-space is expected to do differently if FTRACE vs
>> PERF is enabled, considering that it gates a writev() to a file descriptor
>> associated with /sys/kernel/debug/tracing/user_events_data.
>> 
> 
> The intention wasn't for user-space to check the byte other than non-zero.
> User processes could do that, but it's more so administration tools can see
> where the events are registered if they cannot be closed and the state of the
> machine.

Administration tools could simply use a seqfile interface and parse text. I
don't expect those to be performance-sensitive at all. So we should not design
the overhead-sensitive mmap'd ABI (enable flags) for administration tools.

> Maybe you have a slicker way to do this, but it seems to check a bit in the
> page would involve at least a byte read followed by a mask or shift?

I suspect the minimum cost in terms of instruction would be to have the mask
prepared in advance (when the event is registered), and therefore the instructions
required on the user-space fast-path would be a load followed by a mask (&).

> That seems more expensive than checking a byte?

It is one single extra ALU instruction on the fast-path, indeed. But we have to
be careful about optimizing too much for a non-representative microbenchmark
and missing the big picture, which includes L1/L2 cache lines and TLB load overhead
when tracing real-life workloads.

On modern x86 CPUs, an ALU instruction takes around 0.3ns, which is a few
orders of magnitude faster than a cache miss. Using bit-wise rather than
byte-wise indexing increases functional density of those enable flags by a
factor of 8, which lessens the amount of CPU cache required by that same factor.

So as soon as the workload running on the system fills the CPU L1 or L2 caches,
when the number of user events registered scales up, and when the checks for
events enabled are done frequently enough, evicting fewer cache lines becomes
more important than the extra instruction required to apply the mask on the
fast-path.
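
Concretely, on the user-space side the word pointer and mask would be computed
once at registration time, so the fast path is a single load plus an AND.
Sketch (hypothetical names):

struct event_enable_check {
        const unsigned long *word;      /* &status_map[bit / BITS_PER_LONG] */
        unsigned long mask;             /* 1UL << (bit % BITS_PER_LONG)     */
};

static inline int event_is_enabled(const struct event_enable_check *c)
{
        return (*c->word & c->mask) != 0;
}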

> 
>> I would have rather thought that tracers implemented in user-space could
>> register
>> themselves, and then there could be one
>> /sys/kernel/debug/tracing/user_events_status
>> per tracer. Considering that all kernel tracers use the same ABI to write an
>> event,
>> and then dispatch this event internally within the kernel to each registered
>> tracer, I would expect to have a single memory mapping for all those (e.g. a
>> /sys/kernel/debug/tracing/user_events_status/kernel_tracers file).
>> 
>> Then eventually if we have other user-space tracers such as lttng-ust with
>> their own user-space code performing tracing in a shared memory ring buffer, it
>> would make sense to allow it to register its own
>> /sys/kernel/debug/tracing/user_events_status/lttng_ust file, with its own
>> indexes.
>> 
> 
> I don't follow that. The intention is to get user processes to participate with
> trace_events and the built-in tooling. When would a user-space tracer be used
> instead of perf/ftrace?
> 
> It seems like a feature request?

It can very well be out of scope for the user events, and I'm fine with that.
I was merely considering how the user events could be leveraged by tracers
implemented purely in user-space. But if the stated goal of this feature is
really to call into kernel tracers through a writev(), then I suspect that
supporting purely user-space tracers is indeed out of scope.

> 
>> If this facility is ever used by lttng-ust to enable user-space tracing, I would
>> not
>> want to take the overhead of calling writev for the sake of kernel tracers if
>> those are disabled.
>> 
> 
> If they were disabled the byte wouldn't be set, right? So no writev overhead.
> 
> Seems I'm missing something.

I was wondering whether we could leverage the user events bit-enable ABI for
tracers which are not implemented within the Linux kernel (e.g. LTTng-UST).
But I suspect it might be better for me to re-implement this in user-space
over shared memory.

> 
>> So perhaps in the short-term there is no need to implement the user-space tracer
>> registration ABI, but I would have expected a simple bitmap for
>> /sys/kernel/debug/tracing/user_events_data/kernel_tracers rather than the
>> bytewise index, because as far as the kernel tracers are concerned, providing
>> the bit to tell userspace instrumentation exactly which tracers are internally
>> enabled within the kernel does not appear to be of any use other than increasing
>> the footprint on the actively used cpu cache lines.
>> 
>> 
>> * user_events_data page faults
>> 
>> If my understanding is correct, when the user-space program's memory containing
>> the payload passed to writev() to a user_events_data file descriptor is kicked
>> out from the page cache between fault_in_iov_iter_readable and its use by the
>> tracers due to high memory pressure, the writev() will fail with -EFAULT and
>> the data will be discarded unless user-space somehow handles this error (which
>> is not handled in the samples/user_events/sample.c example program). It is good
>> that the memory is faulted in immediately before calling the tracers, but
>> considering that it is not mlock'd, should we make more effort to ensure the
>> tracers are able to handle page faults ?
>> 
>> Integration of the work done by Michael Jeanson and myself on faultable
>> tracepoint
>> would allow the tracepoint probes to take page faults. Then, further
>> modifications
>> in the kernel tracers would be needed to handle those page faults.
>> 
> 
> Is this something that can be done later or does it require ABI changes?
> 
> I would love to never miss data due to page faults.

This is internal to the user events, tracepoint, and kernel tracers
implementation. I don't expect this to modify the user events ABI.

The only thing that might require some thinking ABI-wise is how the ring
buffers are exposed to user-space consumers, because we would want to
allow taking page faults between space reservation and commit.

The LTTng ring buffer has been supporting this out of the box for years
now, but this may be trickier for other kernel tracers, for which allowing
preemption between reserve and commit has never been a requirement until
now.

> 
>> 
>> * user_reg name_args and write_index vs purely user-space tracers
>> 
>> That part of the user event registration (event layout and ID allocation)
>> appears
>> to be intrinsically tied to the kernel tracers and the expected event layout.
>> This
>> seems fine as long as the only users we consider are the kernel tracers, but it
>> appears to be less relevant for purely user-space tracers. Actually, tying the
>> mmap'd event enable mechanism with the event ID and description makes me wonder
>> whether it might be better to have LTTng-UST implement its own shared-memory
>> based
>> "fast-event-enabling" mechanism rather than use this user-event ABI. The other
>> advantage of doing all of this in user-space would be to allow many instances
>> of this bitmap to exist on a given system, e.g. one per container in a
>> multi-container
>> system, rather than requiring this to be a global kernel-wide singleton, and to
>> use
>> it from a non-privileged user.
>> 
> 
> We have some conversation going about using namespaces/cgroups to isolation
> containers with bitmaps/status pages. The main thing I personally want to be
> able to do is from the root namespace see all the events in the descendents
> easily via perf, eBPF or ftrace.
> 
> Link:
> https://lore.kernel.org/linux-trace-devel/20220316232009.7952988633787ef1003f13b0@kernel.org/
> 

I see that a notion close to "tracing namespaces" comes up in this thread. This is something
I brought forward at the last Linux Plumbers Conference aiming to facilitate system call tracing
for a given container (or from a process hierarchy). I suspect the tracing namespace could also
be tied to a set of kernel buffers, and to a user events "domain". I think this concept could
neatly solve many of our container-related isolation issues here. As you say, it should probably
be a hierarchical namespace.

>> 
>> Some comments about the implementation:
[...]
> 
>> 
>>         return user_events_write_core(file, &i);
>> }
>> 
>> include/uapi/linux/user_events.h:
>> 
>> struct user_reg {
>> 
>>         /* Input: Size of the user_reg structure being used */
>>         __u32 size;
>> 
>>         /* Input: Pointer to string with event name, description and flags */
>>         __u64 name_args;
>> 
>>         /* Output: Byte index of the event within the status page */
>>         __u32 status_index;
>> 
>>         /* Output: Index of the event to use when writing data */
>>         __u32 write_index;
>> };
>> 
>> As this structure is expected to grow, and the user-space sample program uses
>> "sizeof()"
>> to figure out its size (which includes padding), I would be more comfortable if
>> this was
>> a packed structure rather than non-packed, because as fields are added, it's
>> tricky to
>> figure out from the kernel perspective whether the size received are fields that
>> user-space
>> is aware of, or if this is just padding.
>> 
> 
> I think that would be a good idea, Steven?

[leaving this for Steven to answer]

> 
>> include/uapi/linux/user_events.h:
>> 
>> struct user_bpf_iter {
>> 
>>         /* Offset of the data within the first iovec */
>>         __u32 iov_offset;
>> 
>>         /* Number of iovec structures */
>>         __u32 nr_segs;
>> 
>>         /* Pointer to iovec structures */
>>         const struct iovec *iov;
>> 
>>                            ^ a pointer in a uapi header is usually a no-go. This should be a u64.
>> };
>> 
>> include/uapi/linux/user_events.h:
>> 
>> struct user_bpf_context {
>> 
>>         /* Data type being passed (see union below) */
>>         __u32 data_type;
>> 
>>         /* Length of the data */
>>         __u32 data_len;
>> 
>>         /* Pointer to data, varies by data type */
>>         union {
>>                 /* Kernel data (data_type == USER_BPF_DATA_KERNEL) */
>>                 void *kdata;
>> 
>>                 /* User data (data_type == USER_BPF_DATA_USER) */
>>                 void *udata;
>> 
>>                 /* Direct iovec (data_type == USER_BPF_DATA_ITER) */
>>                 struct user_bpf_iter *iter;
>> 
>>                                ^ likewise for the 3 pointers above. Should be u64 in uapi headers.
>>         };
>> };
>> 
> 
> The bpf structs are only used within a BPF program. At that point the pointer
> sizes should all align, right?

I must admit I do not know enough about the eBPF uapi practices to answer this.
[CCing Alexei on this]

It's just that seeing a pointer in a uapi header raises red flags in terms of
32 vs 64 bit compatibility, so I am bringing this up.

> 
> I wanted to ensure libbpf type scenarios have easy access to the BPF argument
> structures.
> 
> I guess I could move them to the kernel side, but then users would have to get
> access to them for BPF programs.
> 
>> kernel/trace/trace_events_user.c:
>> 
>> static long user_reg_get(struct user_reg __user *ureg, struct user_reg *kreg)
>> {
>>         u32 size;
>>         long ret;
>> 
>>         ret = get_user(size, &ureg->size);
>> 
>>         if (ret)
>>                 return ret;
>> 
>> 
>>         if (size > PAGE_SIZE)
>>                 return -E2BIG;
>> 
>>         ^ here I would be tempted to validate that the structure size at least provides
>>         room
>>           for the "v0" ABI, e.g.:
>> 
>>              if (size < offsetofend(struct user_reg, write_index))
>>                   return -EINVAL;
>> 
>>         return copy_struct_from_user(kreg, sizeof(*kreg), ureg, size);
>> 
>>               ^ I find it odd that the kernel copy of struct user_reg may contain a
>>                 size field which contents differs from the size fetched by get_user().
>>                 This can happen if a buggy or plainly hostile user-space attempts to
>>                 confuse the kernel about the size of this structure. Fortunately, the
>>                 size field does not seem to be used afterwards, but I would think it
>>                 safer to copy back the "size" fetched by get_user into the reg->size
>>                 after copy_struct_from_user in case future changes in the code end up
>>                 relying on a consistent size field.
>> }
>> 
> 
> Any size that the user doesn't fill in will zero out, if they enter a large
> size then we error out. I tried to get the data out from the user once and use
> it consistently from the kernel side. I didn't want users changing their values
> at key places trying to get the kernel to do bad things.
> 
> This is why the size isn't used beyond the first copy (and why the data in the
> write paths is only validated after paging in and copy the data to a buffer
> that the user process no longer can modify mid-write).

As long as the kreg->size value is not used afterwards, it's technically fine, but
having it there and inconsistent with the size value loaded by get_user() is error-prone.
I would recommend either overwriting this kreg->size value with the size loaded by
get_user(), or documenting that this field should never be used because it is
inconsistent with the size used for copy_struct_from_user().

[...]

>> kernel/trace/trace_events_user.c:
>> 
>> static int user_events_release(struct inode *node, struct file *file)
>> {
>> [...]
>>         /*
>>          * Ensure refs cannot change under any situation by taking the
>>          * register mutex during the final freeing of the references.
>>          */
>>         mutex_lock(&reg_mutex);
>> [...]
>>         mutex_unlock(&reg_mutex);
>> 
>>         kfree(refs);
>> 
>>         ^ AFAIU, the user_events_write() does not rely on reg_mutex to ensure mutual
>>         exclusion.
>>           Doing so would be prohibitive performance-wise. But I suspect that freeing
>>           "refs" here
>>           without waiting for a RCU grace period can be an issue if user_events_write_core
>>           is using
>>           refs concurrently with file descriptor close.
>> 
> 
> The refs is per-file, so if user_events_write() is running release cannot be
> called for that file instance, since it's being used. Did I miss something?

I'm possibly the one missing something here, but what prevents two threads
from doing user_events_write() and close() concurrently on the same file descriptor?

> 
>> kernel/trace/trace_events_user.c:
>> 
>> static bool user_field_match(struct ftrace_event_field *field, int argc,
>>                              const char **argv, int *iout)
>> [...]
>> 
>>         for (; i < argc; ++i) {
>> [...]
>>                 pos += snprintf(arg_name + pos, len - pos, argv[i]);
>> 
>>         ^ what happens if strlen(argv[i]) > (len - pos) ? Based on lib/vsprintf.c:
>> 
>>  * The return value is the number of characters which would be
>>  * generated for the given input, excluding the trailing null,
>>  * as per ISO C99.  If the return is greater than or equal to
>>  * @size, the resulting string is truncated.
>> 
>>         So the "pos" returned by the first call to sprintf would be greater than
>>         MAX_FIELD_ARG_NAME.
>>         Then the second call to snprintf passes a @size argument of "len - pos" using
>>         the pos value
>>         which is larger than len... which is a negative integer passed as argument to a
>>         size_t (unsigned).
>>         So it expects a very long string. And the @buf argument is out-of-bound
>>         (field_name + pos).
>>         Is this pattern for using snprintf() used elsewhere ? From a quick grep, I find
>>         this pattern in
>>         a few places where AFAIU the input is not user-controlled (as it seems to be the
>>         case here), but
>>         still it might be worth looking into:
>> 
>>              kernel/cgroup/cgroup.c:show_delegatable_files()
>>              kernel/time/clocksource.c:available_clocksource_show()
>> 
>> Also, passing a copy of a userspace string (argv[i]) as format string argument
>> to
>> snprintf can be misused to leak kernel data to user-space.
>> 
>> The same function also appear to have similar issues with its use of the
>> field->name userspace input
>> string.
>> 
> 
> The strings are copied first before coming down these paths, so they cannot
> be forced to changed mid-reads, etc.

That's correct. But their combined length can be larger than MAX_FIELD_ARG_NAME.

> I think simply checking if any arg is beyond the MAX_FIELD_ARG_NAME would
> prevent these types of issues, right?

No. The issue is when the combined length is larger than the max. If I look at
other tracing code, I notice that it often uses a pattern where snprintf is first
called on the input with a @size=0 to calculate the string length, then the
output string is dynamically allocated with the correct length, and the entire
code performing the snprintf formatting is called again with a nonzero size,
thus effectively populating the output string. There are other ways to prevent
these issues, but I would be tempted to use the same pattern used elsewhere in
the tracing code. See trace_events_user.c:user_event_set_print_fmt() as an example.

And don't forget that passing a copy of a user input as snprintf format string is
bad, because it allows user-space to provide event/fields names with e.g. "%s" or
"%p" and control the behavior of snprintf in unwanted ways, typically leading to
a kernel memory information leak to user-space.
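
For illustration, here is a rough sketch of that two-pass pattern applied to
building the argument name (hypothetical helper; the real code would then
compare the resulting string against the existing field):

static int user_build_arg_name(char **out, int argc, const char **argv)
{
        char *buf;
        int len, pos, i;

        /* First pass: measure. snprintf() with a zero size only returns the length. */
        for (len = 1, i = 0; i < argc; ++i)
                len += snprintf(NULL, 0, "%s ", argv[i]);

        buf = kmalloc(len, GFP_KERNEL);

        if (!buf)
                return -ENOMEM;

        /* Second pass: format for real, always through an explicit "%s" format. */
        for (pos = 0, i = 0; i < argc; ++i)
                pos += snprintf(buf + pos, len - pos, "%s ", argv[i]);

        *out = buf;

        return 0;
}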

Thanks,

Mathieu

> 
>> Unfortunately this is all the time I have for review right now, but it is at
>> least a good starting
>> point for discussion.
>> 
>> Thanks,
>> 
>> Mathieu
>> 
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> http://www.efficios.com
> 
> Thanks,
> -Beau

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

* Re: Comments on new user events ABI
  2022-03-29 16:17   ` Mathieu Desnoyers
@ 2022-03-29 16:25     ` Alexei Starovoitov
  2022-03-29 16:57       ` Beau Belgrave
  2022-03-29 19:45       ` Steven Rostedt
  2022-03-29 21:54     ` Beau Belgrave
  1 sibling, 2 replies; 11+ messages in thread
From: Alexei Starovoitov @ 2022-03-29 16:25 UTC (permalink / raw)
  To: Mathieu Desnoyers, Linus Torvalds, bpf, Network Development
  Cc: Beau Belgrave, Beau Belgrave, linux-arch, linux-kernel,
	linux-trace-devel, rostedt, Masami Hiramatsu, Alexei Starovoitov

On Tue, Mar 29, 2022 at 9:17 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> >
> >> include/uapi/linux/user_events.h:
> >>
> >> struct user_bpf_iter {
> >>
> >>         /* Offset of the data within the first iovec */
> >>         __u32 iov_offset;
> >>
> >>         /* Number of iovec structures */
> >>         __u32 nr_segs;
> >>
> >>         /* Pointer to iovec structures */
> >>         const struct iovec *iov;
> >>
> >>                            ^ a pointer in a uapi header is usually a no-go. This should be a u64.
> >> };
> >>
> >> include/uapi/linux/user_events.h:
> >>
> >> struct user_bpf_context {
> >>
> >>         /* Data type being passed (see union below) */
> >>         __u32 data_type;
> >>
> >>         /* Length of the data */
> >>         __u32 data_len;
> >>
> >>         /* Pointer to data, varies by data type */
> >>         union {
> >>                 /* Kernel data (data_type == USER_BPF_DATA_KERNEL) */
> >>                 void *kdata;
> >>
> >>                 /* User data (data_type == USER_BPF_DATA_USER) */
> >>                 void *udata;
> >>
> >>                 /* Direct iovec (data_type == USER_BPF_DATA_ITER) */
> >>                 struct user_bpf_iter *iter;
> >>
> >>                                ^ likewise for the 3 pointers above. Should be u64 in uapi headers.
> >>         };
> >> };
> >>
> >
> > The bpf structs are only used within a BPF program. At that point the pointer
> > sizes should all align, right?
>
> I must admit I do not know enough about the eBPF uapi practices to answer this.
> [CCing Alexei on this]

Mathieu,

Thanks for flagging.

Whoever added this user_bpf* stuff please remove it immediately.
It was never reviewed by bpf maintainers.

It's a hard Nack to add a bpf interface to user_events.

* Re: Comments on new user events ABI
  2022-03-29 16:25     ` Alexei Starovoitov
@ 2022-03-29 16:57       ` Beau Belgrave
  2022-03-29 19:45       ` Steven Rostedt
  1 sibling, 0 replies; 11+ messages in thread
From: Beau Belgrave @ 2022-03-29 16:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Mathieu Desnoyers, Linus Torvalds, bpf, Network Development,
	Beau Belgrave, linux-arch, linux-kernel, linux-trace-devel,
	rostedt, Masami Hiramatsu, Alexei Starovoitov

On Tue, Mar 29, 2022 at 09:25:52AM -0700, Alexei Starovoitov wrote:
> On Tue, Mar 29, 2022 at 9:17 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> >
> > >
> > >> include/uapi/linux/user_events.h:
> > >>
> > >> struct user_bpf_iter {
> > >>
> > >>         /* Offset of the data within the first iovec */
> > >>         __u32 iov_offset;
> > >>
> > >>         /* Number of iovec structures */
> > >>         __u32 nr_segs;
> > >>
> > >>         /* Pointer to iovec structures */
> > >>         const struct iovec *iov;
> > >>
> > >>                            ^ a pointer in a uapi header is usually a no-go. This should be a u64.
> > >> };
> > >>
> > >> include/uapi/linux/user_events.h:
> > >>
> > >> struct user_bpf_context {
> > >>
> > >>         /* Data type being passed (see union below) */
> > >>         __u32 data_type;
> > >>
> > >>         /* Length of the data */
> > >>         __u32 data_len;
> > >>
> > >>         /* Pointer to data, varies by data type */
> > >>         union {
> > >>                 /* Kernel data (data_type == USER_BPF_DATA_KERNEL) */
> > >>                 void *kdata;
> > >>
> > >>                 /* User data (data_type == USER_BPF_DATA_USER) */
> > >>                 void *udata;
> > >>
> > >>                 /* Direct iovec (data_type == USER_BPF_DATA_ITER) */
> > >>                 struct user_bpf_iter *iter;
> > >>
> > >>                                ^ likewise for the 3 pointers above. Should be u64 in uapi headers.
> > >>         };
> > >> };
> > >>
> > >
> > > The bpf structs are only used within a BPF program. At that point the pointer
> > > sizes should all align, right?
> >
> > I must admit I do not know enough about the eBPF uapi practices to answer this.
> > [CCing Alexei on this]
> 
> Mathieu,
> 
> Thanks for flagging.
> 
> Whoever added this user_bpf* stuff please remove it immediately.
> It was never reviewed by bpf maintainers.
> 
> It's a hard Nack to add a bpf interface to user_events.

Sorry about that; I'm sending a patch to remove this.

I'll follow up with another patch that adds it back, sent out to the bpf and
linux-trace-devel lists for review.
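
For reference, a hedged sketch of the uapi convention the review refers to
(carry user pointers as __u64), using a hypothetical struct rather than the
removed code; the names "sketch_iter", "iter" and "uiov" are made up:

/* uapi header side: no raw pointers, only fixed-width integers. */
struct sketch_iter {
        __u32 iov_offset;
        __u32 nr_segs;
        __u64 iov;              /* user address of a struct iovec array */
};

/* kernel side: convert explicitly before dereferencing. */
const struct iovec __user *uiov = u64_to_user_ptr(iter.iov);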

Thanks,
-Beau

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Comments on new user events ABI
  2022-03-29 16:25     ` Alexei Starovoitov
  2022-03-29 16:57       ` Beau Belgrave
@ 2022-03-29 19:45       ` Steven Rostedt
  1 sibling, 0 replies; 11+ messages in thread
From: Steven Rostedt @ 2022-03-29 19:45 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Mathieu Desnoyers, Linus Torvalds, bpf, Network Development,
	Beau Belgrave, Beau Belgrave, linux-arch, linux-kernel,
	linux-trace-devel, Masami Hiramatsu, Alexei Starovoitov

On Tue, 29 Mar 2022 09:25:52 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> Thanks for flagging.
> 
> Whoever added this user_bpf* stuff please remove it immediately.
> It was never reviewed by bpf maintainers.

Heh, now you know how the x86 maintainers feel ;-)

> 
> It's a hard Nack to add a bpf interface to user_events.

Agreed. I'm thinking of marking the entire thing as broken so that it can
be worked on a bit more without a total revert (but still removing the BPF
portion, per your request).

Beau, I agree with Mathieu: I don't think it's a good idea to expose the
"ftrace/perf/etc" users. The only thing the application needs is a bit that
says "call the write now", and the work can be done within the kernel.
A single bit may also be better: that way you can fit many more events on a
page, and since the flags do not get modified often, they will stay in hot
cache and be fast.
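
(For scale, assuming a single 4 KiB status page: byte-wise flags top out at
4096 events, while bit-wise flags allow 32768.)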

-- Steve

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Comments on new user events ABI
  2022-03-29 16:17   ` Mathieu Desnoyers
  2022-03-29 16:25     ` Alexei Starovoitov
@ 2022-03-29 21:54     ` Beau Belgrave
  2022-03-30  6:31       ` Masami Hiramatsu
  2022-03-31 19:05       ` Mathieu Desnoyers
  1 sibling, 2 replies; 11+ messages in thread
From: Beau Belgrave @ 2022-03-29 21:54 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Beau Belgrave, linux-arch, linux-kernel, linux-trace-devel,
	rostedt, Masami Hiramatsu, Alexei Starovoitov

On Tue, Mar 29, 2022 at 12:17:14PM -0400, Mathieu Desnoyers wrote:
> ----- On Mar 28, 2022, at 8:29 PM, Beau Belgrave beaub@linux.microsoft.com wrote:
> 
> >> ----- On Mar 28, 2022, at 4:24 PM, Mathieu Desnoyers
> >> mathieu.desnoyers@efficios.com wrote:

[..]

> >> I also don't get what user-space is expected to do differently if FTRACE vs
> >> PERF is enabled, considering that it gates a writev() to a file descriptor
> >> associated with /sys/kernel/debug/tracing/user_events_data.
> >> 
> > 
> > The intention wasn't for user-space to check the byte other than non-zero.
> > User processes could do that, but it's more so administration tools can see
> > where the events are registered if they cannot be closed and the state of the
> > machine.
> 
> Administration tools could simply use a seqfile interface and parse text. I
> don't expect those to be performance-sensitive at all. So we should not design
> the overhead-sensitive mmap'd ABI (enable flags) for administration tools.
> 
> > Maybe you have a slicker way to do this, but it seems to check a bit in the
> > page would involve at least a byte read followed by a mask or shift?
> 
> I suspect the minimum cost in terms of instruction would be to have the mask
> prepared in advance (when the event is registered), and therefore the instructions
> required on the user-space fast-path would be a load followed by a mask (&).
> 
> > That seems more expensive than checking a byte?
> 
> It is one single extra ALU instruction on the fast-path, indeed. But we have to
> be careful about optimizing too much for a non-representative microbenchmark
> and missing the big picture, which includes L1/L2 cache lines and TLB load overhead
> when tracing a real-life workloads.
> 
> On modern x86 CPUs, an ALU instruction takes around 0.3ns, which is a few
> orders of magnitude faster than a cache miss. Using bit-wise rather than
> byte-wise indexing increases functional density of those enable flags by a
> factor 8, which lessens the amount of cpu cache required by that same factor.
> 
> So as soon as the workload running on the system fills the CPU L1 or L2 caches,
> when the number of user events registered scales up, and when the checks for
> events enabled are done frequently enough, evicting fewer cache lines becomes
> more important than the extra instruction required to apply the mask on the
> fast-path.
> 

Ok, makes sense. This will give us more events as well, something I hear
some complaints about.
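
For illustration, a minimal user-space sketch of the load-plus-mask fast path
being discussed; the struct layout and names are assumptions, not part of the
proposed ABI:

/* Hypothetical per-event state, prepared once at register time. */
#include <stdint.h>

struct sketch_event {
        uint32_t byte_index;    /* offset of the flag byte in the mmap'd page */
        uint8_t  mask;          /* precomputed bit mask for this event */
};

static inline int sketch_event_enabled(const uint8_t *status_page,
                                       const struct sketch_event *ev)
{
        /* Fast path: one load plus one AND. */
        return status_page[ev->byte_index] & ev->mask;
}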

> > 
> >> I would have rather thought that tracers implemented in user-space could
> >> register
> >> themselves, and then there could be one
> >> /sys/kernel/debug/tracing/user_events_status
> >> per tracer. Considering that all kernel tracers use the same ABI to write an
> >> event,
> >> and then dispatch this event internally within the kernel to each registered
> >> tracer, I would expect to have a single memory mapping for all those (e.g. a
> >> /sys/kernel/debug/tracing/user_events_status/kernel_tracers file).
> >> 
> >> Then eventually if we have other user-space tracers such as lttng-ust with its
> >> their own user-space code performing tracing in a shared memory ring buffer, it
> >> would make sense to allow it to register its own
> >> /sys/kernel/debug/tracing/user_events_status/lttng_ust file, with its own
> >> indexes.
> >> 
> > 
> > I don't follow that. The intention is to get user processes to participate with
> > trace_events and the built-in tooling. When would a user-space tracer be used
> > instead of perf/ftrace?
> > 
> > It seems like a feature request?
> 
> It can very well be out of scope for the user events, and I'm fine with that.
> I was merely considering how the user events could be leveraged by tracers
> implemented purely in user-space. But if the stated goal of this feature is
> really to call into kernel tracers through a writev(), then I suspect that
> supporting purely user-space tracers is indeed out of scope.
> 

That was the goal with this ABI. Are there ways we could change the ABI
later to accommodate this without shutting it out?

> > 
> >> If this facility is ever used by lttng-ust to enable user-space tracing, I would
> >> not
> >> want to take the overhead of calling writev for the sake of kernel tracers if
> >> those are disabled.
> >> 
> > 
> > If they were disabled the byte wouldn't be set, right? So no writev overhead.
> > 
> > Seems I'm missing something.
> 
> I was wondering whether we could leverage the user events bit-enable ABI for
> tracers which are not implemented within the Linux kernel (e.g. LTTng-UST).
> But I suspect it might be better for me to re-implement this in user-space
> over shared memory.
> 

Perhaps.

> > 
> >> So perhaps in the short-term there is no need to implement the user-space tracer
> >> registration ABI, but I would have expected a simple bitmap for
> >> /sys/kernel/debug/tracing/user_events_data/kernel_tracers rather than the
> >> bytewise index, because as far as the kernel tracers are concerned, providing
> >> the bit to tell userspace instrumentation exactly which tracers are internally
> >> enabled within the kernel does not appear to be of any use other than increasing
> >> the footprint on the actively used cpu cache lines.
> >> 
> >> 
> >> * user_events_data page faults
> >> 
> >> If my understanding is correct, when the user-space program's memory containing
> >> the payload passed to writev() to a user_events_data file descriptor is kicked
> >> out from the page cache between fault_in_iov_iter_readable and its use by the
> >> tracers due to high memory pressure, the writev() will fail with -EFAULT and
> >> the data will be discarded unless user-space somehow handles this error (which
> >> is not handled in the samples/user_events/sample.c example program). It is good
> >> that the memory is faulted in immediately before calling the tracers, but
> >> considering that it is not mlock'd, should we make more effort to ensure the
> >> tracers are able to handle page faults ?
> >> 
> >> Integration of the work done by Michael Jeanson and myself on faultable
> >> tracepoint
> >> would allow the tracepoint probes to take page faults. Then, further
> >> modifications
> >> in the kernel tracers would be needed to handle those page faults.
> >> 
> > 
> > Is this something that can be done later or does it require ABI changes?
> > 
> > I would love to never miss data due to page faults.
> 
> This is internal to the user events, tracepoint, and kernel tracers
> implementation. I don't expect this to modify the user events ABI.
> 
> The only thing that might require some thinking ABI-wise is how the ring
> buffers are exposed to user-space consumers, because we would want to
> allow taking page faults between space reservation and commit.
> 
> The LTTng ring buffer has been supporting this out of the box for years
> now, but this may be trickier for other kernel tracers, for which allowing
> preemption between reserve and commit has never been a requirement until
> now.
> 

I'll let Steven comment on that one.

> > 
> >> 
> >> * user_reg name_args and write_index vs purely user-space tracers
> >> 
> >> That part of the user event registration (event layout and ID allocation)
> >> appears
> >> to be intrinsically tied to the kernel tracers and the expected event layout.
> >> This
> >> seems fine as long as the only users we consider are the kernel tracers, but it
> >> appears to be less relevant for purely user-space tracers. Actually, tying the
> >> mmap'd event enable mechanism with the event ID and description makes me wonder
> >> whether it might be better to have LTTng-UST implement its own shared-memory
> >> based
> >> "fast-event-enabling" mechanism rather than use this user-event ABI. The other
> >> advantage of doing all of this in user-space would be to allow many instances
> >> of this bitmap to exist on a given system, e.g. one per container in a
> >> multi-container
> >> system, rather than requiring this to be a global kernel-wide singleton, and to
> >> use
> >> it from a non-privileged user.
> >> 
> > 
> > We have some conversation going about using namespaces/cgroups to isolation
> > containers with bitmaps/status pages. The main thing I personally want to be
> > able to do is from the root namespace see all the events in the descendents
> > easily via perf, eBPF or ftrace.
> > 
> > Link:
> > https://lore.kernel.org/linux-trace-devel/20220316232009.7952988633787ef1003f13b0@kernel.org/
> > 
> 
> I see that a notion close to "tracing namespaces" comes up in this thread. This is something
> I brought forward at the last Linux Plumbers Conference aiming to facilitate system call tracing
> for a given container (or from a process hierarchy). I suspect the tracing namespace could also
> be tied to a set of kernel buffers, and to a user events "domain". I think this concept could
> neatly solve many of our container-related isolation issues here. As you say, it should probably
> be a hierarchical namespace.
> 

Agreed.

> >> 
> >> Some comments about the implementation:
> [...]
> > 
> >> 
> >>         return user_events_write_core(file, &i);
> >> }
> >> 
> >> include/uapi/linux/user_events.h:
> >> 
> >> struct user_reg {
> >> 
> >>         /* Input: Size of the user_reg structure being used */
> >>         __u32 size;
> >> 
> >>         /* Input: Pointer to string with event name, description and flags */
> >>         __u64 name_args;
> >> 
> >>         /* Output: Byte index of the event within the status page */
> >>         __u32 status_index;
> >> 
> >>         /* Output: Index of the event to use when writing data */
> >>         __u32 write_index;
> >> };
> >> 
> >> As this structure is expected to grow, and the user-space sample program uses
> >> "sizeof()"
> >> to figure out its size (which includes padding), I would be more comfortable if
> >> this was
> >> a packed structure rather than non-packed, because as fields are added, it's
> >> tricky to
> >> figure out from the kernel perspective whether the size received are fields that
> >> user-space
> >> is aware of, or if this is just padding.
> >> 
> > 
> > I think that would be a good idea, Steven?
> 
> [leaving this for Steven to answer]
> 

[..]

> >> kernel/trace/trace_events_user.c:
> >> 
> >> static long user_reg_get(struct user_reg __user *ureg, struct user_reg *kreg)
> >> {
> >>         u32 size;
> >>         long ret;
> >> 
> >>         ret = get_user(size, &ureg->size);
> >> 
> >>         if (ret)
> >>                 return ret;
> >> 
> >> 
> >>         if (size > PAGE_SIZE)
> >>                 return -E2BIG;
> >> 
> >>         ^ here I would be tempted to validate that the structure size at least provides
> >>         room
> >>           for the "v0" ABI, e.g.:
> >> 
> >>              if (size < offsetofend(struct user_reg, write_index))
> >>                   return -EINVAL;
> >> 
> >>         return copy_struct_from_user(kreg, sizeof(*kreg), ureg, size);
> >> 
> >>               ^ I find it odd that the kernel copy of struct user_reg may contain a
> >>                 size field which contents differs from the size fetched by get_user().
> >>                 This can happen if a buggy or plainly hostile user-space attempts to
> >>                 confuse the kernel about the size of this structure. Fortunately, the
> >>                 size field does not seem to be used afterwards, but I would think it
> >>                 safer to copy back the "size" fetched by get_user into the reg->size
> >>                 after copy_struct_from_user in case future changes in the code end up
> >>                 relying on a consistent size field.
> >> }
> >> 
> > 
> > Any size that the user doesn't fill in will zero out, if they enter a large
> > size then we error out. I tried to get the data out from the user once and use
> > it consistently from the kernel side. I didn't want users changing their values
> > at key places trying to get the kernel to do bad things.
> > 
> > This is why the size isn't used beyond the first copy (and why the data in the
> > write paths is only validated after paging in and copy the data to a buffer
> > that the user process no longer can modify mid-write).
> 
> As long as the kreg->size value is not used afterwards, it's technically fine, but
> having it there and inconsistent with the size value loaded by get_user() is error-prone.
> I would recommend to either overwrite this kreg->size value with the size loaded by
> get_user(), or document that this field should never be used because it is inconsistent
> with the size used for copy_struct_from_user().
> 

Alright, I'll set it afterwards.
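
For concreteness, a hedged sketch of user_reg_get() with both suggestions
applied (minimum-size check and writing the size back); illustrative only,
not the actual patch:

static long user_reg_get(struct user_reg __user *ureg, struct user_reg *kreg)
{
        u32 size;
        long ret;

        ret = get_user(size, &ureg->size);

        if (ret)
                return ret;

        if (size > PAGE_SIZE)
                return -E2BIG;

        /* Reject sizes smaller than the "v0" ABI. */
        if (size < offsetofend(struct user_reg, write_index))
                return -EINVAL;

        ret = copy_struct_from_user(kreg, sizeof(*kreg), ureg, size);

        if (ret)
                return ret;

        /* Keep the kernel copy consistent with the size actually read. */
        kreg->size = size;

        return 0;
}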

> [...]
> 
> >> kernel/trace/trace_events_user.c:
> >> 
> >> static int user_events_release(struct inode *node, struct file *file)
> >> {
> >> [...]
> >>         /*
> >>          * Ensure refs cannot change under any situation by taking the
> >>          * register mutex during the final freeing of the references.
> >>          */
> >>         mutex_lock(&reg_mutex);
> >> [...]
> >>         mutex_unlock(&reg_mutex);
> >> 
> >>         kfree(refs);
> >> 
> >>         ^ AFAIU, the user_events_write() does not rely on reg_mutex to ensure mutual
> >>         exclusion.
> >>           Doing so would be prohibitive performance-wise. But I suspect that freeing
> >>           "refs" here
> >>           without waiting for a RCU grace period can be an issue if user_events_write_core
> >>           is using
> >>           refs concurrently with file descriptor close.
> >> 
> > 
> > The refs is per-file, so if user_events_write() is running release cannot be
> > called for that file instance, since it's being used. Did I miss something?
> 
> I'm possibly the one missing something here, but what prevents 2 threads
> from doing user_event_write() and close() concurrently on the same file descriptor ?
> 

While nothing prevents it, my understanding is that the actual release()
on the file_operations won't be invoked until the file struct has been
ref'd down to zero. (fs/file_table.c:fput, fs/read_write.c:do_pwritev).

While the write call is running, the file should have a non-zero ref
count. Looking at do_pwritev(), I believe the fdget()/fdput() pair is
largely responsible for this; do you agree?

> > 
> >> kernel/trace/trace_events_user.c:
> >> 
> >> static bool user_field_match(struct ftrace_event_field *field, int argc,
> >>                              const char **argv, int *iout)
> >> [...]
> >> 
> >>         for (; i < argc; ++i) {
> >> [...]
> >>                 pos += snprintf(arg_name + pos, len - pos, argv[i]);
> >> 
> >>         ^ what happens if strlen(argv[i]) > (len - pos) ? Based on lib/vsprintf.c:
> >> 
> >>  * The return value is the number of characters which would be
> >>  * generated for the given input, excluding the trailing null,
> >>  * as per ISO C99.  If the return is greater than or equal to
> >>  * @size, the resulting string is truncated.
> >> 
> >>         So the "pos" returned by the first call to sprintf would be greater than
> >>         MAX_FIELD_ARG_NAME.
> >>         Then the second call to snprintf passes a @size argument of "len - pos" using
> >>         the pos value
> >>         which is larger than len... which is a negative integer passed as argument to a
> >>         size_t (unsigned).
> >>         So it expects a very long string. And the @buf argument is out-of-bound
> >>         (field_name + pos).
> >>         Is this pattern for using snprintf() used elsewhere ? From a quick grep, I find
> >>         this pattern in
> >>         a few places where AFAIU the input is not user-controlled (as it seems to be the
> >>         case here), but
> >>         still it might be worth looking into:
> >> 
> >>              kernel/cgroup/cgroup.c:show_delegatable_files()
> >>              kernel/time/clocksource.c:available_clocksource_show()
> >> 
> >> Also, passing a copy of a userspace string (argv[i]) as format string argument
> >> to
> >> snprintf can be misused to leak kernel data to user-space.
> >> 
> >> The same function also appear to have similar issues with its use of the
> >> field->name userspace input
> >> string.
> >> 
> > 
> > The strings are copied first before coming down these paths, so they cannot
> > be forced to changed mid-reads, etc.
> 
> That's correct. But their combined length can be larger than MAX_FIELD_ARG_NAME.
> 
> > I think simply checking if any arg is beyond the MAX_FIELD_ARG_NAME would
> > prevent these types of issues, right?
> 
> No. The issue is when the combined length is larger than the max. If I look at
> other tracing code, I notice that it often uses a pattern where snprintf is first
> called on the input with a @size=0 to calculate the string length, then the
> output string is dynamically allocated with the correct length, and the entire
> code performing the snprintf formatting is called again with a nonzero size,
> thus effectively populating the output string. There are other ways to prevent
> these issues, but I would be tempted to use the same pattern used elsewhere in
> the tracing code. See trace_events_user.c:user_event_set_print_fmt() as an example.
> 
> And don't forget that passing a copy of a user input as snprintf format string is
> bad, because it allows user-space to provide event/fields names with e.g. "%s" or
> "%p" and control the behavior of snprintf in unwanted ways, typically leading to
> a kernel memory information leak to user-space.
> 

Yes, agreed. I will address this.
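
For concreteness, a hedged sketch of the safer snprintf() pattern discussed
above (illustrative only, not the actual fix):

        /* Never pass user-controlled strings as the format argument. */
        pos += snprintf(arg_name + pos, len - pos, "%s", argv[i]);

        /*
         * snprintf() returns the length that would have been written, so a
         * result >= len means truncation; bail out instead of passing a
         * wrapped-around (len - pos) size to the next call.
         */
        if (pos >= len)
                return false;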

> Thanks,
> 
> Mathieu
> 
> > 
> >> Unfortunately this is all the time I have for review right now, but it is at
> >> least a good starting
> >> point for discussion.
> >> 
> >> Thanks,
> >> 
> >> Mathieu
> >> 
> >> --
> >> Mathieu Desnoyers
> >> EfficiOS Inc.
> >> http://www.efficios.com
> > 
> > Thanks,
> > -Beau
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

Thanks,
-Beau

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Comments on new user events ABI
  2022-03-29 21:54     ` Beau Belgrave
@ 2022-03-30  6:31       ` Masami Hiramatsu
  2022-03-31 19:05       ` Mathieu Desnoyers
  1 sibling, 0 replies; 11+ messages in thread
From: Masami Hiramatsu @ 2022-03-30  6:31 UTC (permalink / raw)
  To: Beau Belgrave
  Cc: Mathieu Desnoyers, Beau Belgrave, linux-arch, linux-kernel,
	linux-trace-devel, rostedt, Masami Hiramatsu, Alexei Starovoitov

On Tue, 29 Mar 2022 14:54:21 -0700
Beau Belgrave <beaub@linux.microsoft.com> wrote:

> On Tue, Mar 29, 2022 at 12:17:14PM -0400, Mathieu Desnoyers wrote:
> > ----- On Mar 28, 2022, at 8:29 PM, Beau Belgrave beaub@linux.microsoft.com wrote:
> > 
> > >> ----- On Mar 28, 2022, at 4:24 PM, Mathieu Desnoyers
> > >> mathieu.desnoyers@efficios.com wrote:
> 
> [..]
> 
> > >> I also don't get what user-space is expected to do differently if FTRACE vs
> > >> PERF is enabled, considering that it gates a writev() to a file descriptor
> > >> associated with /sys/kernel/debug/tracing/user_events_data.
> > >> 
> > > 
> > > The intention wasn't for user-space to check the byte other than non-zero.
> > > User processes could do that, but it's more so administration tools can see
> > > where the events are registered if they cannot be closed and the state of the
> > > machine.
> > 
> > Administration tools could simply use a seqfile interface and parse text. I
> > don't expect those to be performance-sensitive at all. So we should not design
> > the overhead-sensitive mmap'd ABI (enable flags) for administration tools.
> > 
> > > Maybe you have a slicker way to do this, but it seems to check a bit in the
> > > page would involve at least a byte read followed by a mask or shift?
> > 
> > I suspect the minimum cost in terms of instruction would be to have the mask
> > prepared in advance (when the event is registered), and therefore the instructions
> > required on the user-space fast-path would be a load followed by a mask (&).
> > 
> > > That seems more expensive than checking a byte?
> > 
> > It is one single extra ALU instruction on the fast-path, indeed. But we have to
> > be careful about optimizing too much for a non-representative microbenchmark
> > and missing the big picture, which includes L1/L2 cache lines and TLB load overhead
> > when tracing a real-life workloads.
> > 
> > On modern x86 CPUs, an ALU instruction takes around 0.3ns, which is a few
> > orders of magnitude faster than a cache miss. Using bit-wise rather than
> > byte-wise indexing increases functional density of those enable flags by a
> > factor 8, which lessens the amount of cpu cache required by that same factor.
> > 
> > So as soon as the workload running on the system fills the CPU L1 or L2 caches,
> > when the number of user events registered scales up, and when the checks for
> > events enabled are done frequently enough, evicting fewer cache lines becomes
> > more important than the extra instruction required to apply the mask on the
> > fast-path.
> > 
> 
> Ok, makes sense. This will give us more events as well, something I hear
> some complaints about.
> 
> > > 
> > >> I would have rather thought that tracers implemented in user-space could
> > >> register
> > >> themselves, and then there could be one
> > >> /sys/kernel/debug/tracing/user_events_status
> > >> per tracer. Considering that all kernel tracers use the same ABI to write an
> > >> event,
> > >> and then dispatch this event internally within the kernel to each registered
> > >> tracer, I would expect to have a single memory mapping for all those (e.g. a
> > >> /sys/kernel/debug/tracing/user_events_status/kernel_tracers file).

I'm not sure you understand the meaning of the per-event flags, but as far as
I know the reason for those per-event flags may be the same as for SDT
semaphores: an event which is NOT traced from the kernel (via perf or via
ftrace) needs to be skipped, because preparing the user-event arguments
(e.g. fetching a variable "name" in a user-space language runtime) may take a
long time.
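
For illustration, a minimal sketch of that skip-when-disabled pattern; every
helper name here is hypothetical:

        if (sketch_event_enabled(status_page, &my_event)) {
                /* Only pay this cost when a kernel tracer is attached. */
                const char *name = sketch_lookup_name(obj);     /* potentially expensive */

                sketch_write_event(data_fd, name);
        }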

> > >> Then eventually if we have other user-space tracers such as lttng-ust with its
> > >> their own user-space code performing tracing in a shared memory ring buffer, it
> > >> would make sense to allow it to register its own
> > >> /sys/kernel/debug/tracing/user_events_status/lttng_ust file, with its own
> > >> indexes.
> > >> 
> > > 
> > > I don't follow that. The intention is to get user processes to participate with
> > > trace_events and the built-in tooling. When would a user-space tracer be used
> > > instead of perf/ftrace?
> > > 
> > > It seems like a feature request?
> > 
> > It can very well be out of scope for the user events, and I'm fine with that.
> > I was merely considering how the user events could be leveraged by tracers
> > implemented purely in user-space. But if the stated goal of this feature is
> > really to call into kernel tracers through a writev(), then I suspect that
> > supporting purely user-space tracers is indeed out of scope.
> > 
> 
> That was the goal with this ABI. Are there ways we could change the ABI
> later to accommodate this without shutting it out?

Beau, do you have any plan to make a tracer-tool application?
I thought this ABI was for adding a "logger"-like interface for perf/ftrace
(or BPF, which is still a long way off).

> 
> > > 
> > >> If this facility is ever used by lttng-ust to enable user-space tracing, I would
> > >> not
> > >> want to take the overhead of calling writev for the sake of kernel tracers if
> > >> those are disabled.

I think this is a totally different design from lttng-ust, which collects
event data in user space and kernel space separately. Users will pick the
right one according to the situation and purpose.

> > > If they were disabled the byte wouldn't be set, right? So no writev overhead.
> > > 
> > > Seems I'm missing something.
> > 
> > I was wondering whether we could leverage the user events bit-enable ABI for
> > tracers which are not implemented within the Linux kernel (e.g. LTTng-UST).
> > But I suspect it might be better for me to re-implement this in user-space
> > over shared memory.
> > 
> 
> Perhaps.

Sorry, I actually doubt it, since it relies on the user-space program working
correctly and writing the event the correct way (this is why I had asked you
to verify the event format at writev() time). Or perhaps we can verify it when
merging the shared pages...

> > > 
> > >> So perhaps in the short-term there is no need to implement the user-space tracer
> > >> registration ABI, but I would have expected a simple bitmap for
> > >> /sys/kernel/debug/tracing/user_events_data/kernel_tracers rather than the
> > >> bytewise index, because as far as the kernel tracers are concerned, providing
> > >> the bit to tell userspace instrumentation exactly which tracers are internally
> > >> enabled within the kernel does not appear to be of any use other than increasing
> > >> the footprint on the actively used cpu cache lines.

I agree with this.

> > >> * user_events_data page faults
> > >> 
> > >> If my understanding is correct, when the user-space program's memory containing
> > >> the payload passed to writev() to a user_events_data file descriptor is kicked
> > >> out from the page cache between fault_in_iov_iter_readable and its use by the
> > >> tracers due to high memory pressure, the writev() will fail with -EFAULT and
> > >> the data will be discarded unless user-space somehow handles this error (which
> > >> is not handled in the samples/user_events/sample.c example program). It is good
> > >> that the memory is faulted in immediately before calling the tracers, but
> > >> considering that it is not mlock'd, should we make more effort to ensure the
> > >> tracers are able to handle page faults ?

I think this should be handled by the user-space implementation (retry?).
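
For illustration, a hypothetical user-space retry around writev(); the fd
name, payload layout, and retry policy are assumptions, not taken from the
sample program:

        /* needs <sys/uio.h> and <errno.h> */
        struct iovec io[2];
        int tries = 3;
        ssize_t ret;

        io[0].iov_base = &write_index;          /* from struct user_reg */
        io[0].iov_len  = sizeof(write_index);
        io[1].iov_base = payload;
        io[1].iov_len  = payload_len;

        do {
                ret = writev(data_fd, io, 2);
        } while (ret < 0 && errno == EFAULT && --tries > 0);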

> > >> Integration of the work done by Michael Jeanson and myself on faultable
> > >> tracepoint
> > >> would allow the tracepoint probes to take page faults. Then, further
> > >> modifications
> > >> in the kernel tracers would be needed to handle those page faults.

Interesting.

> > > Is this something that can be done later or does it require ABI changes?
> > > 
> > > I would love to never miss data due to page faults.
> > 
> > This is internal to the user events, tracepoint, and kernel tracers
> > implementation. I don't expect this to modify the user events ABI.
> > 
> > The only thing that might require some thinking ABI-wise is how the ring
> > buffers are exposed to user-space consumers, because we would want to
> > allow taking page faults between space reservation and commit.
> > 
> > The LTTng ring buffer has been supporting this out of the box for years
> > now, but this may be trickier for other kernel tracers, for which allowing
> > preemption between reserve and commit has never been a requirement until
> > now.
> > 
> 
> I'll let Steven comment on that one.
> 
> > > 
> > >> 
> > >> * user_reg name_args and write_index vs purely user-space tracers
> > >> 
> > >> That part of the user event registration (event layout and ID allocation)
> > >> appears
> > >> to be intrinsically tied to the kernel tracers and the expected event layout.
> > >> This
> > >> seems fine as long as the only users we consider are the kernel tracers, but it
> > >> appears to be less relevant for purely user-space tracers. Actually, tying the
> > >> mmap'd event enable mechanism with the event ID and description makes me wonder
> > >> whether it might be better to have LTTng-UST implement its own shared-memory
> > >> based
> > >> "fast-event-enabling" mechanism rather than use this user-event ABI. The other
> > >> advantage of doing all of this in user-space would be to allow many instances
> > >> of this bitmap to exist on a given system, e.g. one per container in a
> > >> multi-container
> > >> system, rather than requiring this to be a global kernel-wide singleton, and to
> > >> use
> > >> it from a non-privileged user.
> > >> 
> > > 
> > > We have some conversation going about using namespaces/cgroups to isolation
> > > containers with bitmaps/status pages. The main thing I personally want to be
> > > able to do is from the root namespace see all the events in the descendents
> > > easily via perf, eBPF or ftrace.
> > > 
> > > Link:
> > > https://lore.kernel.org/linux-trace-devel/20220316232009.7952988633787ef1003f13b0@kernel.org/
> > > 
> > 
> > I see that a notion close to "tracing namespaces" comes up in this thread. This is something
> > I brought forward at the last Linux Plumbers Conference aiming to facilitate system call tracing
> > for a given container (or from a process hierarchy). I suspect the tracing namespace could also
> > be tied to a set of kernel buffers, and to a user events "domain". I think this concept could
> > neatly solve many of our container-related isolation issues here. As you say, it should probably
> > be a hierarchical namespace.
> > 
> 
> Agreed.

Great :-)

> > >> Some comments about the implementation:
> > [...]
> > > 
> > >> 
> > >>         return user_events_write_core(file, &i);
> > >> }
> > >> 
> > >> include/uapi/linux/user_events.h:
> > >> 
> > >> struct user_reg {
> > >> 
> > >>         /* Input: Size of the user_reg structure being used */
> > >>         __u32 size;
> > >> 
> > >>         /* Input: Pointer to string with event name, description and flags */
> > >>         __u64 name_args;
> > >> 
> > >>         /* Output: Byte index of the event within the status page */
> > >>         __u32 status_index;
> > >> 
> > >>         /* Output: Index of the event to use when writing data */
> > >>         __u32 write_index;
> > >> };
> > >> 
> > >> As this structure is expected to grow, and the user-space sample program uses
> > >> "sizeof()"
> > >> to figure out its size (which includes padding), I would be more comfortable if
> > >> this was
> > >> a packed structure rather than non-packed, because as fields are added, it's
> > >> tricky to
> > >> figure out from the kernel perspective whether the size received are fields that
> > >> user-space
> > >> is aware of, or if this is just padding.


Indeed, any structure exposed to user space should be packed or explicitly
aligned, because it can be built on many architectures.
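
For illustration, one way the packing could look on the existing fields (a
sketch only, not a committed layout):

struct user_reg {

        /* Input: Size of the user_reg structure being used */
        __u32 size;

        /* Input: Pointer to string with event name, description and flags */
        __u64 name_args;

        /* Output: Byte index of the event within the status page */
        __u32 status_index;

        /* Output: Index of the event to use when writing data */
        __u32 write_index;
} __attribute__((__packed__));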


Thank you,

> > >> 
> > > 
> > > I think that would be a good idea, Steven?
> > 
> > [leaving this for Steven to answer]
> > 
>

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Comments on new user events ABI
  2022-03-29 21:54     ` Beau Belgrave
  2022-03-30  6:31       ` Masami Hiramatsu
@ 2022-03-31 19:05       ` Mathieu Desnoyers
  1 sibling, 0 replies; 11+ messages in thread
From: Mathieu Desnoyers @ 2022-03-31 19:05 UTC (permalink / raw)
  To: Beau Belgrave
  Cc: Beau Belgrave, linux-arch, linux-kernel, linux-trace-devel,
	rostedt, Masami Hiramatsu, Alexei Starovoitov

----- On Mar 29, 2022, at 5:54 PM, Beau Belgrave beaub@linux.microsoft.com wrote:

> On Tue, Mar 29, 2022 at 12:17:14PM -0400, Mathieu Desnoyers wrote:
>> ----- On Mar 28, 2022, at 8:29 PM, Beau Belgrave beaub@linux.microsoft.com
>> wrote:
>> 
>> >> ----- On Mar 28, 2022, at 4:24 PM, Mathieu Desnoyers
>> >> mathieu.desnoyers@efficios.com wrote:
[...]
> 
>> > 
>> >> I would have rather thought that tracers implemented in user-space could
>> >> register
>> >> themselves, and then there could be one
>> >> /sys/kernel/debug/tracing/user_events_status
>> >> per tracer. Considering that all kernel tracers use the same ABI to write an
>> >> event,
>> >> and then dispatch this event internally within the kernel to each registered
>> >> tracer, I would expect to have a single memory mapping for all those (e.g. a
>> >> /sys/kernel/debug/tracing/user_events_status/kernel_tracers file).
>> >> 
>> >> Then eventually if we have other user-space tracers such as lttng-ust with its
>> >> their own user-space code performing tracing in a shared memory ring buffer, it
>> >> would make sense to allow it to register its own
>> >> /sys/kernel/debug/tracing/user_events_status/lttng_ust file, with its own
>> >> indexes.
>> >> 
>> > 
>> > I don't follow that. The intention is to get user processes to participate with
>> > trace_events and the built-in tooling. When would a user-space tracer be used
>> > instead of perf/ftrace?
>> > 
>> > It seems like a feature request?
>> 
>> It can very well be out of scope for the user events, and I'm fine with that.
>> I was merely considering how the user events could be leveraged by tracers
>> implemented purely in user-space. But if the stated goal of this feature is
>> really to call into kernel tracers through a writev(), then I suspect that
>> supporting purely user-space tracers is indeed out of scope.
>> 
> 
> That was the goal with this ABI. Are there ways we could change the ABI
> later to accommodate this without shutting it out?

I suspect that targeting this bit-enable memory mapping only as a gate for
kernel tracers is good enough for now. Let's consider other use-cases when
we have a clear picture of how it would be used. For the moment, I suspect
that I would prefer to manage my own bit-enable memory mapping within lttng-ust
rather than use a kernel-wide facility, which opens up questions about container
isolation and the ability to run multiple concurrent instances on a given kernel
without side effects on each other.

I suspect that clarifying what this mmap-based mechanism brings over the
already existing SDT semaphores will be important to justify this new ABI.

Ref. https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation
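
For comparison, a rough sketch of the SDT-semaphore idiom being referenced;
the exact naming and section conventions vary by toolchain and are assumptions
here:

        extern unsigned short sketch_probe_semaphore;   /* bumped by the tracer on attach */

        if (sketch_probe_semaphore != 0) {
                /* Slow path: build the arguments and hit the probe site. */
        }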

[...]

> 
>> [...]
>> 
>> >> kernel/trace/trace_events_user.c:
>> >> 
>> >> static int user_events_release(struct inode *node, struct file *file)
>> >> {
>> >> [...]
>> >>         /*
>> >>          * Ensure refs cannot change under any situation by taking the
>> >>          * register mutex during the final freeing of the references.
>> >>          */
>> >>         mutex_lock(&reg_mutex);
>> >> [...]
>> >>         mutex_unlock(&reg_mutex);
>> >> 
>> >>         kfree(refs);
>> >> 
>> >>         ^ AFAIU, the user_events_write() does not rely on reg_mutex to ensure mutual
>> >>         exclusion.
>> >>           Doing so would be prohibitive performance-wise. But I suspect that freeing
>> >>           "refs" here
>> >>           without waiting for a RCU grace period can be an issue if user_events_write_core
>> >>           is using
>> >>           refs concurrently with file descriptor close.
>> >> 
>> > 
>> > The refs is per-file, so if user_events_write() is running release cannot be
>> > called for that file instance, since it's being used. Did I miss something?
>> 
>> I'm possibly the one missing something here, but what prevents 2 threads
>> from doing user_event_write() and close() concurrently on the same file
>> descriptor ?
>> 
> 
> While nothing prevents it, my understanding is that the actual release()
> on the file_operations won't be invoked until the file struct has been
> ref'd down to zero. (fs/file_table.c:fput, fs/read_write.c:do_pwritev).
> 
> While the write call is running, the file should have a non-zero ref
> count. Looking at do_pwritev(), I believe the fdget()/fdput() pair is
> largely responsible for this; do you agree?

Yes, you are correct. So the "refs" lifetime for writev vs close is guaranteed
by the reference counting of the struct file when the process file descriptor
table has multiple users (e.g. multithreaded process).

[...]

Thanks,

Mathieu



-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-03-31 19:05 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-28 20:24 Comments on new user events ABI Mathieu Desnoyers
2022-03-28 20:35 ` Mathieu Desnoyers
2022-03-28 23:30   ` Beau Belgrave
2022-03-29  0:29 ` Beau Belgrave
2022-03-29 16:17   ` Mathieu Desnoyers
2022-03-29 16:25     ` Alexei Starovoitov
2022-03-29 16:57       ` Beau Belgrave
2022-03-29 19:45       ` Steven Rostedt
2022-03-29 21:54     ` Beau Belgrave
2022-03-30  6:31       ` Masami Hiramatsu
2022-03-31 19:05       ` Mathieu Desnoyers
