linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH] tracing: change syscall number type in struct syscall_trace_*
@ 2023-10-02 13:52 Artem Savkov
  2023-10-03 22:11 ` Andrii Nakryiko
  2023-10-04  1:38 ` Steven Rostedt
  0 siblings, 2 replies; 16+ messages in thread
From: Artem Savkov @ 2023-10-02 13:52 UTC (permalink / raw)
  To: Steven Rostedt, Masami Hiramatsu
  Cc: linux-kernel, linux-trace-kernel, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, bpf, netdev, Thomas Gleixner,
	linux-rt-users, Jiri Olsa, Artem Savkov

linux-rt-devel tree contains a patch that adds an extra member to struct
trace_entry. This causes the offset of args field in struct
trace_event_raw_sys_enter be different from the one in struct
syscall_trace_enter:

struct trace_event_raw_sys_enter {
        struct trace_entry         ent;                  /*     0    12 */

        /* XXX last struct has 3 bytes of padding */
        /* XXX 4 bytes hole, try to pack */

        long int                   id;                   /*    16     8 */
        long unsigned int          args[6];              /*    24    48 */
        /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
        char                       __data[];             /*    72     0 */

        /* size: 72, cachelines: 2, members: 4 */
        /* sum members: 68, holes: 1, sum holes: 4 */
        /* paddings: 1, sum paddings: 3 */
        /* last cacheline: 8 bytes */
};

struct syscall_trace_enter {
        struct trace_entry         ent;                  /*     0    12 */

        /* XXX last struct has 3 bytes of padding */

        int                        nr;                   /*    12     4 */
        long unsigned int          args[];               /*    16     0 */

        /* size: 16, cachelines: 1, members: 3 */
        /* paddings: 1, sum paddings: 3 */
        /* last cacheline: 16 bytes */
};

This, in turn, causes perf_event_set_bpf_prog() fail while running bpf
test_profiler testcase because max_ctx_offset is calculated based on the
former struct, while off on the latter:

  10488         if (is_tracepoint || is_syscall_tp) {
  10489                 int off = trace_event_get_offsets(event->tp_event);
  10490
  10491                 if (prog->aux->max_ctx_offset > off)
  10492                         return -EACCES;
  10493         }

This patch changes the type of nr member in syscall_trace_* structs to
be long so that "args" offset is equal to that in struct
trace_event_raw_sys_enter.

Signed-off-by: Artem Savkov <asavkov@redhat.com>
---
 kernel/trace/trace.h          | 4 ++--
 kernel/trace/trace_syscalls.c | 7 ++++---
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 77debe53f07cf..cd1d24df85364 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -135,13 +135,13 @@ enum trace_type {
  */
 struct syscall_trace_enter {
 	struct trace_entry	ent;
-	int			nr;
+	long			nr;
 	unsigned long		args[];
 };
 
 struct syscall_trace_exit {
 	struct trace_entry	ent;
-	int			nr;
+	long			nr;
 	long			ret;
 };
 
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index de753403cdafb..c26939119f2e4 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -101,7 +101,7 @@ find_syscall_meta(unsigned long syscall)
 	return NULL;
 }
 
-static struct syscall_metadata *syscall_nr_to_meta(int nr)
+static struct syscall_metadata *syscall_nr_to_meta(long nr)
 {
 	if (IS_ENABLED(CONFIG_HAVE_SPARSE_SYSCALL_NR))
 		return xa_load(&syscalls_metadata_sparse, (unsigned long)nr);
@@ -132,7 +132,8 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
 	struct trace_entry *ent = iter->ent;
 	struct syscall_trace_enter *trace;
 	struct syscall_metadata *entry;
-	int i, syscall;
+	int i;
+	long syscall;
 
 	trace = (typeof(trace))ent;
 	syscall = trace->nr;
@@ -177,7 +178,7 @@ print_syscall_exit(struct trace_iterator *iter, int flags,
 	struct trace_seq *s = &iter->seq;
 	struct trace_entry *ent = iter->ent;
 	struct syscall_trace_exit *trace;
-	int syscall;
+	long syscall;
 	struct syscall_metadata *entry;
 
 	trace = (typeof(trace))ent;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH] tracing: change syscall number type in struct syscall_trace_*
  2023-10-02 13:52 [RFC PATCH] tracing: change syscall number type in struct syscall_trace_* Artem Savkov
@ 2023-10-03 22:11 ` Andrii Nakryiko
  2023-10-04  7:02   ` Artem Savkov
  2023-10-04  1:38 ` Steven Rostedt
  1 sibling, 1 reply; 16+ messages in thread
From: Andrii Nakryiko @ 2023-10-03 22:11 UTC (permalink / raw)
  To: Artem Savkov
  Cc: Steven Rostedt, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, bpf, netdev, Thomas Gleixner, linux-rt-users,
	Jiri Olsa

On Mon, Oct 2, 2023 at 6:53 AM Artem Savkov <asavkov@redhat.com> wrote:
>
> linux-rt-devel tree contains a patch that adds an extra member to struct

can you please point to the patch itself that makes that change?

> trace_entry. This causes the offset of args field in struct
> trace_event_raw_sys_enter be different from the one in struct
> syscall_trace_enter:
>
> struct trace_event_raw_sys_enter {
>         struct trace_entry         ent;                  /*     0    12 */
>
>         /* XXX last struct has 3 bytes of padding */
>         /* XXX 4 bytes hole, try to pack */
>
>         long int                   id;                   /*    16     8 */
>         long unsigned int          args[6];              /*    24    48 */
>         /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
>         char                       __data[];             /*    72     0 */
>
>         /* size: 72, cachelines: 2, members: 4 */
>         /* sum members: 68, holes: 1, sum holes: 4 */
>         /* paddings: 1, sum paddings: 3 */
>         /* last cacheline: 8 bytes */
> };
>
> struct syscall_trace_enter {
>         struct trace_entry         ent;                  /*     0    12 */
>
>         /* XXX last struct has 3 bytes of padding */
>
>         int                        nr;                   /*    12     4 */
>         long unsigned int          args[];               /*    16     0 */
>
>         /* size: 16, cachelines: 1, members: 3 */
>         /* paddings: 1, sum paddings: 3 */
>         /* last cacheline: 16 bytes */
> };
>
> This, in turn, causes perf_event_set_bpf_prog() fail while running bpf
> test_profiler testcase because max_ctx_offset is calculated based on the
> former struct, while off on the latter:
>
>   10488         if (is_tracepoint || is_syscall_tp) {
>   10489                 int off = trace_event_get_offsets(event->tp_event);
>   10490
>   10491                 if (prog->aux->max_ctx_offset > off)
>   10492                         return -EACCES;
>   10493         }
>
> This patch changes the type of nr member in syscall_trace_* structs to
> be long so that "args" offset is equal to that in struct
> trace_event_raw_sys_enter.
>
> Signed-off-by: Artem Savkov <asavkov@redhat.com>
> ---
>  kernel/trace/trace.h          | 4 ++--
>  kernel/trace/trace_syscalls.c | 7 ++++---
>  2 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 77debe53f07cf..cd1d24df85364 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -135,13 +135,13 @@ enum trace_type {
>   */
>  struct syscall_trace_enter {
>         struct trace_entry      ent;
> -       int                     nr;
> +       long                    nr;
>         unsigned long           args[];
>  };
>
>  struct syscall_trace_exit {
>         struct trace_entry      ent;
> -       int                     nr;
> +       long                    nr;
>         long                    ret;
>  };
>
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index de753403cdafb..c26939119f2e4 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -101,7 +101,7 @@ find_syscall_meta(unsigned long syscall)
>         return NULL;
>  }
>
> -static struct syscall_metadata *syscall_nr_to_meta(int nr)
> +static struct syscall_metadata *syscall_nr_to_meta(long nr)
>  {
>         if (IS_ENABLED(CONFIG_HAVE_SPARSE_SYSCALL_NR))
>                 return xa_load(&syscalls_metadata_sparse, (unsigned long)nr);
> @@ -132,7 +132,8 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
>         struct trace_entry *ent = iter->ent;
>         struct syscall_trace_enter *trace;
>         struct syscall_metadata *entry;
> -       int i, syscall;
> +       int i;
> +       long syscall;
>
>         trace = (typeof(trace))ent;
>         syscall = trace->nr;
> @@ -177,7 +178,7 @@ print_syscall_exit(struct trace_iterator *iter, int flags,
>         struct trace_seq *s = &iter->seq;
>         struct trace_entry *ent = iter->ent;
>         struct syscall_trace_exit *trace;
> -       int syscall;
> +       long syscall;
>         struct syscall_metadata *entry;
>
>         trace = (typeof(trace))ent;
> --
> 2.41.0
>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH] tracing: change syscall number type in struct syscall_trace_*
  2023-10-02 13:52 [RFC PATCH] tracing: change syscall number type in struct syscall_trace_* Artem Savkov
  2023-10-03 22:11 ` Andrii Nakryiko
@ 2023-10-04  1:38 ` Steven Rostedt
  2023-10-04 12:55   ` Artem Savkov
  1 sibling, 1 reply; 16+ messages in thread
From: Steven Rostedt @ 2023-10-04  1:38 UTC (permalink / raw)
  To: Artem Savkov
  Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf,
	netdev, Thomas Gleixner, linux-rt-users, Jiri Olsa

On Mon,  2 Oct 2023 15:52:42 +0200
Artem Savkov <asavkov@redhat.com> wrote:

> linux-rt-devel tree contains a patch that adds an extra member to struct
> trace_entry. This causes the offset of args field in struct
> trace_event_raw_sys_enter be different from the one in struct
> syscall_trace_enter:

This patch looks like it's fixing the symptom and not the issue. No code
should rely on the two event structures to be related. That's an unwanted
coupling, that will likely cause issues down the road (like the RT patch
you mentioned).


> 
> struct trace_event_raw_sys_enter {
>         struct trace_entry         ent;                  /*     0    12 */
> 
>         /* XXX last struct has 3 bytes of padding */
>         /* XXX 4 bytes hole, try to pack */
> 
>         long int                   id;                   /*    16     8 */
>         long unsigned int          args[6];              /*    24    48 */
>         /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
>         char                       __data[];             /*    72     0 */
> 
>         /* size: 72, cachelines: 2, members: 4 */
>         /* sum members: 68, holes: 1, sum holes: 4 */
>         /* paddings: 1, sum paddings: 3 */
>         /* last cacheline: 8 bytes */
> };
> 
> struct syscall_trace_enter {
>         struct trace_entry         ent;                  /*     0    12 */
> 
>         /* XXX last struct has 3 bytes of padding */
> 
>         int                        nr;                   /*    12     4 */
>         long unsigned int          args[];               /*    16     0 */
> 
>         /* size: 16, cachelines: 1, members: 3 */
>         /* paddings: 1, sum paddings: 3 */
>         /* last cacheline: 16 bytes */
> };
> 
> This, in turn, causes perf_event_set_bpf_prog() fail while running bpf
> test_profiler testcase because max_ctx_offset is calculated based on the
> former struct, while off on the latter:

The above appears to be pointing to the real bug. The "is calculated based
on the former struct while off on the latter" Why are the two being used
together? They are supposed to be *unrelated*!


> 
>   10488         if (is_tracepoint || is_syscall_tp) {
>   10489                 int off = trace_event_get_offsets(event->tp_event);

So basically this is clumping together the raw_syscalls with the syscalls
events as if they are the same. But the are not. They are created
differently. It's basically like using one structure to get the offsets of
another structure. That would be a bug anyplace else in the kernel. Sounds
like it's a bug here too.

I think the issue is with this code, not the tracing code.

We could expose the struct syscall_trace_enter and syscall_trace_exit if
the offsets to those are needed.

-- Steve


>   10490
>   10491                 if (prog->aux->max_ctx_offset > off)
>   10492                         return -EACCES;
>   10493         }
> 
> This patch changes the type of nr member in syscall_trace_* structs to
> be long so that "args" offset is equal to that in struct
> trace_event_raw_sys_enter.
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH] tracing: change syscall number type in struct syscall_trace_*
  2023-10-03 22:11 ` Andrii Nakryiko
@ 2023-10-04  7:02   ` Artem Savkov
  0 siblings, 0 replies; 16+ messages in thread
From: Artem Savkov @ 2023-10-04  7:02 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Steven Rostedt, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, bpf, netdev, Thomas Gleixner, linux-rt-users,
	Jiri Olsa

On Tue, Oct 03, 2023 at 03:11:15PM -0700, Andrii Nakryiko wrote:
> On Mon, Oct 2, 2023 at 6:53 AM Artem Savkov <asavkov@redhat.com> wrote:
> >
> > linux-rt-devel tree contains a patch that adds an extra member to struct
> 
> can you please point to the patch itself that makes that change?

Of course, some context would be useful. The patch in question is b1773eac3f29c
("sched: Add support for lazy preemption") from rt-devel tree [0]. It came up
a couple of times before: [1] [2] [3] [4].

[0] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/?id=b1773eac3f29cbdcdfd16e0339f1a164066e9f71
[1] https://lore.kernel.org/linux-rt-users/20200221153541.681468-1-jolsa@kernel.org/t/#u
[2] https://github.com/iovisor/bpftrace/commit/a2e3d5dbc03ceb49b776cf5602d31896158844a7
[3] https://lore.kernel.org/bpf/xunyjzy64q9b.fsf@redhat.com/t/#u
[4] https://lore.kernel.org/bpf/20230727150647.397626-1-ykaliuta@redhat.com/t/#u

> > trace_entry. This causes the offset of args field in struct
> > trace_event_raw_sys_enter be different from the one in struct
> > syscall_trace_enter:
> >
> > struct trace_event_raw_sys_enter {
> >         struct trace_entry         ent;                  /*     0    12 */
> >
> >         /* XXX last struct has 3 bytes of padding */
> >         /* XXX 4 bytes hole, try to pack */
> >
> >         long int                   id;                   /*    16     8 */
> >         long unsigned int          args[6];              /*    24    48 */
> >         /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
> >         char                       __data[];             /*    72     0 */
> >
> >         /* size: 72, cachelines: 2, members: 4 */
> >         /* sum members: 68, holes: 1, sum holes: 4 */
> >         /* paddings: 1, sum paddings: 3 */
> >         /* last cacheline: 8 bytes */
> > };
> >
> > struct syscall_trace_enter {
> >         struct trace_entry         ent;                  /*     0    12 */
> >
> >         /* XXX last struct has 3 bytes of padding */
> >
> >         int                        nr;                   /*    12     4 */
> >         long unsigned int          args[];               /*    16     0 */
> >
> >         /* size: 16, cachelines: 1, members: 3 */
> >         /* paddings: 1, sum paddings: 3 */
> >         /* last cacheline: 16 bytes */
> > };
> >
> > This, in turn, causes perf_event_set_bpf_prog() fail while running bpf
> > test_profiler testcase because max_ctx_offset is calculated based on the
> > former struct, while off on the latter:
> >
> >   10488         if (is_tracepoint || is_syscall_tp) {
> >   10489                 int off = trace_event_get_offsets(event->tp_event);
> >   10490
> >   10491                 if (prog->aux->max_ctx_offset > off)
> >   10492                         return -EACCES;
> >   10493         }
> >
> > This patch changes the type of nr member in syscall_trace_* structs to
> > be long so that "args" offset is equal to that in struct
> > trace_event_raw_sys_enter.
> >
> > Signed-off-by: Artem Savkov <asavkov@redhat.com>
> > ---
> >  kernel/trace/trace.h          | 4 ++--
> >  kernel/trace/trace_syscalls.c | 7 ++++---
> >  2 files changed, 6 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> > index 77debe53f07cf..cd1d24df85364 100644
> > --- a/kernel/trace/trace.h
> > +++ b/kernel/trace/trace.h
> > @@ -135,13 +135,13 @@ enum trace_type {
> >   */
> >  struct syscall_trace_enter {
> >         struct trace_entry      ent;
> > -       int                     nr;
> > +       long                    nr;
> >         unsigned long           args[];
> >  };
> >
> >  struct syscall_trace_exit {
> >         struct trace_entry      ent;
> > -       int                     nr;
> > +       long                    nr;
> >         long                    ret;
> >  };
> >
> > diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> > index de753403cdafb..c26939119f2e4 100644
> > --- a/kernel/trace/trace_syscalls.c
> > +++ b/kernel/trace/trace_syscalls.c
> > @@ -101,7 +101,7 @@ find_syscall_meta(unsigned long syscall)
> >         return NULL;
> >  }
> >
> > -static struct syscall_metadata *syscall_nr_to_meta(int nr)
> > +static struct syscall_metadata *syscall_nr_to_meta(long nr)
> >  {
> >         if (IS_ENABLED(CONFIG_HAVE_SPARSE_SYSCALL_NR))
> >                 return xa_load(&syscalls_metadata_sparse, (unsigned long)nr);
> > @@ -132,7 +132,8 @@ print_syscall_enter(struct trace_iterator *iter, int flags,
> >         struct trace_entry *ent = iter->ent;
> >         struct syscall_trace_enter *trace;
> >         struct syscall_metadata *entry;
> > -       int i, syscall;
> > +       int i;
> > +       long syscall;
> >
> >         trace = (typeof(trace))ent;
> >         syscall = trace->nr;
> > @@ -177,7 +178,7 @@ print_syscall_exit(struct trace_iterator *iter, int flags,
> >         struct trace_seq *s = &iter->seq;
> >         struct trace_entry *ent = iter->ent;
> >         struct syscall_trace_exit *trace;
> > -       int syscall;
> > +       long syscall;
> >         struct syscall_metadata *entry;
> >
> >         trace = (typeof(trace))ent;
> > --
> > 2.41.0
> >
> >
> 

-- 
 Artem


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH] tracing: change syscall number type in struct syscall_trace_*
  2023-10-04  1:38 ` Steven Rostedt
@ 2023-10-04 12:55   ` Artem Savkov
  2023-10-05 12:34     ` Artem Savkov
  0 siblings, 1 reply; 16+ messages in thread
From: Artem Savkov @ 2023-10-04 12:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf,
	netdev, Thomas Gleixner, linux-rt-users, Jiri Olsa

On Tue, Oct 03, 2023 at 09:38:44PM -0400, Steven Rostedt wrote:
> On Mon,  2 Oct 2023 15:52:42 +0200
> Artem Savkov <asavkov@redhat.com> wrote:
> 
> > linux-rt-devel tree contains a patch that adds an extra member to struct
> > trace_entry. This causes the offset of args field in struct
> > trace_event_raw_sys_enter be different from the one in struct
> > syscall_trace_enter:
> 
> This patch looks like it's fixing the symptom and not the issue. No code
> should rely on the two event structures to be related. That's an unwanted
> coupling, that will likely cause issues down the road (like the RT patch
> you mentioned).

I agree, but I didn't see a better solution and that was my way of
starting conversation, thus the RFC.

> > 
> > struct trace_event_raw_sys_enter {
> >         struct trace_entry         ent;                  /*     0    12 */
> > 
> >         /* XXX last struct has 3 bytes of padding */
> >         /* XXX 4 bytes hole, try to pack */
> > 
> >         long int                   id;                   /*    16     8 */
> >         long unsigned int          args[6];              /*    24    48 */
> >         /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
> >         char                       __data[];             /*    72     0 */
> > 
> >         /* size: 72, cachelines: 2, members: 4 */
> >         /* sum members: 68, holes: 1, sum holes: 4 */
> >         /* paddings: 1, sum paddings: 3 */
> >         /* last cacheline: 8 bytes */
> > };
> > 
> > struct syscall_trace_enter {
> >         struct trace_entry         ent;                  /*     0    12 */
> > 
> >         /* XXX last struct has 3 bytes of padding */
> > 
> >         int                        nr;                   /*    12     4 */
> >         long unsigned int          args[];               /*    16     0 */
> > 
> >         /* size: 16, cachelines: 1, members: 3 */
> >         /* paddings: 1, sum paddings: 3 */
> >         /* last cacheline: 16 bytes */
> > };
> > 
> > This, in turn, causes perf_event_set_bpf_prog() fail while running bpf
> > test_profiler testcase because max_ctx_offset is calculated based on the
> > former struct, while off on the latter:
> 
> The above appears to be pointing to the real bug. The "is calculated based
> on the former struct while off on the latter" Why are the two being used
> together? They are supposed to be *unrelated*!
> 
> 
> > 
> >   10488         if (is_tracepoint || is_syscall_tp) {
> >   10489                 int off = trace_event_get_offsets(event->tp_event);
> 
> So basically this is clumping together the raw_syscalls with the syscalls
> events as if they are the same. But the are not. They are created
> differently. It's basically like using one structure to get the offsets of
> another structure. That would be a bug anyplace else in the kernel. Sounds
> like it's a bug here too.
> 
> I think the issue is with this code, not the tracing code.
> 
> We could expose the struct syscall_trace_enter and syscall_trace_exit if
> the offsets to those are needed.

I don't think we need syscall_trace_* offsets, looks like
trace_event_get_offsets() should return offset trace_event_raw_sys_enter
instead. I am still trying to figure out how all of this works together.
Maybe Alexei or Andrii have more context here.

-- 
 Artem


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH] tracing: change syscall number type in struct syscall_trace_*
  2023-10-04 12:55   ` Artem Savkov
@ 2023-10-05 12:34     ` Artem Savkov
  2023-10-12 11:45       ` [RFC PATCH bpf-next] bpf: change syscall_nr type to int in struct syscall_tp_t Artem Savkov
  0 siblings, 1 reply; 16+ messages in thread
From: Artem Savkov @ 2023-10-05 12:34 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko
  Cc: Masami Hiramatsu, linux-kernel, linux-trace-kernel,
	Steven Rostedt, Daniel Borkmann, bpf, netdev, Thomas Gleixner,
	linux-rt-users, Jiri Olsa

On Wed, Oct 04, 2023 at 02:55:47PM +0200, Artem Savkov wrote:
> On Tue, Oct 03, 2023 at 09:38:44PM -0400, Steven Rostedt wrote:
> > On Mon,  2 Oct 2023 15:52:42 +0200
> > Artem Savkov <asavkov@redhat.com> wrote:
> > 
> > > linux-rt-devel tree contains a patch that adds an extra member to struct
> > > trace_entry. This causes the offset of args field in struct
> > > trace_event_raw_sys_enter be different from the one in struct
> > > syscall_trace_enter:
> > 
> > This patch looks like it's fixing the symptom and not the issue. No code
> > should rely on the two event structures to be related. That's an unwanted
> > coupling, that will likely cause issues down the road (like the RT patch
> > you mentioned).
> 
> I agree, but I didn't see a better solution and that was my way of
> starting conversation, thus the RFC.
> 
> > > 
> > > struct trace_event_raw_sys_enter {
> > >         struct trace_entry         ent;                  /*     0    12 */
> > > 
> > >         /* XXX last struct has 3 bytes of padding */
> > >         /* XXX 4 bytes hole, try to pack */
> > > 
> > >         long int                   id;                   /*    16     8 */
> > >         long unsigned int          args[6];              /*    24    48 */
> > >         /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
> > >         char                       __data[];             /*    72     0 */
> > > 
> > >         /* size: 72, cachelines: 2, members: 4 */
> > >         /* sum members: 68, holes: 1, sum holes: 4 */
> > >         /* paddings: 1, sum paddings: 3 */
> > >         /* last cacheline: 8 bytes */
> > > };
> > > 
> > > struct syscall_trace_enter {
> > >         struct trace_entry         ent;                  /*     0    12 */
> > > 
> > >         /* XXX last struct has 3 bytes of padding */
> > > 
> > >         int                        nr;                   /*    12     4 */
> > >         long unsigned int          args[];               /*    16     0 */
> > > 
> > >         /* size: 16, cachelines: 1, members: 3 */
> > >         /* paddings: 1, sum paddings: 3 */
> > >         /* last cacheline: 16 bytes */
> > > };
> > > 
> > > This, in turn, causes perf_event_set_bpf_prog() fail while running bpf
> > > test_profiler testcase because max_ctx_offset is calculated based on the
> > > former struct, while off on the latter:
> > 
> > The above appears to be pointing to the real bug. The "is calculated based
> > on the former struct while off on the latter" Why are the two being used
> > together? They are supposed to be *unrelated*!
> > 
> > 
> > > 
> > >   10488         if (is_tracepoint || is_syscall_tp) {
> > >   10489                 int off = trace_event_get_offsets(event->tp_event);
> > 
> > So basically this is clumping together the raw_syscalls with the syscalls
> > events as if they are the same. But the are not. They are created
> > differently. It's basically like using one structure to get the offsets of
> > another structure. That would be a bug anyplace else in the kernel. Sounds
> > like it's a bug here too.
> > 
> > I think the issue is with this code, not the tracing code.
> > 
> > We could expose the struct syscall_trace_enter and syscall_trace_exit if
> > the offsets to those are needed.
> 
> I don't think we need syscall_trace_* offsets, looks like
> trace_event_get_offsets() should return offset trace_event_raw_sys_enter
> instead. I am still trying to figure out how all of this works together.
> Maybe Alexei or Andrii have more context here.

Turns out it is even more confusing. The tests dereference the context as
struct trace_event_raw_sys_enter so bpf verifier sets max_ctx_offset
based on that, then perf_event_set_bpf_prog() checks this offset against
the one in struct syscall_trace_enter, but what bpf prog really gets is
a pointer to struct syscall_tp_t from kernel/trace/trace_syscalls.c.

I don't know the history behind these decisions, but should the tests
dereference context as struct syscall_trace_enter instead and struct
syscall_tp_t be changed to have syscall_nr as int?

-- 
 Artem


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC PATCH bpf-next] bpf: change syscall_nr type to int in struct syscall_tp_t
  2023-10-05 12:34     ` Artem Savkov
@ 2023-10-12 11:45       ` Artem Savkov
  2023-10-12 13:44         ` Steven Rostedt
  2023-10-13  3:13         ` Rod Webster
  0 siblings, 2 replies; 16+ messages in thread
From: Artem Savkov @ 2023-10-12 11:45 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf, netdev
  Cc: Steven Rostedt, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, Thomas Gleixner, linux-rt-users, Jiri Olsa,
	Artem Savkov

linux-rt-devel tree contains a patch (b1773eac3f29c ("sched: Add support
for lazy preemption")) that adds an extra member to struct trace_entry.
This causes the offset of args field in struct trace_event_raw_sys_enter
be different from the one in struct syscall_trace_enter:

struct trace_event_raw_sys_enter {
        struct trace_entry         ent;                  /*     0    12 */

        /* XXX last struct has 3 bytes of padding */
        /* XXX 4 bytes hole, try to pack */

        long int                   id;                   /*    16     8 */
        long unsigned int          args[6];              /*    24    48 */
        /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
        char                       __data[];             /*    72     0 */

        /* size: 72, cachelines: 2, members: 4 */
        /* sum members: 68, holes: 1, sum holes: 4 */
        /* paddings: 1, sum paddings: 3 */
        /* last cacheline: 8 bytes */
};

struct syscall_trace_enter {
        struct trace_entry         ent;                  /*     0    12 */

        /* XXX last struct has 3 bytes of padding */

        int                        nr;                   /*    12     4 */
        long unsigned int          args[];               /*    16     0 */

        /* size: 16, cachelines: 1, members: 3 */
        /* paddings: 1, sum paddings: 3 */
        /* last cacheline: 16 bytes */
};

This, in turn, causes perf_event_set_bpf_prog() fail while running bpf
test_profiler testcase because max_ctx_offset is calculated based on the
former struct, while off on the latter:

  10488         if (is_tracepoint || is_syscall_tp) {
  10489                 int off = trace_event_get_offsets(event->tp_event);
  10490
  10491                 if (prog->aux->max_ctx_offset > off)
  10492                         return -EACCES;
  10493         }

What bpf program is actually getting is a pointer to struct
syscall_tp_t, defined in kernel/trace/trace_syscalls.c. This patch fixes
the problem by aligning struct syscall_tp_t with with struct
syscall_trace_(enter|exit) and changing the tests to use these structs
to dereference context.

Signed-off-by: Artem Savkov <asavkov@redhat.com>
---
 kernel/trace/trace_syscalls.c                    | 4 ++--
 tools/testing/selftests/bpf/progs/profiler.inc.h | 2 +-
 tools/testing/selftests/bpf/progs/test_vmlinux.c | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index de753403cdafb..9c581d6da843a 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -556,7 +556,7 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re
 {
 	struct syscall_tp_t {
 		struct trace_entry ent;
-		unsigned long syscall_nr;
+		int syscall_nr;
 		unsigned long args[SYSCALL_DEFINE_MAXARGS];
 	} __aligned(8) param;
 	int i;
@@ -661,7 +661,7 @@ static int perf_call_bpf_exit(struct trace_event_call *call, struct pt_regs *reg
 {
 	struct syscall_tp_t {
 		struct trace_entry ent;
-		unsigned long syscall_nr;
+		int syscall_nr;
 		unsigned long ret;
 	} __aligned(8) param;
 
diff --git a/tools/testing/selftests/bpf/progs/profiler.inc.h b/tools/testing/selftests/bpf/progs/profiler.inc.h
index f799d87e87002..897061930cb76 100644
--- a/tools/testing/selftests/bpf/progs/profiler.inc.h
+++ b/tools/testing/selftests/bpf/progs/profiler.inc.h
@@ -609,7 +609,7 @@ ssize_t BPF_KPROBE(kprobe__proc_sys_write,
 }
 
 SEC("tracepoint/syscalls/sys_enter_kill")
-int tracepoint__syscalls__sys_enter_kill(struct trace_event_raw_sys_enter* ctx)
+int tracepoint__syscalls__sys_enter_kill(struct syscall_trace_enter* ctx)
 {
 	struct bpf_func_stats_ctx stats_ctx;
 
diff --git a/tools/testing/selftests/bpf/progs/test_vmlinux.c b/tools/testing/selftests/bpf/progs/test_vmlinux.c
index 4b8e37f7fd06c..78b23934d9f8f 100644
--- a/tools/testing/selftests/bpf/progs/test_vmlinux.c
+++ b/tools/testing/selftests/bpf/progs/test_vmlinux.c
@@ -16,12 +16,12 @@ bool kprobe_called = false;
 bool fentry_called = false;
 
 SEC("tp/syscalls/sys_enter_nanosleep")
-int handle__tp(struct trace_event_raw_sys_enter *args)
+int handle__tp(struct syscall_trace_enter *args)
 {
 	struct __kernel_timespec *ts;
 	long tv_nsec;
 
-	if (args->id != __NR_nanosleep)
+	if (args->nr != __NR_nanosleep)
 		return 0;
 
 	ts = (void *)args->args[0];
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH bpf-next] bpf: change syscall_nr type to int in struct syscall_tp_t
  2023-10-12 11:45       ` [RFC PATCH bpf-next] bpf: change syscall_nr type to int in struct syscall_tp_t Artem Savkov
@ 2023-10-12 13:44         ` Steven Rostedt
  2023-10-12 23:32           ` Andrii Nakryiko
  2023-10-13  3:13         ` Rod Webster
  1 sibling, 1 reply; 16+ messages in thread
From: Steven Rostedt @ 2023-10-12 13:44 UTC (permalink / raw)
  To: Artem Savkov
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf,
	netdev, Masami Hiramatsu, linux-kernel, linux-trace-kernel,
	Thomas Gleixner, linux-rt-users, Jiri Olsa

On Thu, 12 Oct 2023 13:45:50 +0200
Artem Savkov <asavkov@redhat.com> wrote:

> linux-rt-devel tree contains a patch (b1773eac3f29c ("sched: Add support
> for lazy preemption")) that adds an extra member to struct trace_entry.
> This causes the offset of args field in struct trace_event_raw_sys_enter
> be different from the one in struct syscall_trace_enter:
> 
> struct trace_event_raw_sys_enter {
>         struct trace_entry         ent;                  /*     0    12 */
> 
>         /* XXX last struct has 3 bytes of padding */
>         /* XXX 4 bytes hole, try to pack */
> 
>         long int                   id;                   /*    16     8 */
>         long unsigned int          args[6];              /*    24    48 */
>         /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
>         char                       __data[];             /*    72     0 */
> 
>         /* size: 72, cachelines: 2, members: 4 */
>         /* sum members: 68, holes: 1, sum holes: 4 */
>         /* paddings: 1, sum paddings: 3 */
>         /* last cacheline: 8 bytes */
> };
> 
> struct syscall_trace_enter {
>         struct trace_entry         ent;                  /*     0    12 */
> 
>         /* XXX last struct has 3 bytes of padding */
> 
>         int                        nr;                   /*    12     4 */
>         long unsigned int          args[];               /*    16     0 */
> 
>         /* size: 16, cachelines: 1, members: 3 */
>         /* paddings: 1, sum paddings: 3 */
>         /* last cacheline: 16 bytes */
> };
> 
> This, in turn, causes perf_event_set_bpf_prog() fail while running bpf
> test_profiler testcase because max_ctx_offset is calculated based on the
> former struct, while off on the latter:
> 
>   10488         if (is_tracepoint || is_syscall_tp) {
>   10489                 int off = trace_event_get_offsets(event->tp_event);
>   10490
>   10491                 if (prog->aux->max_ctx_offset > off)
>   10492                         return -EACCES;
>   10493         }
> 
> What bpf program is actually getting is a pointer to struct
> syscall_tp_t, defined in kernel/trace/trace_syscalls.c. This patch fixes
> the problem by aligning struct syscall_tp_t with with struct
> syscall_trace_(enter|exit) and changing the tests to use these structs
> to dereference context.
> 
> Signed-off-by: Artem Savkov <asavkov@redhat.com>

Thanks for doing a proper fix.

Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>

-- Steve

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH bpf-next] bpf: change syscall_nr type to int in struct syscall_tp_t
  2023-10-12 13:44         ` Steven Rostedt
@ 2023-10-12 23:32           ` Andrii Nakryiko
  2023-10-13  5:42             ` [PATCH " Artem Savkov
  2023-10-13  6:01             ` [RFC PATCH " Artem Savkov
  0 siblings, 2 replies; 16+ messages in thread
From: Andrii Nakryiko @ 2023-10-12 23:32 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Artem Savkov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, bpf, netdev, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, Thomas Gleixner, linux-rt-users, Jiri Olsa

On Thu, Oct 12, 2023 at 6:43 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 12 Oct 2023 13:45:50 +0200
> Artem Savkov <asavkov@redhat.com> wrote:
>
> > linux-rt-devel tree contains a patch (b1773eac3f29c ("sched: Add support
> > for lazy preemption")) that adds an extra member to struct trace_entry.
> > This causes the offset of args field in struct trace_event_raw_sys_enter
> > be different from the one in struct syscall_trace_enter:
> >
> > struct trace_event_raw_sys_enter {
> >         struct trace_entry         ent;                  /*     0    12 */
> >
> >         /* XXX last struct has 3 bytes of padding */
> >         /* XXX 4 bytes hole, try to pack */
> >
> >         long int                   id;                   /*    16     8 */
> >         long unsigned int          args[6];              /*    24    48 */
> >         /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
> >         char                       __data[];             /*    72     0 */
> >
> >         /* size: 72, cachelines: 2, members: 4 */
> >         /* sum members: 68, holes: 1, sum holes: 4 */
> >         /* paddings: 1, sum paddings: 3 */
> >         /* last cacheline: 8 bytes */
> > };
> >
> > struct syscall_trace_enter {
> >         struct trace_entry         ent;                  /*     0    12 */
> >
> >         /* XXX last struct has 3 bytes of padding */
> >
> >         int                        nr;                   /*    12     4 */
> >         long unsigned int          args[];               /*    16     0 */
> >
> >         /* size: 16, cachelines: 1, members: 3 */
> >         /* paddings: 1, sum paddings: 3 */
> >         /* last cacheline: 16 bytes */
> > };
> >
> > This, in turn, causes perf_event_set_bpf_prog() fail while running bpf
> > test_profiler testcase because max_ctx_offset is calculated based on the
> > former struct, while off on the latter:
> >
> >   10488         if (is_tracepoint || is_syscall_tp) {
> >   10489                 int off = trace_event_get_offsets(event->tp_event);
> >   10490
> >   10491                 if (prog->aux->max_ctx_offset > off)
> >   10492                         return -EACCES;
> >   10493         }
> >
> > What bpf program is actually getting is a pointer to struct
> > syscall_tp_t, defined in kernel/trace/trace_syscalls.c. This patch fixes
> > the problem by aligning struct syscall_tp_t with with struct
> > syscall_trace_(enter|exit) and changing the tests to use these structs
> > to dereference context.
> >
> > Signed-off-by: Artem Savkov <asavkov@redhat.com>
>

I think these changes make sense regardless, can you please resend the
patch without RFC tag so that our CI can run tests for it?

> Thanks for doing a proper fix.
>
> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>

But looking at [0] and briefly reading some of the discussions you,
Steven, had. I'm just wondering if it would be best to avoid
increasing struct trace_entry altogether? It seems like preempt_count
is actually a 4-bit field in trace context, so it doesn't seem like we
really need to allocate an entire byte for both preempt_count and
preempt_lazy_count. Why can't we just combine them and not waste 8
extra bytes for each trace event in a ring buffer?

  [0] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/?id=b1773eac3f29cbdcdfd16e0339f1a164066e9f71

>
> -- Steve

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH bpf-next] bpf: change syscall_nr type to int in struct syscall_tp_t
  2023-10-12 11:45       ` [RFC PATCH bpf-next] bpf: change syscall_nr type to int in struct syscall_tp_t Artem Savkov
  2023-10-12 13:44         ` Steven Rostedt
@ 2023-10-13  3:13         ` Rod Webster
  1 sibling, 0 replies; 16+ messages in thread
From: Rod Webster @ 2023-10-13  3:13 UTC (permalink / raw)
  To: Artem Savkov
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf,
	netdev, Steven Rostedt, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, Thomas Gleixner, linux-rt-users, Jiri Olsa

For the novice with the RT kernel, Could somebody  tell me what kernel
this bug was first introduced in and what kernel we need to install to
get the fix?

This could be the issue we have been experiencing in the Linuxcnc
community with excessive RT network latency (mostly with realtek
NIC's).
I had flagged Lazy preemption as being the possible  changes causing
this issue. I thought Lazy Preemption was added Circa kernel 5.09 as
it affected Debian Bullseye on Kernel 5.10 which coincided with when
we first observed the problem.

Thanks in anticipation.

Rod Webster

Rod Webster
1300 896 832
+61 435 765 611
VMN®
www.vmn.com.au

Sole Queensland Distributor


On Thu, 12 Oct 2023 at 21:46, Artem Savkov <asavkov@redhat.com> wrote:
>
> linux-rt-devel tree contains a patch (b1773eac3f29c ("sched: Add support
> for lazy preemption")) that adds an extra member to struct trace_entry.
> This causes the offset of args field in struct trace_event_raw_sys_enter
> be different from the one in struct syscall_trace_enter:
>
> struct trace_event_raw_sys_enter {
>         struct trace_entry         ent;                  /*     0    12 */
>
>         /* XXX last struct has 3 bytes of padding */
>         /* XXX 4 bytes hole, try to pack */
>
>         long int                   id;                   /*    16     8 */
>         long unsigned int          args[6];              /*    24    48 */
>         /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
>         char                       __data[];             /*    72     0 */
>
>         /* size: 72, cachelines: 2, members: 4 */
>         /* sum members: 68, holes: 1, sum holes: 4 */
>         /* paddings: 1, sum paddings: 3 */
>         /* last cacheline: 8 bytes */
> };
>
> struct syscall_trace_enter {
>         struct trace_entry         ent;                  /*     0    12 */
>
>         /* XXX last struct has 3 bytes of padding */
>
>         int                        nr;                   /*    12     4 */
>         long unsigned int          args[];               /*    16     0 */
>
>         /* size: 16, cachelines: 1, members: 3 */
>         /* paddings: 1, sum paddings: 3 */
>         /* last cacheline: 16 bytes */
> };
>
> This, in turn, causes perf_event_set_bpf_prog() fail while running bpf
> test_profiler testcase because max_ctx_offset is calculated based on the
> former struct, while off on the latter:
>
>   10488         if (is_tracepoint || is_syscall_tp) {
>   10489                 int off = trace_event_get_offsets(event->tp_event);
>   10490
>   10491                 if (prog->aux->max_ctx_offset > off)
>   10492                         return -EACCES;
>   10493         }
>
> What bpf program is actually getting is a pointer to struct
> syscall_tp_t, defined in kernel/trace/trace_syscalls.c. This patch fixes
> the problem by aligning struct syscall_tp_t with with struct
> syscall_trace_(enter|exit) and changing the tests to use these structs
> to dereference context.
>
> Signed-off-by: Artem Savkov <asavkov@redhat.com>
> ---
>  kernel/trace/trace_syscalls.c                    | 4 ++--
>  tools/testing/selftests/bpf/progs/profiler.inc.h | 2 +-
>  tools/testing/selftests/bpf/progs/test_vmlinux.c | 4 ++--
>  3 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index de753403cdafb..9c581d6da843a 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -556,7 +556,7 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re
>  {
>         struct syscall_tp_t {
>                 struct trace_entry ent;
> -               unsigned long syscall_nr;
> +               int syscall_nr;
>                 unsigned long args[SYSCALL_DEFINE_MAXARGS];
>         } __aligned(8) param;
>         int i;
> @@ -661,7 +661,7 @@ static int perf_call_bpf_exit(struct trace_event_call *call, struct pt_regs *reg
>  {
>         struct syscall_tp_t {
>                 struct trace_entry ent;
> -               unsigned long syscall_nr;
> +               int syscall_nr;
>                 unsigned long ret;
>         } __aligned(8) param;
>
> diff --git a/tools/testing/selftests/bpf/progs/profiler.inc.h b/tools/testing/selftests/bpf/progs/profiler.inc.h
> index f799d87e87002..897061930cb76 100644
> --- a/tools/testing/selftests/bpf/progs/profiler.inc.h
> +++ b/tools/testing/selftests/bpf/progs/profiler.inc.h
> @@ -609,7 +609,7 @@ ssize_t BPF_KPROBE(kprobe__proc_sys_write,
>  }
>
>  SEC("tracepoint/syscalls/sys_enter_kill")
> -int tracepoint__syscalls__sys_enter_kill(struct trace_event_raw_sys_enter* ctx)
> +int tracepoint__syscalls__sys_enter_kill(struct syscall_trace_enter* ctx)
>  {
>         struct bpf_func_stats_ctx stats_ctx;
>
> diff --git a/tools/testing/selftests/bpf/progs/test_vmlinux.c b/tools/testing/selftests/bpf/progs/test_vmlinux.c
> index 4b8e37f7fd06c..78b23934d9f8f 100644
> --- a/tools/testing/selftests/bpf/progs/test_vmlinux.c
> +++ b/tools/testing/selftests/bpf/progs/test_vmlinux.c
> @@ -16,12 +16,12 @@ bool kprobe_called = false;
>  bool fentry_called = false;
>
>  SEC("tp/syscalls/sys_enter_nanosleep")
> -int handle__tp(struct trace_event_raw_sys_enter *args)
> +int handle__tp(struct syscall_trace_enter *args)
>  {
>         struct __kernel_timespec *ts;
>         long tv_nsec;
>
> -       if (args->id != __NR_nanosleep)
> +       if (args->nr != __NR_nanosleep)
>                 return 0;
>
>         ts = (void *)args->args[0];
> --
> 2.41.0
>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH bpf-next] bpf: change syscall_nr type to int in struct syscall_tp_t
  2023-10-12 23:32           ` Andrii Nakryiko
@ 2023-10-13  5:42             ` Artem Savkov
  2023-10-13 19:50               ` patchwork-bot+netdevbpf
  2023-10-13  6:01             ` [RFC PATCH " Artem Savkov
  1 sibling, 1 reply; 16+ messages in thread
From: Artem Savkov @ 2023-10-13  5:42 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf, netdev
  Cc: Steven Rostedt, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, Thomas Gleixner, linux-rt-users, Jiri Olsa,
	Artem Savkov

linux-rt-devel tree contains a patch (b1773eac3f29c ("sched: Add support
for lazy preemption")) that adds an extra member to struct trace_entry.
This causes the offset of args field in struct trace_event_raw_sys_enter
be different from the one in struct syscall_trace_enter:

struct trace_event_raw_sys_enter {
        struct trace_entry         ent;                  /*     0    12 */

        /* XXX last struct has 3 bytes of padding */
        /* XXX 4 bytes hole, try to pack */

        long int                   id;                   /*    16     8 */
        long unsigned int          args[6];              /*    24    48 */
        /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
        char                       __data[];             /*    72     0 */

        /* size: 72, cachelines: 2, members: 4 */
        /* sum members: 68, holes: 1, sum holes: 4 */
        /* paddings: 1, sum paddings: 3 */
        /* last cacheline: 8 bytes */
};

struct syscall_trace_enter {
        struct trace_entry         ent;                  /*     0    12 */

        /* XXX last struct has 3 bytes of padding */

        int                        nr;                   /*    12     4 */
        long unsigned int          args[];               /*    16     0 */

        /* size: 16, cachelines: 1, members: 3 */
        /* paddings: 1, sum paddings: 3 */
        /* last cacheline: 16 bytes */
};

This, in turn, causes perf_event_set_bpf_prog() fail while running bpf
test_profiler testcase because max_ctx_offset is calculated based on the
former struct, while off on the latter:

  10488         if (is_tracepoint || is_syscall_tp) {
  10489                 int off = trace_event_get_offsets(event->tp_event);
  10490
  10491                 if (prog->aux->max_ctx_offset > off)
  10492                         return -EACCES;
  10493         }

What bpf program is actually getting is a pointer to struct
syscall_tp_t, defined in kernel/trace/trace_syscalls.c. This patch fixes
the problem by aligning struct syscall_tp_t with with struct
syscall_trace_(enter|exit) and changing the tests to use these structs
to dereference context.

Signed-off-by: Artem Savkov <asavkov@redhat.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>

---
 kernel/trace/trace_syscalls.c                    | 4 ++--
 tools/testing/selftests/bpf/progs/profiler.inc.h | 2 +-
 tools/testing/selftests/bpf/progs/test_vmlinux.c | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index de753403cdafb..9c581d6da843a 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -556,7 +556,7 @@ static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *re
 {
 	struct syscall_tp_t {
 		struct trace_entry ent;
-		unsigned long syscall_nr;
+		int syscall_nr;
 		unsigned long args[SYSCALL_DEFINE_MAXARGS];
 	} __aligned(8) param;
 	int i;
@@ -661,7 +661,7 @@ static int perf_call_bpf_exit(struct trace_event_call *call, struct pt_regs *reg
 {
 	struct syscall_tp_t {
 		struct trace_entry ent;
-		unsigned long syscall_nr;
+		int syscall_nr;
 		unsigned long ret;
 	} __aligned(8) param;
 
diff --git a/tools/testing/selftests/bpf/progs/profiler.inc.h b/tools/testing/selftests/bpf/progs/profiler.inc.h
index f799d87e87002..897061930cb76 100644
--- a/tools/testing/selftests/bpf/progs/profiler.inc.h
+++ b/tools/testing/selftests/bpf/progs/profiler.inc.h
@@ -609,7 +609,7 @@ ssize_t BPF_KPROBE(kprobe__proc_sys_write,
 }
 
 SEC("tracepoint/syscalls/sys_enter_kill")
-int tracepoint__syscalls__sys_enter_kill(struct trace_event_raw_sys_enter* ctx)
+int tracepoint__syscalls__sys_enter_kill(struct syscall_trace_enter* ctx)
 {
 	struct bpf_func_stats_ctx stats_ctx;
 
diff --git a/tools/testing/selftests/bpf/progs/test_vmlinux.c b/tools/testing/selftests/bpf/progs/test_vmlinux.c
index 4b8e37f7fd06c..78b23934d9f8f 100644
--- a/tools/testing/selftests/bpf/progs/test_vmlinux.c
+++ b/tools/testing/selftests/bpf/progs/test_vmlinux.c
@@ -16,12 +16,12 @@ bool kprobe_called = false;
 bool fentry_called = false;
 
 SEC("tp/syscalls/sys_enter_nanosleep")
-int handle__tp(struct trace_event_raw_sys_enter *args)
+int handle__tp(struct syscall_trace_enter *args)
 {
 	struct __kernel_timespec *ts;
 	long tv_nsec;
 
-	if (args->id != __NR_nanosleep)
+	if (args->nr != __NR_nanosleep)
 		return 0;
 
 	ts = (void *)args->args[0];
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH bpf-next] bpf: change syscall_nr type to int in struct syscall_tp_t
  2023-10-12 23:32           ` Andrii Nakryiko
  2023-10-13  5:42             ` [PATCH " Artem Savkov
@ 2023-10-13  6:01             ` Artem Savkov
  2023-10-13 14:00               ` Steven Rostedt
  1 sibling, 1 reply; 16+ messages in thread
From: Artem Savkov @ 2023-10-13  6:01 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Steven Rostedt, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, bpf, netdev, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, Thomas Gleixner, linux-rt-users, Jiri Olsa

On Thu, Oct 12, 2023 at 04:32:51PM -0700, Andrii Nakryiko wrote:
> On Thu, Oct 12, 2023 at 6:43 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Thu, 12 Oct 2023 13:45:50 +0200
> > Artem Savkov <asavkov@redhat.com> wrote:
> >
> > > linux-rt-devel tree contains a patch (b1773eac3f29c ("sched: Add support
> > > for lazy preemption")) that adds an extra member to struct trace_entry.
> > > This causes the offset of args field in struct trace_event_raw_sys_enter
> > > be different from the one in struct syscall_trace_enter:
> > >
> > > struct trace_event_raw_sys_enter {
> > >         struct trace_entry         ent;                  /*     0    12 */
> > >
> > >         /* XXX last struct has 3 bytes of padding */
> > >         /* XXX 4 bytes hole, try to pack */
> > >
> > >         long int                   id;                   /*    16     8 */
> > >         long unsigned int          args[6];              /*    24    48 */
> > >         /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
> > >         char                       __data[];             /*    72     0 */
> > >
> > >         /* size: 72, cachelines: 2, members: 4 */
> > >         /* sum members: 68, holes: 1, sum holes: 4 */
> > >         /* paddings: 1, sum paddings: 3 */
> > >         /* last cacheline: 8 bytes */
> > > };
> > >
> > > struct syscall_trace_enter {
> > >         struct trace_entry         ent;                  /*     0    12 */
> > >
> > >         /* XXX last struct has 3 bytes of padding */
> > >
> > >         int                        nr;                   /*    12     4 */
> > >         long unsigned int          args[];               /*    16     0 */
> > >
> > >         /* size: 16, cachelines: 1, members: 3 */
> > >         /* paddings: 1, sum paddings: 3 */
> > >         /* last cacheline: 16 bytes */
> > > };
> > >
> > > This, in turn, causes perf_event_set_bpf_prog() fail while running bpf
> > > test_profiler testcase because max_ctx_offset is calculated based on the
> > > former struct, while off on the latter:
> > >
> > >   10488         if (is_tracepoint || is_syscall_tp) {
> > >   10489                 int off = trace_event_get_offsets(event->tp_event);
> > >   10490
> > >   10491                 if (prog->aux->max_ctx_offset > off)
> > >   10492                         return -EACCES;
> > >   10493         }
> > >
> > > What bpf program is actually getting is a pointer to struct
> > > syscall_tp_t, defined in kernel/trace/trace_syscalls.c. This patch fixes
> > > the problem by aligning struct syscall_tp_t with with struct
> > > syscall_trace_(enter|exit) and changing the tests to use these structs
> > > to dereference context.
> > >
> > > Signed-off-by: Artem Savkov <asavkov@redhat.com>
> >
> 
> I think these changes make sense regardless, can you please resend the
> patch without RFC tag so that our CI can run tests for it?

Ok, didn't know it was set up like that.

> > Thanks for doing a proper fix.
> >
> > Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> 
> But looking at [0] and briefly reading some of the discussions you,
> Steven, had. I'm just wondering if it would be best to avoid
> increasing struct trace_entry altogether? It seems like preempt_count
> is actually a 4-bit field in trace context, so it doesn't seem like we
> really need to allocate an entire byte for both preempt_count and
> preempt_lazy_count. Why can't we just combine them and not waste 8
> extra bytes for each trace event in a ring buffer?
> 
>   [0] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/?id=b1773eac3f29cbdcdfd16e0339f1a164066e9f71

I agree that avoiding increase in struct trace_entry size would be very
desirable, but I have no knowledge whether rt developers had reasons to
do it like this.

Nevertheless I think the issue with verifier running against a wrong
struct still needs to be addressed.

-- 
Regards,
  Artem


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH bpf-next] bpf: change syscall_nr type to int in struct syscall_tp_t
  2023-10-13  6:01             ` [RFC PATCH " Artem Savkov
@ 2023-10-13 14:00               ` Steven Rostedt
  2023-10-13 19:43                 ` Andrii Nakryiko
  0 siblings, 1 reply; 16+ messages in thread
From: Steven Rostedt @ 2023-10-13 14:00 UTC (permalink / raw)
  To: Artem Savkov
  Cc: Andrii Nakryiko, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, bpf, netdev, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, Thomas Gleixner, linux-rt-users, Jiri Olsa

On Fri, 13 Oct 2023 08:01:34 +0200
Artem Savkov <asavkov@redhat.com> wrote:

> > But looking at [0] and briefly reading some of the discussions you,
> > Steven, had. I'm just wondering if it would be best to avoid
> > increasing struct trace_entry altogether? It seems like preempt_count
> > is actually a 4-bit field in trace context, so it doesn't seem like we
> > really need to allocate an entire byte for both preempt_count and
> > preempt_lazy_count. Why can't we just combine them and not waste 8
> > extra bytes for each trace event in a ring buffer?
> > 
> >   [0] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/?id=b1773eac3f29cbdcdfd16e0339f1a164066e9f71  
> 
> I agree that avoiding increase in struct trace_entry size would be very
> desirable, but I have no knowledge whether rt developers had reasons to
> do it like this.
> 
> Nevertheless I think the issue with verifier running against a wrong
> struct still needs to be addressed.

Correct. My Ack is based on the current way things are done upstream.
It was just that linux-rt showed the issue, where the code was not as
robust as it should have been. To me this was a correctness issue, not
an issue that had to do with how things are done in linux-rt.

As for the changes in linux-rt, they are not upstream yet. I'll have my
comments on that code when that happens.

-- Steve

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH bpf-next] bpf: change syscall_nr type to int in struct syscall_tp_t
  2023-10-13 14:00               ` Steven Rostedt
@ 2023-10-13 19:43                 ` Andrii Nakryiko
  2023-10-16 15:53                   ` Steven Rostedt
  0 siblings, 1 reply; 16+ messages in thread
From: Andrii Nakryiko @ 2023-10-13 19:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Artem Savkov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, bpf, netdev, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, Thomas Gleixner, linux-rt-users, Jiri Olsa

On Fri, Oct 13, 2023 at 7:00 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 13 Oct 2023 08:01:34 +0200
> Artem Savkov <asavkov@redhat.com> wrote:
>
> > > But looking at [0] and briefly reading some of the discussions you,
> > > Steven, had. I'm just wondering if it would be best to avoid
> > > increasing struct trace_entry altogether? It seems like preempt_count
> > > is actually a 4-bit field in trace context, so it doesn't seem like we
> > > really need to allocate an entire byte for both preempt_count and
> > > preempt_lazy_count. Why can't we just combine them and not waste 8
> > > extra bytes for each trace event in a ring buffer?
> > >
> > >   [0] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/?id=b1773eac3f29cbdcdfd16e0339f1a164066e9f71
> >
> > I agree that avoiding increase in struct trace_entry size would be very
> > desirable, but I have no knowledge whether rt developers had reasons to
> > do it like this.
> >
> > Nevertheless I think the issue with verifier running against a wrong
> > struct still needs to be addressed.
>
> Correct. My Ack is based on the current way things are done upstream.
> It was just that linux-rt showed the issue, where the code was not as
> robust as it should have been. To me this was a correctness issue, not
> an issue that had to do with how things are done in linux-rt.

I think we should at least add some BUILD_BUG_ON() that validates
offsets in syscall_tp_t matches the ones in syscall_trace_enter and
syscall_trace_exit, to fail more loudly if there is any mismatch in
the future. WDYT?

>
> As for the changes in linux-rt, they are not upstream yet. I'll have my
> comments on that code when that happens.

Ah, ok, cool. I'd appreciate you cc'ing bpf@vger.kernel.org in that
discussion, thank you!

>
> -- Steve

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf-next] bpf: change syscall_nr type to int in struct syscall_tp_t
  2023-10-13  5:42             ` [PATCH " Artem Savkov
@ 2023-10-13 19:50               ` patchwork-bot+netdevbpf
  0 siblings, 0 replies; 16+ messages in thread
From: patchwork-bot+netdevbpf @ 2023-10-13 19:50 UTC (permalink / raw)
  To: Artem Savkov
  Cc: ast, daniel, andrii, bpf, netdev, rostedt, mhiramat,
	linux-kernel, linux-trace-kernel, tglx, linux-rt-users, jolsa

Hello:

This patch was applied to bpf/bpf-next.git (master)
by Andrii Nakryiko <andrii@kernel.org>:

On Fri, 13 Oct 2023 07:42:19 +0200 you wrote:
> linux-rt-devel tree contains a patch (b1773eac3f29c ("sched: Add support
> for lazy preemption")) that adds an extra member to struct trace_entry.
> This causes the offset of args field in struct trace_event_raw_sys_enter
> be different from the one in struct syscall_trace_enter:
> 
> struct trace_event_raw_sys_enter {
>         struct trace_entry         ent;                  /*     0    12 */
> 
> [...]

Here is the summary with links:
  - [bpf-next] bpf: change syscall_nr type to int in struct syscall_tp_t
    https://git.kernel.org/bpf/bpf-next/c/ba8ea72388a1

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH bpf-next] bpf: change syscall_nr type to int in struct syscall_tp_t
  2023-10-13 19:43                 ` Andrii Nakryiko
@ 2023-10-16 15:53                   ` Steven Rostedt
  0 siblings, 0 replies; 16+ messages in thread
From: Steven Rostedt @ 2023-10-16 15:53 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Artem Savkov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, bpf, netdev, Masami Hiramatsu, linux-kernel,
	linux-trace-kernel, Thomas Gleixner, linux-rt-users, Jiri Olsa

On Fri, 13 Oct 2023 12:43:18 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:


> > Correct. My Ack is based on the current way things are done upstream.
> > It was just that linux-rt showed the issue, where the code was not as
> > robust as it should have been. To me this was a correctness issue, not
> > an issue that had to do with how things are done in linux-rt.  
> 
> I think we should at least add some BUILD_BUG_ON() that validates
> offsets in syscall_tp_t matches the ones in syscall_trace_enter and
> syscall_trace_exit, to fail more loudly if there is any mismatch in
> the future. WDYT?

If you want to, feel free to send a patch.

> 
> >
> > As for the changes in linux-rt, they are not upstream yet. I'll have my
> > comments on that code when that happens.  
> 
> Ah, ok, cool. I'd appreciate you cc'ing bpf@vger.kernel.org in that
> discussion, thank you!

If I remember ;-)

-- Steve

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-10-16 15:52 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-02 13:52 [RFC PATCH] tracing: change syscall number type in struct syscall_trace_* Artem Savkov
2023-10-03 22:11 ` Andrii Nakryiko
2023-10-04  7:02   ` Artem Savkov
2023-10-04  1:38 ` Steven Rostedt
2023-10-04 12:55   ` Artem Savkov
2023-10-05 12:34     ` Artem Savkov
2023-10-12 11:45       ` [RFC PATCH bpf-next] bpf: change syscall_nr type to int in struct syscall_tp_t Artem Savkov
2023-10-12 13:44         ` Steven Rostedt
2023-10-12 23:32           ` Andrii Nakryiko
2023-10-13  5:42             ` [PATCH " Artem Savkov
2023-10-13 19:50               ` patchwork-bot+netdevbpf
2023-10-13  6:01             ` [RFC PATCH " Artem Savkov
2023-10-13 14:00               ` Steven Rostedt
2023-10-13 19:43                 ` Andrii Nakryiko
2023-10-16 15:53                   ` Steven Rostedt
2023-10-13  3:13         ` Rod Webster

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).