linux-kernel.vger.kernel.org archive mirror
* "struct perf_sample_data" alignment
@ 2021-03-05  3:45 Linus Torvalds
  2021-03-05  8:36 ` Peter Zijlstra
  0 siblings, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2021-03-05  3:45 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim
  Cc: Linux Kernel Mailing List

So there's a note about new warnings in 5.12-rc1 that I looked at, and
many of the warnings made me go "Whaaa?". They were all of the type

   warning: 'perf_event_aux_event' uses dynamic stack allocation

and then when I go look, I see nothing that looks like a dynamic stack
allocation at all.

But you know what? The warning is kind of misleading, but at the same
time it's true in a sense. The problem is that the function (and a lot
of other functions) has a local variable like this:

        struct perf_sample_data sample;

and the definition of that "struct perf_sample_data" has
____cacheline_aligned at the end.

And guess what? That means that now the compiler actually has to play
games with manually aligning the frame of that function, since the
natural stack alignment is *not* cacheline aligned. So
it's kind of true: the frame allocation is not a simple static thing,
it's a nasty complex thing that wastes memory and time.

That ____cacheline_aligned goes back many years, this is not new, it
seems to come from back in 2014: commit 2565711fb7d7 ("perf: Improve
the perf_sample_data struct layout").

But it really seems entirely and utterly bogus. That cacheline
alignment makes things *worse*, when the variables are on the local
stack. The local stack is already going to be dirty and in the cache,
and aligning those things isn't going to - and I quote from the
comment in that commit - "minimize the cachelines touched".

Quite the reverse. It's just going to make the stack frame use *more*
memory, and make any cacheline usage _worse_.

Maybe there are static (non-automatic) cases of that "struct
perf_sample_data" where the alignment makes sense, but it does not
seem to make sense on the _type_. It should be on those individual
variables, not on every perf_sample_data.

I didn't make any real effort to analyze this, but from a few quick
greps, it looks like every single case is an automatic variable on the
stack, and that the forced alignment is actually a bad thing every
single time. It never ever generates better cache use, but it _does_
generate worse code, and bigger stack frames.

Hmm? What am I missing?

                 Linus


* Re: "struct perf_sample_data" alignment
  2021-03-05  3:45 "struct perf_sample_data" alignment Linus Torvalds
@ 2021-03-05  8:36 ` Peter Zijlstra
  2021-03-05 14:57   ` Peter Zijlstra
  2021-03-08 10:01   ` David Laight
  0 siblings, 2 replies; 6+ messages in thread
From: Peter Zijlstra @ 2021-03-05  8:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Linux Kernel Mailing List

On Thu, Mar 04, 2021 at 07:45:44PM -0800, Linus Torvalds wrote:
> That ____cacheline_aligned goes back many years, this is not new, it
> seems to come from back in 2014: commit 2565711fb7d7 ("perf: Improve
> the perf_sample_data struct layout").

long time ago...

> But it really seems entirely and utterly bogus. That cacheline
> alignment makes things *worse*, when the variables are on the local
> stack. The local stack is already going to be dirty and in the cache,
> and aligning those things isn't going to - and I quote from the code
> comment in that commit - "minimize the cachelines touched".
> 
> Quite the reverse. It's just going to make the stack frame use *more*
> memory, and make any cacheline usage _worse_.

IIRC there is more history here, but I can't seem to find references
just now.

What I remember is that since perf_sample_data is fairly large,
unconditionally initializing the whole thing is *slow* (and
-fauto-var-init=zero will hurt here).

So at some point I removed that full initialization and made sure we
only unconditionally touched the first few variables, which gave a
measurable speedup.
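A sketch of that partial-initialization pattern (field names are
illustrative, not the real perf_sample_data layout): the unconditionally
used fields sit at the front of the struct, and only that prefix is
written on the fast path.

```c
#include <stdint.h>

/* Illustrative stand-in for struct perf_sample_data; the real struct is
 * much larger, which is exactly why fully zeroing it is slow. */
struct sample {
	/* Hot prefix: written for every sample, kept together at the top. */
	uint64_t addr;
	uint64_t period;
	/* Cold tail: only written when the matching sample_type bit is set,
	 * so the fast path leaves it uninitialized. */
	uint64_t cgroup;
	uint64_t data_page_size;
	uint64_t code_page_size;
};

static inline void sample_init(struct sample *s, uint64_t addr, uint64_t period)
{
	/* Touch only the prefix instead of memset(s, 0, sizeof(*s));
	 * consumers must check the sample_type bits before reading the tail. */
	s->addr = addr;
	s->period = period;
}
```

-fauto-var-init=zero defeats this by zeroing the whole object regardless
of what the code touches, which is the point about it hurting here.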

Then things got messy again and the commit 2565711fb7d7 referenced above
was cleanup, to get back to that initial state.

Now, you're right that ____cacheline_aligned on on-stack instances (and
this is indeed mostly on-stack) is fairly tedious (there were a few
patches recently to reduce the number of on-stack instances).

I'll put it on the todo list, along with that hotplug stuff (which I
tried to fix but ended up with an even bigger mess). I suppose we can
try and not have the alignment for the on-stack instances while
preserving it for the few off-stack ones.

Also; we're running on the NMI stack, and that's not typically hot.


* Re: "struct perf_sample_data" alignment
  2021-03-05  8:36 ` Peter Zijlstra
@ 2021-03-05 14:57   ` Peter Zijlstra
  2021-03-05 15:57     ` Alexei Starovoitov
  2021-03-08 10:01   ` David Laight
  1 sibling, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2021-03-05 14:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Linux Kernel Mailing List

On Fri, Mar 05, 2021 at 09:36:30AM +0100, Peter Zijlstra wrote:
> On Thu, Mar 04, 2021 at 07:45:44PM -0800, Linus Torvalds wrote:
> > That ____cacheline_aligned goes back many years, this is not new, it
> > seems to come from back in 2014: commit 2565711fb7d7 ("perf: Improve
> > the perf_sample_data struct layout").
> 
> long time ago...
> 
> > But it really seems entirely and utterly bogus. That cacheline
> > alignment makes things *worse*, when the variables are on the local
> > stack. The local stack is already going to be dirty and in the cache,
> > and aligning those things isn't going to - and I quote from the code
> > comment in that commit - "minimize the cachelines touched".
> > 
> > Quite the reverse. It's just going to make the stack frame use *more*
> > memory, and make any cacheline usage _worse_.
> 
> IIRC there is more history here, but I can't seem to find references
> just now.
> 
> What I remember is that since perf_sample_data is fairly large,
> unconditionally initializing the whole thing is *slow* (and
> -fauto-var-init=zero will hurt here).
> 
> So at some point I removed that full initialization and made sure we
> only unconditionally touched the first few variables, which gave a
> measurable speedup.
> 
> Then things got messy again and the commit 2565711fb7d7 referenced above
> was cleanup, to get back to that initial state.
> 
> Now, you're right that ____cacheline_aligned on on-stack instances (and
> this is indeed mostly on-stack) is fairly tedious (there were a few
> patches recently to reduce the number of on-stack instances).
> 
> I'll put it on the todo list, along with that hotplug stuff (which I
> tried to fix but ended up with an even bigger mess). I suppose we can
> try and not have the alignment for the on-stack instances while
> preserving it for the few off-stack ones.
> 
> Also; we're running on the NMI stack, and that's not typically hot.

This seems to be it... (completely untested)

---
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3f7f89ea5e51..918a296d2ca2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1032,7 +1032,9 @@ struct perf_sample_data {
 	u64				cgroup;
 	u64				data_page_size;
 	u64				code_page_size;
-} ____cacheline_aligned;
+};
+
+typedef struct perf_sample_data perf_sample_data_t ____cacheline_aligned;
 
 /* default value for data source */
 #define PERF_MEM_NA (PERF_MEM_S(OP, NA)   |\
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index b0c45d923f0f..f32c623abef6 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -923,7 +923,7 @@ __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map,
  * bpf_perf_event_output
  */
 struct bpf_trace_sample_data {
-	struct perf_sample_data sds[3];
+	perf_sample_data_t sds[3];
 };
 
 static DEFINE_PER_CPU(struct bpf_trace_sample_data, bpf_trace_sds);


* Re: "struct perf_sample_data" alignment
  2021-03-05 14:57   ` Peter Zijlstra
@ 2021-03-05 15:57     ` Alexei Starovoitov
  2021-03-06 13:14       ` Ingo Molnar
  0 siblings, 1 reply; 6+ messages in thread
From: Alexei Starovoitov @ 2021-03-05 15:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Linux Kernel Mailing List

On Fri, Mar 5, 2021 at 7:01 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Mar 05, 2021 at 09:36:30AM +0100, Peter Zijlstra wrote:
> > On Thu, Mar 04, 2021 at 07:45:44PM -0800, Linus Torvalds wrote:
> > > That ____cacheline_aligned goes back many years, this is not new, it
> > > seems to come from back in 2014: commit 2565711fb7d7 ("perf: Improve
> > > the perf_sample_data struct layout").
> >
> > long time ago...
> >
> > > But it really seems entirely and utterly bogus. That cacheline
> > > alignment makes things *worse*, when the variables are on the local
> > > stack. The local stack is already going to be dirty and in the cache,
> > > and aligning those things isn't going to - and I quote from the code
> > > comment in that commit - "minimize the cachelines touched".
> > >
> > > Quite the reverse. It's just going to make the stack frame use *more*
> > > memory, and make any cacheline usage _worse_.
> >
> > IIRC there is more history here, but I can't seem to find references
> > just now.
> >
> > What I remember is that since perf_sample_data is fairly large,
> > unconditionally initializing the whole thing is *slow* (and
> > -fauto-var-init=zero will hurt here).
> >
> > So at some point I removed that full initialization and made sure we
> > only unconditionally touched the first few variables, which gave a
> > measurable speedup.
> >
> > Then things got messy again and the commit 2565711fb7d7 referenced above
> > was cleanup, to get back to that initial state.
> >
> > Now, you're right that ____cacheline_aligned on on-stack instances (and
> > this is indeed mostly on-stack) is fairly tedious (there were a few
> > patches recently to reduce the number of on-stack instances).
> >
> > I'll put it on the todo list, along with that hotplug stuff (which I
> > tried to fix but ended up with an even bigger mess). I suppose we can
> > try and not have the alignment for the on-stack instances while
> > preserving it for the few off-stack ones.
> >
> > Also; we're running on the NMI stack, and that's not typically hot.
>
> This seems to be it... (completely untested)
>
> ---
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 3f7f89ea5e51..918a296d2ca2 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1032,7 +1032,9 @@ struct perf_sample_data {
>         u64                             cgroup;
>         u64                             data_page_size;
>         u64                             code_page_size;
> -} ____cacheline_aligned;
> +};
> +
> +typedef struct perf_sample_data perf_sample_data_t ____cacheline_aligned;
>
>  /* default value for data source */
>  #define PERF_MEM_NA (PERF_MEM_S(OP, NA)   |\
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index b0c45d923f0f..f32c623abef6 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -923,7 +923,7 @@ __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map,
>   * bpf_perf_event_output
>   */
>  struct bpf_trace_sample_data {
> -       struct perf_sample_data sds[3];
> +       perf_sample_data_t sds[3];

bpf side doesn't care about cacheline alignment.
No need to add a new typedef just for that.


* Re: "struct perf_sample_data" alignment
  2021-03-05 15:57     ` Alexei Starovoitov
@ 2021-03-06 13:14       ` Ingo Molnar
  0 siblings, 0 replies; 6+ messages in thread
From: Ingo Molnar @ 2021-03-06 13:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peter Zijlstra, Linus Torvalds, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Linux Kernel Mailing List


* Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> > This seems to be it... (completely untested)
> >
> > ---
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index 3f7f89ea5e51..918a296d2ca2 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -1032,7 +1032,9 @@ struct perf_sample_data {
> >         u64                             cgroup;
> >         u64                             data_page_size;
> >         u64                             code_page_size;
> > -} ____cacheline_aligned;
> > +};
> > +
> > +typedef struct perf_sample_data perf_sample_data_t ____cacheline_aligned;
> >
> >  /* default value for data source */
> >  #define PERF_MEM_NA (PERF_MEM_S(OP, NA)   |\
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index b0c45d923f0f..f32c623abef6 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -923,7 +923,7 @@ __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map,
> >   * bpf_perf_event_output
> >   */
> >  struct bpf_trace_sample_data {
> > -       struct perf_sample_data sds[3];
> > +       perf_sample_data_t sds[3];
> 
> bpf side doesn't care about about cacheline aligned.
> No need to add new typedef just for that.

So this structure is not supposed to be exposed to any ABI anywhere.

I did a (non-exhaustive) search of tooling, and there doesn't appear 
to be any accidental exposure.

The in-kernel ABI interaction appears to be the following:

 - In __perf_event_header_size() we only use fields within 
   perf_sample_data to size the header. Alignment won't change any of 
   the output.

 - Ditto in perf_event__id_header_size().

I.e. I think we should just zap it per the patch below (untested).

Thanks,

	Ingo

============>

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3f7f89ea5e51..d75e03ff31ea 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1032,7 +1032,7 @@ struct perf_sample_data {
 	u64				cgroup;
 	u64				data_page_size;
 	u64				code_page_size;
-} ____cacheline_aligned;
+};
 
 /* default value for data source */
 #define PERF_MEM_NA (PERF_MEM_S(OP, NA)   |\


* RE: "struct perf_sample_data" alignment
  2021-03-05  8:36 ` Peter Zijlstra
  2021-03-05 14:57   ` Peter Zijlstra
@ 2021-03-08 10:01   ` David Laight
  1 sibling, 0 replies; 6+ messages in thread
From: David Laight @ 2021-03-08 10:01 UTC (permalink / raw)
  To: 'Peter Zijlstra', Linus Torvalds
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Linux Kernel Mailing List

...
> What I remember is that since perf_sample_data is fairly large,
> unconditionally initializing the whole thing is *slow* (and
> -fauto-var-init=zero will hurt here).

That will hurt everywhere.
I can also imagine it hiding bugs and making people
shrink on-stack arrays to the point where they either overrun
or cause an unexpected error or string truncation.

Initialising to zero is a bad choice if the aim is to
avoid leaking stack.

	David



