* [PATCH v3] perf/core: Set event shadow time for inactive events too
@ 2021-12-05 22:48 Namhyung Kim
  2021-12-06 23:11 ` Song Liu
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Namhyung Kim @ 2021-12-05 22:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Alexander Shishkin, LKML, Stephane Eranian, Andi Kleen,
	Ian Rogers, Song Liu

While commit f79256532682 ("perf/core: fix userpage->time_enabled of
inactive events") fixed this problem for user rdpmc usage, bperf (perf
stat with BPF) still has the same problem when accessing inactive perf
events from BPF using bpf_perf_event_read_value().

You can reproduce this problem easily.  As this is about a small
window with multiplexing, we need a large number of events and a short
duration, like below:

  # perf stat -a -v --bpf-counters -e instructions,branches,branch-misses \
    -e cache-references,cache-misses,bus-cycles,ref-cycles,cycles sleep 0.1

  Control descriptor is not initialized
  instructions: 19616489 431324015 360374366
  branches: 3685346 417640114 344175443
  branch-misses: 75714 404089360 336145421
  cache-references: 438667 390474289 327444074
  cache-misses: 49279 349333164 272835067
  bus-cycles: 631887 283423953 165164214
  ref-cycles: 2578771111104847872 18446744069443110306 182116355
  cycles: 1785221016051271680 18446744071682768912 115821694

   Performance counter stats for 'system wide':

          19,616,489      instructions              #    0.00  insn per cycle           ( 83.55%)
           3,685,346      branches                                                      ( 82.41%)
              75,714      branch-misses             #    2.05% of all branches          ( 83.19%)
             438,667      cache-references                                              ( 83.86%)
              49,279      cache-misses              #   11.234 % of all cache refs      ( 78.10%)
             631,887      bus-cycles                                                    ( 58.27%)
  2,578,771,111,104,847,872      ref-cycles                                                     (0.00%)
  1,785,221,016,051,271,680      cycles                                                         (0.00%)

       0.010824702 seconds time elapsed

As you can see, it shows invalid values for the last two events.
The -v option shows that the enabled time is way bigger than the
running time, so the counter values were scaled by the ratio between
the two, which produced the bogus numbers above.  This problem can get
worse if users want no-aggregation or cgroup aggregation with a small
interval.

Actually 18446744069443110306 is 0xffffffff01b345a2, so it seems to
have a negative enabled time.  In fact, bperf keeps the values returned
by bpf_perf_event_read_value(), which calls perf_event_read_local(),
and accumulates the delta between two calls.  When event->shadow_ctx_time
is not set, it returns an invalid enabled time which is bigger than
normal.  Later, the shadow time is set and the function starts to return
a valid time.  At that point, the new value is smaller than the previous
one, so the delta in bperf can be negative.
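
To illustrate, the accumulation in bperf is conceptually like the
sketch below (a hypothetical reader, not the actual BPF program;
read_enabled() is a made-up helper standing in for reading the
'enabled' field returned by bpf_perf_event_read_value()):

	/* read_enabled(): hypothetical wrapper around bpf_perf_event_read_value() */
	u64 prev = 0, cur, acc = 0;

	cur = read_enabled();	/* 1st read: shadow time unset -> inflated value */
	acc += cur - prev;
	prev = cur;

	cur = read_enabled();	/* 2nd read: shadow time set -> smaller, valid value */
	acc += cur - prev;	/* delta < 0, wraps to ~0xffffffff........ */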

I think we need to set the shadow time even when the events are
inactive so that BPF programs (or other potential users) can see valid
time values at any time.

Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 kernel/events/core.c | 18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3b3297a57228..682408ca3413 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3707,27 +3707,23 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 	return 0;
 }
 
-static inline bool event_update_userpage(struct perf_event *event)
+static inline void update_event_time(struct perf_event *event)
 {
-	if (likely(!atomic_read(&event->mmap_count)))
-		return false;
-
 	perf_event_update_time(event);
 	perf_set_shadow_time(event, event->ctx);
-	perf_event_update_userpage(event);
 
-	return true;
+	if (unlikely(atomic_read(&event->mmap_count)))
+		perf_event_update_userpage(event);
 }
 
-static inline void group_update_userpage(struct perf_event *group_event)
+static inline void group_update_event_time(struct perf_event *group_event)
 {
 	struct perf_event *event;
 
-	if (!event_update_userpage(group_event))
-		return;
+	update_event_time(group_event);
 
 	for_each_sibling_event(event, group_event)
-		event_update_userpage(event);
+		update_event_time(event);
 }
 
 static int merge_sched_in(struct perf_event *event, void *data)
@@ -3755,7 +3751,7 @@ static int merge_sched_in(struct perf_event *event, void *data)
 		} else {
 			ctx->rotate_necessary = 1;
 			perf_mux_hrtimer_restart(cpuctx);
-			group_update_userpage(event);
+			group_update_event_time(event);
 		}
 	}
 
-- 
2.34.1.400.ga245620fadb-goog



* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-05 22:48 [PATCH v3] perf/core: Set event shadow time for inactive events too Namhyung Kim
@ 2021-12-06 23:11 ` Song Liu
  2021-12-08 23:22 ` Peter Zijlstra
  2021-12-09 11:26 ` Peter Zijlstra
  2 siblings, 0 replies; 24+ messages in thread
From: Song Liu @ 2021-12-06 23:11 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Andi Kleen, Ian Rogers



> On Dec 5, 2021, at 2:48 PM, Namhyung Kim <namhyung@kernel.org> wrote:
> 
> While commit f79256532682 ("perf/core: fix userpage->time_enabled of
> inactive events") fixed this problem for user rdpmc usage, bperf (perf
> stat with BPF) still has the same problem that accessing inactive perf
> events from BPF using bpf_perf_event_read_value().
> 
> You can reproduce this problem easily.  As this is about a small
> window with multiplexing, we need a large number of events and short
> duration like below:
> 
>  # perf stat -a -v --bpf-counters -e instructions,branches,branch-misses \
>    -e cache-references,cache-misses,bus-cycles,ref-cycles,cycles sleep 0.1
> 
>  Control descriptor is not initialized
>  instructions: 19616489 431324015 360374366
>  branches: 3685346 417640114 344175443
>  branch-misses: 75714 404089360 336145421
>  cache-references: 438667 390474289 327444074
>  cache-misses: 49279 349333164 272835067
>  bus-cycles: 631887 283423953 165164214
>  ref-cycles: 2578771111104847872 18446744069443110306 182116355
>  cycles: 1785221016051271680 18446744071682768912 115821694
> 
>   Performance counter stats for 'system wide':
> 
>          19,616,489      instructions              #    0.00  insn per cycle           ( 83.55%)
>           3,685,346      branches                                                      ( 82.41%)
>              75,714      branch-misses             #    2.05% of all branches          ( 83.19%)
>             438,667      cache-references                                              ( 83.86%)
>              49,279      cache-misses              #   11.234 % of all cache refs      ( 78.10%)
>             631,887      bus-cycles                                                    ( 58.27%)
>  2,578,771,111,104,847,872      ref-cycles                                                     (0.00%)
>  1,785,221,016,051,271,680      cycles                                                         (0.00%)
> 
>       0.010824702 seconds time elapsed
> 
> As you can see, it shows invalid values for the last two events.
> The -v option shows that the enabled time is way bigger than the
> running time.  So it scaled the counter values using the ratio
> between the two and resulted in that.  This problem can get worse
> if users want no-aggregation or cgroup aggregation with a small
> interval.
> 
> Actually 18446744069443110306 is 0xffffffff01b345a2 so it seems to
> have a negative enabled time.  In fact, bperf keeps values returned by
> bpf_perf_event_read_value() which calls perf_event_read_local(), and
> accumulates delta between two calls.  When event->shadow_ctx_time is
> not set, it'd return invalid enabled time which is bigger than normal.
> Later, the shadow time is set and the function starts to return a
> valid time.  At the moment, the recent value is smaller than before so
> the delta in the bperf can be negative.
> 
> I think we need to set the shadow time even the events are inactive so
> that BPF programs (or other potential users) can see valid time values
> anytime.
> 
> Cc: Song Liu <songliubraving@fb.com>
> Signed-off-by: Namhyung Kim <namhyung@kernel.org>

Acked-by: Song Liu <song@kernel.org>



* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-05 22:48 [PATCH v3] perf/core: Set event shadow time for inactive events too Namhyung Kim
  2021-12-06 23:11 ` Song Liu
@ 2021-12-08 23:22 ` Peter Zijlstra
  2021-12-09  5:52   ` Namhyung Kim
  2021-12-09 11:26 ` Peter Zijlstra
  2 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2021-12-08 23:22 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Alexander Shishkin, LKML, Stephane Eranian, Andi Kleen,
	Ian Rogers, Song Liu

On Sun, Dec 05, 2021 at 02:48:43PM -0800, Namhyung Kim wrote:
> While commit f79256532682 ("perf/core: fix userpage->time_enabled of
> inactive events") fixed this problem for user rdpmc usage,

You're referring to 'this problem' before actually describing a problem :-(

Also, you now have me looking at that commit again, and I'm still hating
it. Also, I'm again struggling to make sense of it; all except the very
last hunk that is.

So the whole, full-fat, mmap self-monitor thing looks like:


	u32 seq, time_mult, time_shift, index, width = 64;
	u64 count, enabled, running;
	u64 cyc, time_offset, time_cycles = 0, time_mask = ~0ULL;
	u64 quot, rem, delta;
	s64 pmc = 0;

	do {
		seq = pc->lock;
		barrier();

		enabled = pc->time_enabled;
		running = pc->time_running;

		if (pc->cap_user_time && enabled != running) {
			cyc = rdtsc();
			time_offset = pc->time_offset;
			time_mult   = pc->time_mult;
			time_shift  = pc->time_shift;
		}

		if (pc->cap_user_time_short) {
			time_cycles = pc->time_cycles;
			time_mask   = pc->time_mask;
		}

		index = pc->index;
		count = pc->offset;
		if (pc->cap_user_rdpmc && index) {
			width = pc->pmc_width;
			pmc = rdpmc(index - 1);
		}

		barrier();
	} while (pc->lock != seq);

	if (width < 64) {
		pmc <<= 64 - width;
		pmc >>= 64 - width;
	}
	count += pmc;

	cyc = time_cycles + ((cyc - time_cycles) & time_mask);

	quot = (cyc >> time_shift);
	rem = cyc & ((1ULL << time_shift) - 1);
	delta = time_offset + quot * time_mult +
		((rem * time_mult) >> time_shift);

	enabled += delta;
	if (index)
		running += delta;

	quot = count / running;
	rem  = count % running;
	count = quot * enabled + (rem * enabled) / running;


Now, the thing that sticks out to me is that 'enabled' is
unconditionally advanced. It *always* runs.

So how can not updating ->time_enabled when the counter is INACTIVE due
to rotation (which causes ->index == 0), cause enabled to not be
up-to-date?

Can we please figure that out so I can go revert all but the last hunk
of that patch?


* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-08 23:22 ` Peter Zijlstra
@ 2021-12-09  5:52   ` Namhyung Kim
  2021-12-09  8:21     ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Namhyung Kim @ 2021-12-09  5:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Alexander Shishkin, LKML, Stephane Eranian, Andi Kleen,
	Ian Rogers, Song Liu

Hi Peter,

On Wed, Dec 8, 2021 at 3:22 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Dec 05, 2021 at 02:48:43PM -0800, Namhyung Kim wrote:
> > While commit f79256532682 ("perf/core: fix userpage->time_enabled of
> > inactive events") fixed this problem for user rdpmc usage,
>
> You're referring to 'this problem' before actually describing a problem :-(

Well, it's a problem of reporting incorrect 'enabled' time.
I'm sorry if it was not clear.

>
> Also, you now have me looking at that commit again, and I'm still hating
> it. Also, I'm again struggling to make sense of it; all except the very
> last hunk that is.
>
> So the whole, full-fat, mmap self-monitor thing looks like:
>
>
>         u32 seq, time_mult, time_shift, index, width = 64;
>         u64 count, enabled, running;
>         u64 cyc, time_offset, time_cycles = 0, time_mask = ~0ULL;
>         u64 quot, rem, delta;
>         s64 pmc = 0;
>
>         do {
>                 seq = pc->lock;
>                 barrier();
>
>                 enabled = pc->time_enabled;
>                 running = pc->time_running;
>
>                 if (pc->cap_user_time && enabled != running) {
>                         cyc = rdtsc();
>                         time_offset = pc->time_offset;
>                         time_mult   = pc->time_mult;
>                         time_shift  = pc->time_shift;
>                 }
>
>                 if (pc->cap_user_time_short) {
>                         time_cycles = pc->time_cycles;
>                         time_mask   = pc->time_mask;
>                 }
>
>                 index = pc->index;
>                 count = pc->offset;
>                 if (pc->cap_user_rdpmc && index) {
>                         width = pc->pmc_width;
>                         pmc = rdpmc(index - 1);
>                 }
>
>                 barrier();
>         } while (pc->lock != seq);
>
>         if (width < 64) {
>                 pmc <<= 64 - width;
>                 pmc >>= 64 - width;
>         }
>         count += pmc;
>
>         cyc = time_cycles + ((cyc - time_cycles) & time_mask);
>
>         quot = (cyc >> time_shift);
>         rem = cyc & ((1ULL << time_shift) - 1);
>         delta = time_offset + quot * time_mult +
>                 ((rem * time_mult) >> time_shift);
>
>         enabled += delta;
>         if (index)
>                 running += delta;
>
>         quot = count / running;
>         rem  = count % running;
>         count = quot * enabled + (rem * enabled) / running;
>
>
> Now, the thing that sticks out to me is that 'enabled' is
> unconditionally advanced. It *always* runs.
>
> So how can not updating ->time_enabled when the counter is INACTIVE due
> to rotation (which causes ->index == 0), cause enabled to not be
> up-to-date?

Hmm.. I don't get it.  In my understanding, that's the whole point
of the enabled time - tracking the time the event was not active due
to the multiplexing (rotation), so that users can scale the count
based on the ratio of running vs enabled.
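
Just as a toy illustration (numbers made up, not from the run in the
commit message), the usual scaling is:

	/* counted 1000 while running 25ms out of 100ms enabled */
	u64 count = 1000, running = 25000000, enabled = 100000000;
	u64 scaled = count * enabled / running;		/* == 4000 */

(your quot/rem form above is the same math, just avoiding the
multiplication overflow for large counts)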

Do I miss something?

Thanks,
Namhyung


>
> Can we please figure that out so I can go revert all but the last hunk
> of that patch?


* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-09  5:52   ` Namhyung Kim
@ 2021-12-09  8:21     ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2021-12-09  8:21 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Alexander Shishkin, LKML, Stephane Eranian, Andi Kleen,
	Ian Rogers, Song Liu

On Wed, Dec 08, 2021 at 09:52:16PM -0800, Namhyung Kim wrote:
> Hi Peter,
> 
> On Wed, Dec 8, 2021 at 3:22 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Sun, Dec 05, 2021 at 02:48:43PM -0800, Namhyung Kim wrote:
> > > While commit f79256532682 ("perf/core: fix userpage->time_enabled of
> > > inactive events") fixed this problem for user rdpmc usage,
> >
> > You're referring to 'this problem' before actually describing a problem :-(
> 
> Well, it's a problem of reporting incorrect 'enabled' time.
> I'm sorry if it was not clear.
> 
> >
> > Also, you now have me looking at that commit again, and I'm still hating
> > it. Also, I'm again struggling to make sense of it; all except the very
> > last hunk that is.
> >
> > So the whole, full-fat, mmap self-monitor thing looks like:
> >
> >
> >         u32 seq, time_mult, time_shift, index, width = 64;
> >         u64 count, enabled, running;
> >         u64 cyc, time_offset, time_cycles = 0, time_mask = ~0ULL;
> >         u64 quot, rem, delta;
> >         s64 pmc = 0;
> >
> >         do {
> >                 seq = pc->lock;
> >                 barrier();
> >
> >                 enabled = pc->time_enabled;
> >                 running = pc->time_running;
> >
> >                 if (pc->cap_user_time && enabled != running) {
> >                         cyc = rdtsc();
> >                         time_offset = pc->time_offset;
> >                         time_mult   = pc->time_mult;
> >                         time_shift  = pc->time_shift;
> >                 }
> >
> >                 if (pc->cap_user_time_short) {
> >                         time_cycles = pc->time_cycles;
> >                         time_mask   = pc->time_mask;
> >                 }
> >
> >                 index = pc->index;
> >                 count = pc->offset;
> >                 if (pc->cap_user_rdpmc && index) {
> >                         width = pc->pmc_width;
> >                         pmc = rdpmc(index - 1);
> >                 }
> >
> >                 barrier();
> >         } while (pc->lock != seq);
> >
> >         if (width < 64) {
> >                 pmc <<= 64 - width;
> >                 pmc >>= 64 - width;
> >         }
> >         count += pmc;
> >
> >         cyc = time_cycles + ((cyc - time_cycles) & time_mask);
> >
> >         quot = (cyc >> time_shift);
> >         rem = cyc & ((1ULL << time_shift) - 1);
> >         delta = time_offset + quot * time_mult +
> >                 ((rem * time_mult) >> time_shift);
> >
> >         enabled += delta;
> >         if (index)
> >                 running += delta;
> >
> >         quot = count / running;
> >         rem  = count % running;
> >         count = quot * enabled + (rem * enabled) / running;
> >
> >
> > Now, the thing that sticks out to me is that 'enabled' is
> > unconditionally advanced. It *always* runs.
> >
> > So how can not updating ->time_enabled when the counter is INACTIVE due
> > to rotation (which causes ->index == 0), cause enabled to not be
> > up-to-date?
> 
> Hmm.. I don't get it.  In my understanding, that's the whole point
> of the enabled time - tracking time it was not active due to the
> multiplexing (rotation).  So that users might want to scale the
> count based on the ratio of running vs enabled.

Correct, and AFAICT that works as advertised.

> Do I miss something?

Where do we actually need the crap that is commit f79256532682 ?


* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-05 22:48 [PATCH v3] perf/core: Set event shadow time for inactive events too Namhyung Kim
  2021-12-06 23:11 ` Song Liu
  2021-12-08 23:22 ` Peter Zijlstra
@ 2021-12-09 11:26 ` Peter Zijlstra
  2021-12-09 11:34   ` Peter Zijlstra
  2021-12-09 21:35   ` Namhyung Kim
  2 siblings, 2 replies; 24+ messages in thread
From: Peter Zijlstra @ 2021-12-09 11:26 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Alexander Shishkin, LKML, Stephane Eranian, Andi Kleen,
	Ian Rogers, Song Liu

On Sun, Dec 05, 2021 at 02:48:43PM -0800, Namhyung Kim wrote:

> Actually 18446744069443110306 is 0xffffffff01b345a2 so it seems to
> have a negative enabled time.  In fact, bperf keeps values returned by
> bpf_perf_event_read_value() which calls perf_event_read_local(), and
> accumulates delta between two calls.  When event->shadow_ctx_time is
> not set, it'd return invalid enabled time which is bigger than normal.

*that*, how does it happen that shadow_time isn't set? It should be last
set when the event switches to INACTIVE, no? At which point the logic in
perf_event_read_local() should make @enabled move forward while @running
stays put.
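
For context, the logic in question is __perf_update_times(), roughly
(paraphrased, not the exact source):

	static void __perf_update_times(struct perf_event *event, u64 now,
					u64 *enabled, u64 *running)
	{
		u64 delta = now - event->tstamp;

		*enabled = event->total_time_enabled;
		if (event->state >= PERF_EVENT_STATE_INACTIVE)
			*enabled += delta;	/* INACTIVE: enabled keeps advancing */

		*running = event->total_time_running;
		if (event->state == PERF_EVENT_STATE_ACTIVE)
			*running += delta;	/* running advances only while ACTIVE */
	}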

Let me go rummage around a bit... either I'm missing something obvious
or something's smelly.


* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-09 11:26 ` Peter Zijlstra
@ 2021-12-09 11:34   ` Peter Zijlstra
  2021-12-09 21:51     ` Namhyung Kim
  2021-12-09 21:35   ` Namhyung Kim
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2021-12-09 11:34 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Alexander Shishkin, LKML, Stephane Eranian, Andi Kleen,
	Ian Rogers, Song Liu

On Thu, Dec 09, 2021 at 12:26:32PM +0100, Peter Zijlstra wrote:
> On Sun, Dec 05, 2021 at 02:48:43PM -0800, Namhyung Kim wrote:
> 
> > Actually 18446744069443110306 is 0xffffffff01b345a2 so it seems to
> > have a negative enabled time.  In fact, bperf keeps values returned by
> > bpf_perf_event_read_value() which calls perf_event_read_local(), and
> > accumulates delta between two calls.  When event->shadow_ctx_time is
> > not set, it'd return invalid enabled time which is bigger than normal.
> 
> *that*, how does it happen that shadow_time isn't set? It should be last
> set when the event switches to INACTIVE, no? At which point the logic in
> perf_event_read_local() should make @enabled move forward while @running
> stays put.
> 
> Let me go rummage around a bit... either I'm missing something obvious
> or something's smelly.

How's this then?

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 7b7525e9155f..82baa8bdfc86 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -686,9 +686,7 @@ struct perf_event {
 	/*
 	 * timestamp shadows the actual context timing but it can
 	 * be safely used in NMI interrupt context. It reflects the
-	 * context time as it was when the event was last scheduled in,
-	 * or when ctx_sched_in failed to schedule the event because we
-	 * run out of PMC.
+	 * context time as it was when the event was last scheduled in.
 	 *
 	 * ctx_time already accounts for ctx->timestamp. Therefore to
 	 * compute ctx_time for a sample, simply add perf_clock().
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 28aaeacdaea1..20637b7f420c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -640,6 +640,9 @@ __perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *runnin
 		*running += delta;
 }
 
+static void perf_set_shadow_time(struct perf_event *event,
+				 struct perf_event_context *ctx);
+
 static void perf_event_update_time(struct perf_event *event)
 {
 	u64 now = perf_event_time(event);
@@ -647,6 +650,7 @@ static void perf_event_update_time(struct perf_event *event)
 	__perf_update_times(event, now, &event->total_time_enabled,
 					&event->total_time_running);
 	event->tstamp = now;
+	perf_set_shadow_time(event, event->ctx);
 }
 
 static void perf_event_update_sibling_time(struct perf_event *leader)
@@ -2552,8 +2556,6 @@ event_sched_in(struct perf_event *event,
 
 	perf_pmu_disable(event->pmu);
 
-	perf_set_shadow_time(event, ctx);
-
 	perf_log_itrace_start(event);
 
 	if (event->pmu->add(event, PERF_EF_START)) {
@@ -3707,29 +3709,6 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 	return 0;
 }
 
-static inline bool event_update_userpage(struct perf_event *event)
-{
-	if (likely(!atomic_read(&event->mmap_count)))
-		return false;
-
-	perf_event_update_time(event);
-	perf_set_shadow_time(event, event->ctx);
-	perf_event_update_userpage(event);
-
-	return true;
-}
-
-static inline void group_update_userpage(struct perf_event *group_event)
-{
-	struct perf_event *event;
-
-	if (!event_update_userpage(group_event))
-		return;
-
-	for_each_sibling_event(event, group_event)
-		event_update_userpage(event);
-}
-
 static int merge_sched_in(struct perf_event *event, void *data)
 {
 	struct perf_event_context *ctx = event->ctx;
@@ -3748,15 +3727,14 @@ static int merge_sched_in(struct perf_event *event, void *data)
 	}
 
 	if (event->state == PERF_EVENT_STATE_INACTIVE) {
-		*can_add_hw = 0;
 		if (event->attr.pinned) {
 			perf_cgroup_event_disable(event, ctx);
 			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
-		} else {
-			ctx->rotate_necessary = 1;
-			perf_mux_hrtimer_restart(cpuctx);
-			group_update_userpage(event);
 		}
+
+		*can_add_hw = 0;
+		ctx->rotate_necessary = 1;
+		perf_mux_hrtimer_restart(cpuctx);
 	}
 
 	return 0;
@@ -6349,7 +6327,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 		ring_buffer_attach(event, rb);
 
 		perf_event_update_time(event);
-		perf_set_shadow_time(event, event->ctx);
 		perf_event_init_userpage(event);
 		perf_event_update_userpage(event);
 	} else {


* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-09 11:26 ` Peter Zijlstra
  2021-12-09 11:34   ` Peter Zijlstra
@ 2021-12-09 21:35   ` Namhyung Kim
  2021-12-10 10:33     ` Peter Zijlstra
  1 sibling, 1 reply; 24+ messages in thread
From: Namhyung Kim @ 2021-12-09 21:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Alexander Shishkin, LKML, Stephane Eranian, Andi Kleen,
	Ian Rogers, Song Liu

On Thu, Dec 9, 2021 at 3:26 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Dec 05, 2021 at 02:48:43PM -0800, Namhyung Kim wrote:
>
> > Actually 18446744069443110306 is 0xffffffff01b345a2 so it seems to
> > have a negative enabled time.  In fact, bperf keeps values returned by
> > bpf_perf_event_read_value() which calls perf_event_read_local(), and
> > accumulates delta between two calls.  When event->shadow_ctx_time is
> > not set, it'd return invalid enabled time which is bigger than normal.
>
> *that*, how does it happen that shadow_time isn't set? It should be last
> set when the event switches to INACTIVE, no?

As you can see, perf_event_set_state() doesn't set the shadow time.
It's called from event_sched_in(), which might result in ACTIVE or
INACTIVE.  But the problem is that there's a case where event_sched_in()
is not called at all - when group_can_go_on() returns false.
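
i.e. roughly (paraphrasing merge_sched_in(), not verbatim):

	static int merge_sched_in(struct perf_event *event, void *data)
	{
		...
		if (group_can_go_on(event, cpuctx, *can_add_hw)) {
			if (!group_sched_in(event, cpuctx, ctx))
				list_add_tail(&event->active_list,
					      get_event_list(event));
		}
		/*
		 * If group_can_go_on() returned false, event_sched_in() and
		 * thus perf_set_shadow_time() were never called for it.
		 */
		if (event->state == PERF_EVENT_STATE_INACTIVE) {
			...
		}
		...
	}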

> At which point the logic in
> perf_event_read_local() should make @enabled move forward while @running
> stays put.

It's not about updating event->total_time_enabled, it only
affects the returned value of @enabled.

I'd say the time calculation is broken so it'd break @running
as well.  But this case can only happen on INACTIVE -
otherwise event_sched_in() would have been called and would have
updated the shadow time properly, so there's no issue there.  So
we only see the broken value for the enabled time.

>
> Let me go rummage around a bit... either I'm missing something obvious
> or something's smelly.

Thank you for doing that!

Thanks,
Namhyung


* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-09 11:34   ` Peter Zijlstra
@ 2021-12-09 21:51     ` Namhyung Kim
  2021-12-10 10:19       ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Namhyung Kim @ 2021-12-09 21:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Alexander Shishkin, LKML, Stephane Eranian, Andi Kleen,
	Ian Rogers, Song Liu

On Thu, Dec 9, 2021 at 3:35 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Dec 09, 2021 at 12:26:32PM +0100, Peter Zijlstra wrote:
> > On Sun, Dec 05, 2021 at 02:48:43PM -0800, Namhyung Kim wrote:
> >
> > > Actually 18446744069443110306 is 0xffffffff01b345a2 so it seems to
> > > have a negative enabled time.  In fact, bperf keeps values returned by
> > > bpf_perf_event_read_value() which calls perf_event_read_local(), and
> > > accumulates delta between two calls.  When event->shadow_ctx_time is
> > > not set, it'd return invalid enabled time which is bigger than normal.
> >
> > *that*, how does it happen that shadow_time isn't set? It should be last
> > set when the event switches to INACTIVE, no? At which point the logic in
> > perf_event_read_local() should make @enabled move forward while @running
> > stays put.
> >
> > Let me go rummage around a bit... either I'm missing something obvious
> > or something's smelly.
>
> How's this then?

Still the same :(

Maybe because the event is enabled from the beginning.
Then it might miss set_state/update_time at all.

>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 7b7525e9155f..82baa8bdfc86 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -686,9 +686,7 @@ struct perf_event {
>         /*
>          * timestamp shadows the actual context timing but it can
>          * be safely used in NMI interrupt context. It reflects the
> -        * context time as it was when the event was last scheduled in,
> -        * or when ctx_sched_in failed to schedule the event because we
> -        * run out of PMC.
> +        * context time as it was when the event was last scheduled in.
>          *
>          * ctx_time already accounts for ctx->timestamp. Therefore to
>          * compute ctx_time for a sample, simply add perf_clock().
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 28aaeacdaea1..20637b7f420c 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -640,6 +640,9 @@ __perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *runnin
>                 *running += delta;
>  }
>
> +static void perf_set_shadow_time(struct perf_event *event,
> +                                struct perf_event_context *ctx);
> +
>  static void perf_event_update_time(struct perf_event *event)
>  {
>         u64 now = perf_event_time(event);
> @@ -647,6 +650,7 @@ static void perf_event_update_time(struct perf_event *event)
>         __perf_update_times(event, now, &event->total_time_enabled,
>                                         &event->total_time_running);
>         event->tstamp = now;
> +       perf_set_shadow_time(event, event->ctx);

I like this.

>  }
>
>  static void perf_event_update_sibling_time(struct perf_event *leader)
> @@ -2552,8 +2556,6 @@ event_sched_in(struct perf_event *event,
>
>         perf_pmu_disable(event->pmu);
>
> -       perf_set_shadow_time(event, ctx);
> -
>         perf_log_itrace_start(event);
>
>         if (event->pmu->add(event, PERF_EF_START)) {
> @@ -3707,29 +3709,6 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
>         return 0;
>  }
>
> -static inline bool event_update_userpage(struct perf_event *event)
> -{
> -       if (likely(!atomic_read(&event->mmap_count)))
> -               return false;
> -
> -       perf_event_update_time(event);
> -       perf_set_shadow_time(event, event->ctx);
> -       perf_event_update_userpage(event);
> -
> -       return true;
> -}
> -
> -static inline void group_update_userpage(struct perf_event *group_event)
> -{
> -       struct perf_event *event;
> -
> -       if (!event_update_userpage(group_event))
> -               return;
> -
> -       for_each_sibling_event(event, group_event)
> -               event_update_userpage(event);
> -}
> -
>  static int merge_sched_in(struct perf_event *event, void *data)
>  {
>         struct perf_event_context *ctx = event->ctx;
> @@ -3748,15 +3727,14 @@ static int merge_sched_in(struct perf_event *event, void *data)
>         }
>
>         if (event->state == PERF_EVENT_STATE_INACTIVE) {
> -               *can_add_hw = 0;
>                 if (event->attr.pinned) {
>                         perf_cgroup_event_disable(event, ctx);
>                         perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
> -               } else {
> -                       ctx->rotate_necessary = 1;
> -                       perf_mux_hrtimer_restart(cpuctx);
> -                       group_update_userpage(event);
>                 }
> +
> +               *can_add_hw = 0;
> +               ctx->rotate_necessary = 1;
> +               perf_mux_hrtimer_restart(cpuctx);

Not sure about this.  We might not want to rotate them
if a pinned event failed...?

Thanks,
Namhyung


>         }
>
>         return 0;
> @@ -6349,7 +6327,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
>                 ring_buffer_attach(event, rb);
>
>                 perf_event_update_time(event);
> -               perf_set_shadow_time(event, event->ctx);
>                 perf_event_init_userpage(event);
>                 perf_event_update_userpage(event);
>         } else {


* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-09 21:51     ` Namhyung Kim
@ 2021-12-10 10:19       ` Peter Zijlstra
  2021-12-10 18:59         ` Namhyung Kim
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2021-12-10 10:19 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Alexander Shishkin, LKML, Stephane Eranian, Andi Kleen,
	Ian Rogers, Song Liu

On Thu, Dec 09, 2021 at 01:51:42PM -0800, Namhyung Kim wrote:
> On Thu, Dec 9, 2021 at 3:35 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Dec 09, 2021 at 12:26:32PM +0100, Peter Zijlstra wrote:
> > > On Sun, Dec 05, 2021 at 02:48:43PM -0800, Namhyung Kim wrote:
> > >
> > > > Actually 18446744069443110306 is 0xffffffff01b345a2 so it seems to
> > > > have a negative enabled time.  In fact, bperf keeps values returned by
> > > > bpf_perf_event_read_value() which calls perf_event_read_local(), and
> > > > accumulates delta between two calls.  When event->shadow_ctx_time is
> > > > not set, it'd return invalid enabled time which is bigger than normal.
> > >
> > > *that*, how does it happen that shadow_time isn't set? It should be last
> > > set when the event switches to INACTIVE, no? At which point the logic in
> > > perf_event_read_local() should make @enabled move forward while @running
> > > stays put.
> > >
> > > Let me go rummage around a bit... either I'm missing something obvious
> > > or something's smelly.
> >
> > How's this then?
> 
> Still the same :(

You're doing that bpf-cgroup crud, right? Where exactly do you hook into
to do the counter reads?

> Maybe because the event is enabled from the beginning.
> Then it might miss set_state/update_time at all.

Even then, it's set to INACTIVE and any state change thereafter needs to
go through perf_event_set_state() and update the relevant timestamps.

> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 28aaeacdaea1..20637b7f420c 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -640,6 +640,9 @@ __perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *runnin
> >                 *running += delta;
> >  }
> >
> > +static void perf_set_shadow_time(struct perf_event *event,
> > +                                struct perf_event_context *ctx);
> > +
> >  static void perf_event_update_time(struct perf_event *event)
> >  {
> >         u64 now = perf_event_time(event);
> > @@ -647,6 +650,7 @@ static void perf_event_update_time(struct perf_event *event)
> >         __perf_update_times(event, now, &event->total_time_enabled,
> >                                         &event->total_time_running);
> >         event->tstamp = now;
> > +       perf_set_shadow_time(event, event->ctx);
> 
> I like this.

Right, it keeps the shadow timestamp thingy in sync. Specifically it was
missing an update on event sched_out. Although thinking about it more,
that shouldn't make a difference since the relative displacement of the
clocks doesn't change at that point. All that changes there is that
RUNNING should stop advancing.

So in that regard, it makes sense that this doesn't actually change
anything.

> > @@ -3748,15 +3727,14 @@ static int merge_sched_in(struct perf_event *event, void *data)
> >         }
> >
> >         if (event->state == PERF_EVENT_STATE_INACTIVE) {
> > -               *can_add_hw = 0;
> >                 if (event->attr.pinned) {
> >                         perf_cgroup_event_disable(event, ctx);
> >                         perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
> > -               } else {
> > -                       ctx->rotate_necessary = 1;
> > -                       perf_mux_hrtimer_restart(cpuctx);
> > -                       group_update_userpage(event);
> >                 }
> > +
> > +               *can_add_hw = 0;
> > +               ctx->rotate_necessary = 1;
> > +               perf_mux_hrtimer_restart(cpuctx);
> 
> Not sure about this.  We might not want to rotate them
> if a pinned event failed...?

It's just a straight revert, but you're right, this stuff needs
some improvement.


* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-09 21:35   ` Namhyung Kim
@ 2021-12-10 10:33     ` Peter Zijlstra
  2021-12-10 23:19       ` Namhyung Kim
  2021-12-17 16:35       ` Peter Zijlstra
  0 siblings, 2 replies; 24+ messages in thread
From: Peter Zijlstra @ 2021-12-10 10:33 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Alexander Shishkin, LKML, Stephane Eranian, Andi Kleen,
	Ian Rogers, Song Liu

On Thu, Dec 09, 2021 at 01:35:11PM -0800, Namhyung Kim wrote:
> On Thu, Dec 9, 2021 at 3:26 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Sun, Dec 05, 2021 at 02:48:43PM -0800, Namhyung Kim wrote:
> >
> > > Actually 18446744069443110306 is 0xffffffff01b345a2 so it seems to
> > > have a negative enabled time.  In fact, bperf keeps values returned by
> > > bpf_perf_event_read_value() which calls perf_event_read_local(), and
> > > accumulates delta between two calls.  When event->shadow_ctx_time is
> > > not set, it'd return invalid enabled time which is bigger than normal.
> >
> > *that*, how does it happen that shadow_time isn't set? It should be last
> > set when the event switches to INACTIVE, no?
> 
> As you can see, perf_event_set_state() doesn't set the shadow time.
> It's called from event_sched_in() which might result in ACTIVE or
> INACTIVE.  But the problem is that there's a case that event_sched_in
> was not called at all - when group_can_go_on() returns false.
> 
> > At which point the logic in
> > perf_event_read_local() should make @enabled move forward while @running
> > stays put.
> 
> It's not about updating event->total_time_enabled, it only
> afftects the returned value of @enabled.
> 
> I'd say the time calculation is broken so it'd break @running
> as well.  But this case can only happen on INACTIVE -
> otherwise it'd call event_sched_in() and update the shadow
> time properly, so no issue there.  And then we can see
> the broken value of enabled time only.

I'm thinking this is a cgroup-specific thing. Normally the shadow_time
thing is simply a relative displacement between event-time and the
global clock. That displacement never changes, except when you do
IOC_DISABLE/IOC_ENABLE.
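
For reference, that displacement is what perf_set_shadow_time() stores
and what perf_event_read_local() adds back in (as in the current code):

	/* at sched-in / mmap: */
	event->shadow_ctx_time = event->tstamp - ctx->timestamp;

	/* on the read side: */
	now = event->shadow_ctx_time + perf_clock();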

However, for cgroup things are different, since the cgroup events aren't
unconditionally runnable, that is, the enabled time should only count
when the cgroup is active, right?

So perhaps perf_event_read_local() should use a cgroup clock instead of
perf_clock() for cgroup events.

Let me think about that some more...


* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-10 10:19       ` Peter Zijlstra
@ 2021-12-10 18:59         ` Namhyung Kim
  2021-12-20  9:39           ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Namhyung Kim @ 2021-12-10 18:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Alexander Shishkin, LKML, Stephane Eranian, Andi Kleen,
	Ian Rogers, Song Liu

On Fri, Dec 10, 2021 at 2:20 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Dec 09, 2021 at 01:51:42PM -0800, Namhyung Kim wrote:
> > On Thu, Dec 9, 2021 at 3:35 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Thu, Dec 09, 2021 at 12:26:32PM +0100, Peter Zijlstra wrote:
> > > > On Sun, Dec 05, 2021 at 02:48:43PM -0800, Namhyung Kim wrote:
> > > >
> > > > > Actually 18446744069443110306 is 0xffffffff01b345a2 so it seems to
> > > > > have a negative enabled time.  In fact, bperf keeps values returned by
> > > > > bpf_perf_event_read_value() which calls perf_event_read_local(), and
> > > > > accumulates delta between two calls.  When event->shadow_ctx_time is
> > > > > not set, it'd return invalid enabled time which is bigger than normal.
> > > >
> > > > *that*, how does it happen that shadow_time isn't set? It should be last
> > > > set when the event switches to INACTIVE, no? At which point the logic in
> > > > perf_event_read_local() should make @enabled move forward while @running
> > > > stays put.
> > > >
> > > > Let me go rummage around a bit... either I'm missing something obvious
> > > > or something's smelly.
> > >
> > > How's this then?
> >
> > Still the same :(
>
> You're doing that bpf-cgroup crud, right? Where exactly do you hook into
> to do the counter reads?

That's true but it doesn't use cgroup events actually.  They are plain cpu
events and BPF is called from a separate 'cgroup-switches' event to
read out the counters.

>
> > Maybe because the event is enabled from the beginning.
> > Then it might miss set_state/update_time at all.
>
> Even then, it's set to INACTIVE and any state change thereafter needs to
> go through perf_event_set_state() and update the relevant timestamps.

Right, but the problem happens when you read the event *before*
any state change.

Thanks,
Namhyung


* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-10 10:33     ` Peter Zijlstra
@ 2021-12-10 23:19       ` Namhyung Kim
  2021-12-17 16:35       ` Peter Zijlstra
  1 sibling, 0 replies; 24+ messages in thread
From: Namhyung Kim @ 2021-12-10 23:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Alexander Shishkin, LKML, Stephane Eranian, Andi Kleen,
	Ian Rogers, Song Liu

On Fri, Dec 10, 2021 at 2:33 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Dec 09, 2021 at 01:35:11PM -0800, Namhyung Kim wrote:
> > On Thu, Dec 9, 2021 at 3:26 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Sun, Dec 05, 2021 at 02:48:43PM -0800, Namhyung Kim wrote:
> > >
> > > > Actually 18446744069443110306 is 0xffffffff01b345a2 so it seems to
> > > > have a negative enabled time.  In fact, bperf keeps values returned by
> > > > bpf_perf_event_read_value() which calls perf_event_read_local(), and
> > > > accumulates delta between two calls.  When event->shadow_ctx_time is
> > > > not set, it'd return invalid enabled time which is bigger than normal.
> > >
> > > *that*, how does it happen that shadow_time isn't set? It should be last
> > > set when the event switches to INACTIVE, no?
> >
> > As you can see, perf_event_set_state() doesn't set the shadow time.
> > It's called from event_sched_in() which might result in ACTIVE or
> > INACTIVE.  But the problem is that there's a case that event_sched_in
> > was not called at all - when group_can_go_on() returns false.
> >
> > > At which point the logic in
> > > perf_event_read_local() should make @enabled move forward while @running
> > > stays put.
> >
> > It's not about updating event->total_time_enabled, it only
> > afftects the returned value of @enabled.
> >
> > I'd say the time calculation is broken so it'd break @running
> > as well.  But this case can only happen on INACTIVE -
> > otherwise it'd call event_sched_in() and update the shadow
> > time properly, so no issue there.  And then we can see
> > the broken value of enabled time only.
>
> I'm thinking this is a cgroup specific thing. Normally the shadow_time
> thing is simply a relative displacement between event-time and the
> global clock. That displacement never changes, except when you do
> IOC_DISABLE/IOC_ENABLE.

I think it changes when the events are scheduled in and out.
The global clock (ctx->timestamp) is constantly changing
when any event in the context is scheduled while event-
time might not change if the event is not scheduled, no?

Anyway, as I told you this is not a cgroup event.
The point of the BPF work was not to use cgroup events
and my example in the commit message was not about
cgroups at all.

The cgroup event has its own set of problems.. sigh.
I'll post one that I hit recently.

>
> However, for cgroup things are different, since the cgroup events aren't
> unconditionally runnable, that is, the enabled time should only count
> when the cgroup is active, right?

Yeah, that's my understanding.

>
> So perhaps perf_event_read_local() should use a cgroup clock instead of
> perf_clock() for cgroup events.
>
> Let me think about that some more...

Thanks,
Namhyung


* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-10 10:33     ` Peter Zijlstra
  2021-12-10 23:19       ` Namhyung Kim
@ 2021-12-17 16:35       ` Peter Zijlstra
  2021-12-18  9:09         ` Song Liu
  2021-12-20 12:19         ` Peter Zijlstra
  1 sibling, 2 replies; 24+ messages in thread
From: Peter Zijlstra @ 2021-12-17 16:35 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Alexander Shishkin, LKML, Stephane Eranian, Andi Kleen,
	Ian Rogers, Song Liu

On Fri, Dec 10, 2021 at 11:33:41AM +0100, Peter Zijlstra wrote:

> I'm thinking this is a cgroup specific thing. Normally the shadow_time
> thing is simply a relative displacement between event-time and the
> global clock. That displacement never changes, except when you do
> IOC_DISABLE/IOC_ENABLE.
> 
> However, for cgroup things are different, since the cgroup events aren't
> unconditionally runnable, that is, the enabled time should only count
> when the cgroup is active, right?
> 
> So perhaps perf_event_read_local() should use a cgroup clock instead of
> perf_clock() for cgroup events.
> 
> Let me think about that some more...

How's this then? Song, could you also please test and/or better explain
the problem f79256532682 pretends to cure? Because the below reverts
that; I *really* hate having to touch the events we're not scheduling.

So basically enabled runs at the ctx->time (or cgroup->info->time) rate
but is offset; the 0-point is when the event is first ENABLED (assuming
it is never disabled thereafter).

Now, ctx->time advances as long as a single event lives in the context,
which means the event's enabled time should advance right along with it,
irrespective of when that event was last scheduled.

So the value we need to compute is ctx->time, and shadow_ctx_time is in
completely the wrong place (event vs ctx).
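
Restated, for an event that is never disabled:

	enabled = ctx->time - (ctx->time at the moment the event was ENABLED)

i.e. 'enabled' is just ctx->time with a constant offset, which is why
the state that matters lives with the context, not with the event.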

The crucial bit seems to be that we should not advance ctx time when
the context isn't actually active.  The regular ctx seems to have most
of the bits there, except some ordering, while cgroup->info needs to
track activity.

*very* lightly tested, please give it a spin and contribute all
failures observed to perf-test, or something.

---
 include/linux/perf_event.h |  15 +---
 kernel/events/core.c       | 214 ++++++++++++++++++++++-----------------------
 2 files changed, 107 insertions(+), 122 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 7b7525e9155f..62b921561072 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -683,18 +683,6 @@ struct perf_event {
 	u64				total_time_running;
 	u64				tstamp;
 
-	/*
-	 * timestamp shadows the actual context timing but it can
-	 * be safely used in NMI interrupt context. It reflects the
-	 * context time as it was when the event was last scheduled in,
-	 * or when ctx_sched_in failed to schedule the event because we
-	 * run out of PMC.
-	 *
-	 * ctx_time already accounts for ctx->timestamp. Therefore to
-	 * compute ctx_time for a sample, simply add perf_clock().
-	 */
-	u64				shadow_ctx_time;
-
 	struct perf_event_attr		attr;
 	u16				header_size;
 	u16				id_header_size;
@@ -841,6 +829,7 @@ struct perf_event_context {
 	 */
 	u64				time;
 	u64				timestamp;
+	u64				timeoffset;
 
 	/*
 	 * These fields let us detect when two contexts have both
@@ -923,6 +912,8 @@ struct bpf_perf_event_data_kern {
 struct perf_cgroup_info {
 	u64				time;
 	u64				timestamp;
+	u64				timeoffset;
+	int				active;
 };
 
 struct perf_cgroup {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e364ea5bf167..08e5e9f1a155 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -719,34 +719,54 @@ static inline u64 perf_cgroup_event_time(struct perf_event *event)
 	return t->time;
 }
 
-static inline void __update_cgrp_time(struct perf_cgroup *cgrp)
+static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
 {
-	struct perf_cgroup_info *info;
-	u64 now;
-
-	now = perf_clock();
+	struct perf_cgroup_info *t;
 
-	info = this_cpu_ptr(cgrp->info);
+	t = per_cpu_ptr(event->cgrp->info, event->cpu);
+	if (!t->active) {
+		barrier();
+		return t->time;
+	}
+	now += READ_ONCE(t->timeoffset);
+	return now;
+}
 
+static inline void __update_cgrp_time(struct perf_cgroup_info *info, u64 now)
+{
 	info->time += now - info->timestamp;
 	info->timestamp = now;
+	/*
+	 * see update_context_time()
+	 */
+	WRITE_ONCE(info->timeoffset, info->time - info->timestamp);
 }
 
-static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx)
+static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx, bool final)
 {
 	struct perf_cgroup *cgrp = cpuctx->cgrp;
 	struct cgroup_subsys_state *css;
+	struct perf_cgroup_info *info;
 
 	if (cgrp) {
+		u64 now = perf_clock();
+
 		for (css = &cgrp->css; css; css = css->parent) {
 			cgrp = container_of(css, struct perf_cgroup, css);
-			__update_cgrp_time(cgrp);
+			info = this_cpu_ptr(cgrp->info);
+
+			__update_cgrp_time(info, now);
+			if (final) {
+				barrier();
+				info->active = 0;
+			}
 		}
 	}
 }
 
 static inline void update_cgrp_time_from_event(struct perf_event *event)
 {
+	struct perf_cgroup_info *info;
 	struct perf_cgroup *cgrp;
 
 	/*
@@ -760,8 +780,10 @@ static inline void update_cgrp_time_from_event(struct perf_event *event)
 	/*
 	 * Do not update time when cgroup is not active
 	 */
-	if (cgroup_is_descendant(cgrp->css.cgroup, event->cgrp->css.cgroup))
-		__update_cgrp_time(event->cgrp);
+	if (cgroup_is_descendant(cgrp->css.cgroup, event->cgrp->css.cgroup)) {
+		info = this_cpu_ptr(event->cgrp->info);
+		__update_cgrp_time(info, perf_clock());
+	}
 }
 
 static inline void
@@ -786,6 +808,8 @@ perf_cgroup_set_timestamp(struct task_struct *task,
 		cgrp = container_of(css, struct perf_cgroup, css);
 		info = this_cpu_ptr(cgrp->info);
 		info->timestamp = ctx->timestamp;
+		barrier();
+		info->active = 1;
 	}
 }
 
@@ -981,14 +1005,6 @@ static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 	return ret;
 }
 
-static inline void
-perf_cgroup_set_shadow_time(struct perf_event *event, u64 now)
-{
-	struct perf_cgroup_info *t;
-	t = per_cpu_ptr(event->cgrp->info, event->cpu);
-	event->shadow_ctx_time = now - t->timestamp;
-}
-
 static inline void
 perf_cgroup_event_enable(struct perf_event *event, struct perf_event_context *ctx)
 {
@@ -1066,7 +1082,8 @@ static inline void update_cgrp_time_from_event(struct perf_event *event)
 {
 }
 
-static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx)
+static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx,
+						bool final)
 {
 }
 
@@ -1098,12 +1115,12 @@ perf_cgroup_switch(struct task_struct *task, struct task_struct *next)
 {
 }
 
-static inline void
-perf_cgroup_set_shadow_time(struct perf_event *event, u64 now)
+static inline u64 perf_cgroup_event_time(struct perf_event *event)
 {
+	return 0;
 }
 
-static inline u64 perf_cgroup_event_time(struct perf_event *event)
+static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
 {
 	return 0;
 }
@@ -1531,16 +1548,48 @@ static void update_context_time(struct perf_event_context *ctx)
 
 	ctx->time += now - ctx->timestamp;
 	ctx->timestamp = now;
+
+	/*
+	 * The above: time' = time + (now - timestamp), can be re-arranged
+	 * into: time` = now + (time - timestamp), which gives a single value
+	 * offset to compute future time without locks on.
+	 *
+	 * See perf_event_time_now(), which can be used from NMI context where
+	 * it's (obviously) not possible to acquire ctx->lock in order to read
+	 * both the above values in a consistent manner.
+	 */
+	WRITE_ONCE(ctx->timeoffset, ctx->time - ctx->timestamp);
 }
 
 static u64 perf_event_time(struct perf_event *event)
 {
 	struct perf_event_context *ctx = event->ctx;
 
+	if (unlikely(!ctx))
+		return 0;
+
 	if (is_cgroup_event(event))
 		return perf_cgroup_event_time(event);
 
-	return ctx ? ctx->time : 0;
+	return ctx->time;
+}
+
+static u64 perf_event_time_now(struct perf_event *event, u64 now)
+{
+	struct perf_event_context *ctx = event->ctx;
+
+	if (unlikely(!ctx))
+		return 0;
+
+	if (is_cgroup_event(event))
+		return perf_cgroup_event_time_now(event, now);
+
+	if (!(ctx->is_active & EVENT_TIME)) {
+		barrier();
+		return ctx->time;
+	}
+	now += READ_ONCE(ctx->timeoffset);
+	return now;
 }
 
 static enum event_type_t get_event_type(struct perf_event *event)
@@ -2346,7 +2395,7 @@ __perf_remove_from_context(struct perf_event *event,
 
 	if (ctx->is_active & EVENT_TIME) {
 		update_context_time(ctx);
-		update_cgrp_time_from_cpuctx(cpuctx);
+		update_cgrp_time_from_cpuctx(cpuctx, false);
 	}
 
 	event_sched_out(event, cpuctx, ctx);
@@ -2357,6 +2406,9 @@ __perf_remove_from_context(struct perf_event *event,
 	list_del_event(event, ctx);
 
 	if (!ctx->nr_events && ctx->is_active) {
+		if (ctx == &cpuctx->ctx)
+			update_cgrp_time_from_cpuctx(cpuctx, true);
+
 		ctx->is_active = 0;
 		ctx->rotate_necessary = 0;
 		if (ctx->task) {
@@ -2478,40 +2530,6 @@ void perf_event_disable_inatomic(struct perf_event *event)
 	irq_work_queue(&event->pending);
 }
 
-static void perf_set_shadow_time(struct perf_event *event,
-				 struct perf_event_context *ctx)
-{
-	/*
-	 * use the correct time source for the time snapshot
-	 *
-	 * We could get by without this by leveraging the
-	 * fact that to get to this function, the caller
-	 * has most likely already called update_context_time()
-	 * and update_cgrp_time_xx() and thus both timestamp
-	 * are identical (or very close). Given that tstamp is,
-	 * already adjusted for cgroup, we could say that:
-	 *    tstamp - ctx->timestamp
-	 * is equivalent to
-	 *    tstamp - cgrp->timestamp.
-	 *
-	 * Then, in perf_output_read(), the calculation would
-	 * work with no changes because:
-	 * - event is guaranteed scheduled in
-	 * - no scheduled out in between
-	 * - thus the timestamp would be the same
-	 *
-	 * But this is a bit hairy.
-	 *
-	 * So instead, we have an explicit cgroup call to remain
-	 * within the time source all along. We believe it
-	 * is cleaner and simpler to understand.
-	 */
-	if (is_cgroup_event(event))
-		perf_cgroup_set_shadow_time(event, event->tstamp);
-	else
-		event->shadow_ctx_time = event->tstamp - ctx->timestamp;
-}
-
 #define MAX_INTERRUPTS (~0ULL)
 
 static void perf_log_throttle(struct perf_event *event, int enable);
@@ -2552,8 +2570,6 @@ event_sched_in(struct perf_event *event,
 
 	perf_pmu_disable(event->pmu);
 
-	perf_set_shadow_time(event, ctx);
-
 	perf_log_itrace_start(event);
 
 	if (event->pmu->add(event, PERF_EF_START)) {
@@ -3247,16 +3263,6 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 		return;
 	}
 
-	ctx->is_active &= ~event_type;
-	if (!(ctx->is_active & EVENT_ALL))
-		ctx->is_active = 0;
-
-	if (ctx->task) {
-		WARN_ON_ONCE(cpuctx->task_ctx != ctx);
-		if (!ctx->is_active)
-			cpuctx->task_ctx = NULL;
-	}
-
 	/*
 	 * Always update time if it was set; not only when it changes.
 	 * Otherwise we can 'forget' to update time for any but the last
@@ -3270,7 +3276,18 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 	if (is_active & EVENT_TIME) {
 		/* update (and stop) ctx time */
 		update_context_time(ctx);
-		update_cgrp_time_from_cpuctx(cpuctx);
+		update_cgrp_time_from_cpuctx(cpuctx, ctx == &cpuctx->ctx);
+		barrier();
+	}
+
+	ctx->is_active &= ~event_type;
+	if (!(ctx->is_active & EVENT_ALL))
+		ctx->is_active = 0;
+
+	if (ctx->task) {
+		WARN_ON_ONCE(cpuctx->task_ctx != ctx);
+		if (!ctx->is_active)
+			cpuctx->task_ctx = NULL;
 	}
 
 	is_active ^= ctx->is_active; /* changed bits */
@@ -3707,29 +3724,6 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 	return 0;
 }
 
-static inline bool event_update_userpage(struct perf_event *event)
-{
-	if (likely(!atomic_read(&event->mmap_count)))
-		return false;
-
-	perf_event_update_time(event);
-	perf_set_shadow_time(event, event->ctx);
-	perf_event_update_userpage(event);
-
-	return true;
-}
-
-static inline void group_update_userpage(struct perf_event *group_event)
-{
-	struct perf_event *event;
-
-	if (!event_update_userpage(group_event))
-		return;
-
-	for_each_sibling_event(event, group_event)
-		event_update_userpage(event);
-}
-
 static int merge_sched_in(struct perf_event *event, void *data)
 {
 	struct perf_event_context *ctx = event->ctx;
@@ -3748,15 +3742,14 @@ static int merge_sched_in(struct perf_event *event, void *data)
 	}
 
 	if (event->state == PERF_EVENT_STATE_INACTIVE) {
-		*can_add_hw = 0;
 		if (event->attr.pinned) {
 			perf_cgroup_event_disable(event, ctx);
 			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
-		} else {
-			ctx->rotate_necessary = 1;
-			perf_mux_hrtimer_restart(cpuctx);
-			group_update_userpage(event);
 		}
+
+		*can_add_hw = 0;
+		ctx->rotate_necessary = 1;
+		perf_mux_hrtimer_restart(cpuctx);
 	}
 
 	return 0;
@@ -3804,6 +3797,14 @@ ctx_sched_in(struct perf_event_context *ctx,
 	if (likely(!ctx->nr_events))
 		return;
 
+	if (is_active ^ EVENT_TIME) {
+		/* start ctx time */
+		now = perf_clock();
+		ctx->timestamp = now;
+		perf_cgroup_set_timestamp(task, ctx);
+		barrier();
+	}
+
 	ctx->is_active |= (event_type | EVENT_TIME);
 	if (ctx->task) {
 		if (!is_active)
@@ -3814,13 +3815,6 @@ ctx_sched_in(struct perf_event_context *ctx,
 
 	is_active ^= ctx->is_active; /* changed bits */
 
-	if (is_active & EVENT_TIME) {
-		/* start ctx time */
-		now = perf_clock();
-		ctx->timestamp = now;
-		perf_cgroup_set_timestamp(task, ctx);
-	}
-
 	/*
 	 * First go through the list and put on any pinned groups
 	 * in order to give them the best chance of going on.
@@ -4473,10 +4467,11 @@ int perf_event_read_local(struct perf_event *event, u64 *value,
 
 	*value = local64_read(&event->count);
 	if (enabled || running) {
-		u64 now = event->shadow_ctx_time + perf_clock();
+		u64 ctx_time, now = perf_clock();
 		u64 __enabled, __running;
 
-		__perf_update_times(event, now, &__enabled, &__running);
+		ctx_time = perf_event_time_now(event, now);
+		__perf_update_times(event, ctx_time, &__enabled, &__running);
 		if (enabled)
 			*enabled = __enabled;
 		if (running)
@@ -5806,7 +5801,7 @@ static void calc_timer_values(struct perf_event *event,
 	u64 ctx_time;
 
 	*now = perf_clock();
-	ctx_time = event->shadow_ctx_time + *now;
+	ctx_time = perf_event_time_now(event, *now);
 	__perf_update_times(event, ctx_time, enabled, running);
 }
 
@@ -6349,7 +6344,6 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
 		ring_buffer_attach(event, rb);
 
 		perf_event_update_time(event);
-		perf_set_shadow_time(event, event->ctx);
 		perf_event_init_userpage(event);
 		perf_event_update_userpage(event);
 	} else {

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-17 16:35       ` Peter Zijlstra
@ 2021-12-18  9:09         ` Song Liu
  2021-12-20  9:30           ` Peter Zijlstra
  2021-12-20 12:19         ` Peter Zijlstra
  1 sibling, 1 reply; 24+ messages in thread
From: Song Liu @ 2021-12-18  9:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Namhyung Kim, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Andi Kleen, Ian Rogers



> On Dec 17, 2021, at 8:35 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Fri, Dec 10, 2021 at 11:33:41AM +0100, Peter Zijlstra wrote:
> 
>> I'm thinking this is a cgroup specific thing. Normally the shadow_time
>> thing is simply a relative displacement between event-time and the
>> global clock. That displacement never changes, except when you do
>> IOC_DISABLE/IOC_ENABLE.
>> 
>> However, for cgroup things are different, since the cgroup events aren't
>> unconditionally runnable, that is, the enabled time should only count
>> when the cgroup is active, right?
>> 
>> So perhaps perf_event_read_local() should use a cgroup clock instead of
>> perf_clock() for cgroup events.
>> 
>> Let me think about that some more...
> 
> How's this then? Song, could you also please test and or better explain
> the problem f79256532682 pretends to cure? Because the below is
> reverting that, I *really* hate having to touch the events we're not
> scheduling.

Unfortunately, this change brings the bug back. For time_enabled in the rdpmc
case to work properly, we have to touch all the enabled but not running 
events, right?

Thanks,
Song

[...]


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-18  9:09         ` Song Liu
@ 2021-12-20  9:30           ` Peter Zijlstra
  2021-12-20  9:54             ` Peter Zijlstra
  2021-12-21 12:39             ` Peter Zijlstra
  0 siblings, 2 replies; 24+ messages in thread
From: Peter Zijlstra @ 2021-12-20  9:30 UTC (permalink / raw)
  To: Song Liu
  Cc: Namhyung Kim, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Andi Kleen, Ian Rogers

On Sat, Dec 18, 2021 at 09:09:05AM +0000, Song Liu wrote:
> 
> 
> > On Dec 17, 2021, at 8:35 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > On Fri, Dec 10, 2021 at 11:33:41AM +0100, Peter Zijlstra wrote:
> > 
> >> I'm thinking this is a cgroup specific thing. Normally the shadow_time
> >> thing is simply a relative displacement between event-time and the
> >> global clock. That displacement never changes, except when you do
> >> IOC_DISABLE/IOC_ENABLE.
> >> 
> >> However, for cgroup things are different, since the cgroup events aren't
> >> unconditionally runnable, that is, the enabled time should only count
> >> when the cgroup is active, right?
> >> 
> >> So perhaps perf_event_read_local() should use a cgroup clock instead of
> >> perf_clock() for cgroup events.
> >> 
> >> Let me think about that some more...
> > 
> > How's this then? Song, could you also please test and or better explain
> > the problem f79256532682 pretends to cure? Because the below is
> > reverting that, I *really* hate having to touch the events we're not
> > scheduling.
> 
> Unfortunately, this change brings the bug back. For time_enabled in the rdpmc
> case to work properly, we have to touch all the enabled but not running 
> events, right?

Ohh.. argh. I think I see why, it loses the context time enable edge,
and because this is all strictly per-event in the uapi (there is no ctx
representation) it can't be cured by improving ctx time handling :/

Bah, I so hate this.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-10 18:59         ` Namhyung Kim
@ 2021-12-20  9:39           ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2021-12-20  9:39 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Alexander Shishkin, LKML, Stephane Eranian, Andi Kleen,
	Ian Rogers, Song Liu

On Fri, Dec 10, 2021 at 10:59:49AM -0800, Namhyung Kim wrote:

> > You're doing that bpf-cgroup crud, right? Where exactly do you hook into
> > to do the counter reads?
> 
> That's true but it doesn't use cgroup events actually.  They are plain cpu
> events and BPF is called from a separate 'cgroup-switches' event to
> read out the counters.

Oh, right.

> > > Maybe because the event is enabled from the beginning.
> > > Then it might miss set_state/update_time at all.
> >
> > Even then, it's set to INACTIVE and any state change thereafter needs to
> > go through perf_event_set_state() and update the relevant timestamps.
> 
> Right, but the problem happens when you read the event *before*
> any state change.

But the per-cpu event should be the simplest case ever, the cpu context
is *always* active, enabled time always runs.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-20  9:30           ` Peter Zijlstra
@ 2021-12-20  9:54             ` Peter Zijlstra
  2021-12-21 12:39             ` Peter Zijlstra
  1 sibling, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2021-12-20  9:54 UTC (permalink / raw)
  To: Song Liu
  Cc: Namhyung Kim, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Andi Kleen, Ian Rogers

On Mon, Dec 20, 2021 at 10:30:34AM +0100, Peter Zijlstra wrote:
> On Sat, Dec 18, 2021 at 09:09:05AM +0000, Song Liu wrote:

> > Unfortunately, this change brings the bug back. For time_enabled in the rdpmc
> > case to work properly, we have to touch all the enabled but not running 
> > events, right?
> 
> Ohh.. argh. I think I see why, it loses the context time enable edge,
> and because this is all strictly per-event in the uapi (there is no ctx
> representation) it can't be cured by improving ctx time handling :/
> 
> Bah, I so hate this.

For now I've added this then ...

+/*
+ * Because the userpage is strictly per-event (there is no concept of context,
+ * so there cannot be a context indirection), every userpage must be updated
+ * when context time starts :-(
+ *
+ * IOW, we must not miss EVENT_TIME edges.
+ */
 static inline bool event_update_userpage(struct perf_event *event)


But that same is not true for perf_event_read_local(), that *has* access
to the context and so must be able to DTRT.
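
For reference, the usual self-monitoring read against the mmap'ed page goes
roughly like the sketch below (patterned after the documented
perf_event_mmap_page sequence, not code from this thread; rdpmc() stands in
for the arch-specific counter read, and the pmc_width sign-extension plus
the time_offset/time_mult conversion are left out):

	#include <stdint.h>
	#include <linux/perf_event.h>

	/* placeholder for the arch-specific counter read (RDPMC on x86) */
	extern uint64_t rdpmc(uint32_t idx);

	static void self_monitor_read(struct perf_event_mmap_page *pc,
				      uint64_t *count, uint64_t *enabled,
				      uint64_t *running)
	{
		uint32_t seq, idx;

		do {
			seq = pc->lock;
			__asm__ volatile("" ::: "memory");	/* compiler barrier */

			*enabled = pc->time_enabled;
			*running = pc->time_running;

			idx = pc->index;
			*count = pc->offset;
			if (pc->cap_user_rdpmc && idx)
				*count += rdpmc(idx - 1);

			__asm__ volatile("" ::: "memory");
		} while (pc->lock != seq);
	}

Since this loop only ever sees the per-event page, a stale time_enabled in
there cannot be patched up from the context side afterwards.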

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-17 16:35       ` Peter Zijlstra
  2021-12-18  9:09         ` Song Liu
@ 2021-12-20 12:19         ` Peter Zijlstra
  2021-12-21  5:54           ` Namhyung Kim
                             ` (2 more replies)
  1 sibling, 3 replies; 24+ messages in thread
From: Peter Zijlstra @ 2021-12-20 12:19 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Alexander Shishkin, LKML, Stephane Eranian, Andi Kleen,
	Ian Rogers, Song Liu


How's this then?

---
Subject: perf: Fix perf_event_read_local() time
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon Dec 20 09:59:47 CET 2021

Time readers that cannot take locks (due to NMI etc..) currently make
use of perf_event::shadow_ctx_time, which, for that event gives:

	time' = now + (time - timestamp)

or, alternatively arranged:

	time' = time + (now - timestamp)

IOW, the progression of time since the last time the shadow_ctx_time
was updated.
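
As a quick stand-alone sanity check of that rearrangement (plain C, made-up
numbers, nothing taken from the patch itself):

	#include <assert.h>
	typedef unsigned long long u64;

	int main(void)
	{
		u64 time = 100, timestamp = 80, now = 130;	/* arbitrary ns values */
		u64 timeoffset = time - timestamp;		/* 20 */

		/* both arrangements yield the same reading: 150 */
		assert(time + (now - timestamp) == now + timeoffset);
		return 0;
	}

so a single snapshotted offset is all a lockless reader needs.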

There are problems with this:

 A) the shadow_ctx_time is per-event, even though the ctx_time it
    reflects is obviously per context. The direct consequence of this
    is that the context needs to iterate all events all the time to
    keep the shadow_ctx_time in sync.

 B) even with the prior point, the context itself might not be active
    meaning its time should not advance to begin with.

 C) shadow_ctx_time isn't consistently updated when ctx_time is

There are 3 users of this stuff, that suffer differently from this:

 - calc_timer_values()
   - perf_output_read()
   - perf_event_update_userpage()	/* A */

 - perf_event_read_local()		/* A,B */

In particular, perf_output_read() doesn't suffer at all, because it's
sample driven and hence only relevant when the event is actually
running.

The same was supposed to be true for perf_event_update_userpage(),
after all self-monitoring implies the context is active *HOWEVER*, as
per commit f79256532682 ("perf/core: fix userpage->time_enabled of
inactive events") this goes wrong when combined with counter
overcommit, in that case those events that do not get scheduled when
the context becomes active (task events typically) miss out on the
EVENT_TIME update and ENABLED time is inflated (for a little while)
with the time the context was inactive. Once the event gets rotated
in, this gets corrected, leading to a non-monotonic timeflow.

perf_event_read_local() made things even worse, it can request time at
any point, suffering all the problems perf_event_update_userpage()
does and more. Because while perf_event_update_userpage() is limited
by the context being active, perf_event_read_local() users have no
such constraint.

Therefore, completely overhaul things and do away with
perf_event::shadow_ctx_time. Instead have regular context time updates
keep track of this offset directly and provide perf_event_time_now()
to complement perf_event_time().

perf_event_time_now() will, in addition to being context wide, also
take into account if the context is active. For inactive context, it
will not advance time.

This latter property means the cgroup perf_cgroup_info context needs
to grow additional state to track this.

Additionally, since all this is strictly per-cpu, we can use barrier()
to order context activity vs context time.

Fixes: 7d9285e82db5 ("perf/bpf: Extend the perf_event_read_local() interface, a.k.a. "bpf: perf event change needed for subsequent bpf helpers"")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/perf_event.h |   15 ---
 kernel/events/core.c       |  224 +++++++++++++++++++++++++++------------------
 2 files changed, 138 insertions(+), 101 deletions(-)

--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -683,18 +683,6 @@ struct perf_event {
 	u64				total_time_running;
 	u64				tstamp;
 
-	/*
-	 * timestamp shadows the actual context timing but it can
-	 * be safely used in NMI interrupt context. It reflects the
-	 * context time as it was when the event was last scheduled in,
-	 * or when ctx_sched_in failed to schedule the event because we
-	 * run out of PMC.
-	 *
-	 * ctx_time already accounts for ctx->timestamp. Therefore to
-	 * compute ctx_time for a sample, simply add perf_clock().
-	 */
-	u64				shadow_ctx_time;
-
 	struct perf_event_attr		attr;
 	u16				header_size;
 	u16				id_header_size;
@@ -841,6 +829,7 @@ struct perf_event_context {
 	 */
 	u64				time;
 	u64				timestamp;
+	u64				timeoffset;
 
 	/*
 	 * These fields let us detect when two contexts have both
@@ -923,6 +912,8 @@ struct bpf_perf_event_data_kern {
 struct perf_cgroup_info {
 	u64				time;
 	u64				timestamp;
+	u64				timeoffset;
+	int				active;
 };
 
 struct perf_cgroup {
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -674,6 +674,23 @@ perf_event_set_state(struct perf_event *
 	WRITE_ONCE(event->state, state);
 }
 
+/*
+ * UP store-release, load-acquire
+ */
+
+#define __store_release(ptr, val)					\
+do {									\
+	barrier();							\
+	WRITE_ONCE(*(ptr), (val));					\
+} while (0)
+
+#define __load_acquire(ptr)						\
+({									\
+	__unqual_scalar_typeof(*(ptr)) ___p = READ_ONCE(*(ptr));	\
+	barrier();							\
+	___p;								\
+})
+
 #ifdef CONFIG_CGROUP_PERF
 
 static inline bool
@@ -719,34 +736,51 @@ static inline u64 perf_cgroup_event_time
 	return t->time;
 }
 
-static inline void __update_cgrp_time(struct perf_cgroup *cgrp)
+static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
 {
-	struct perf_cgroup_info *info;
-	u64 now;
-
-	now = perf_clock();
+	struct perf_cgroup_info *t;
 
-	info = this_cpu_ptr(cgrp->info);
+	t = per_cpu_ptr(event->cgrp->info, event->cpu);
+	if (!__load_acquire(&t->active))
+		return t->time;
+	now += READ_ONCE(t->timeoffset);
+	return now;
+}
 
-	info->time += now - info->timestamp;
+static inline void __update_cgrp_time(struct perf_cgroup_info *info, u64 now, bool adv)
+{
+	if (adv)
+		info->time += now - info->timestamp;
 	info->timestamp = now;
+	/*
+	 * see update_context_time()
+	 */
+	WRITE_ONCE(info->timeoffset, info->time - info->timestamp);
 }
 
-static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx)
+static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx, bool final)
 {
 	struct perf_cgroup *cgrp = cpuctx->cgrp;
 	struct cgroup_subsys_state *css;
+	struct perf_cgroup_info *info;
 
 	if (cgrp) {
+		u64 now = perf_clock();
+
 		for (css = &cgrp->css; css; css = css->parent) {
 			cgrp = container_of(css, struct perf_cgroup, css);
-			__update_cgrp_time(cgrp);
+			info = this_cpu_ptr(cgrp->info);
+
+			__update_cgrp_time(info, now, true);
+			if (final)
+				__store_release(&info->active, 0);
 		}
 	}
 }
 
 static inline void update_cgrp_time_from_event(struct perf_event *event)
 {
+	struct perf_cgroup_info *info;
 	struct perf_cgroup *cgrp;
 
 	/*
@@ -760,8 +794,10 @@ static inline void update_cgrp_time_from
 	/*
 	 * Do not update time when cgroup is not active
 	 */
-	if (cgroup_is_descendant(cgrp->css.cgroup, event->cgrp->css.cgroup))
-		__update_cgrp_time(event->cgrp);
+	if (cgroup_is_descendant(cgrp->css.cgroup, event->cgrp->css.cgroup)) {
+		info = this_cpu_ptr(event->cgrp->info);
+		__update_cgrp_time(info, perf_clock(), true);
+	}
 }
 
 static inline void
@@ -785,7 +821,8 @@ perf_cgroup_set_timestamp(struct task_st
 	for (css = &cgrp->css; css; css = css->parent) {
 		cgrp = container_of(css, struct perf_cgroup, css);
 		info = this_cpu_ptr(cgrp->info);
-		info->timestamp = ctx->timestamp;
+		__update_cgrp_time(info, ctx->timestamp, false);
+		__store_release(&info->active, 1);
 	}
 }
 
@@ -982,14 +1019,6 @@ static inline int perf_cgroup_connect(in
 }
 
 static inline void
-perf_cgroup_set_shadow_time(struct perf_event *event, u64 now)
-{
-	struct perf_cgroup_info *t;
-	t = per_cpu_ptr(event->cgrp->info, event->cpu);
-	event->shadow_ctx_time = now - t->timestamp;
-}
-
-static inline void
 perf_cgroup_event_enable(struct perf_event *event, struct perf_event_context *ctx)
 {
 	struct perf_cpu_context *cpuctx;
@@ -1066,7 +1095,8 @@ static inline void update_cgrp_time_from
 {
 }
 
-static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx)
+static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx,
+						bool final)
 {
 }
 
@@ -1098,12 +1128,12 @@ perf_cgroup_switch(struct task_struct *t
 {
 }
 
-static inline void
-perf_cgroup_set_shadow_time(struct perf_event *event, u64 now)
+static inline u64 perf_cgroup_event_time(struct perf_event *event)
 {
+	return 0;
 }
 
-static inline u64 perf_cgroup_event_time(struct perf_event *event)
+static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
 {
 	return 0;
 }
@@ -1525,22 +1555,59 @@ static void perf_unpin_context(struct pe
 /*
  * Update the record of the current time in a context.
  */
-static void update_context_time(struct perf_event_context *ctx)
+static void __update_context_time(struct perf_event_context *ctx, bool adv)
 {
 	u64 now = perf_clock();
 
-	ctx->time += now - ctx->timestamp;
+	if (adv)
+		ctx->time += now - ctx->timestamp;
 	ctx->timestamp = now;
+
+	/*
+	 * The above: time' = time + (now - timestamp), can be re-arranged
+	 * into: time` = now + (time - timestamp), which gives a single value
+	 * offset to compute future time without locks on.
+	 *
+	 * See perf_event_time_now(), which can be used from NMI context where
+	 * it's (obviously) not possible to acquire ctx->lock in order to read
+	 * both the above values in a consistent manner.
+	 */
+	WRITE_ONCE(ctx->timeoffset, ctx->time - ctx->timestamp);
+}
+
+static void update_context_time(struct perf_event_context *ctx)
+{
+	__update_context_time(ctx, true);
 }
 
 static u64 perf_event_time(struct perf_event *event)
 {
 	struct perf_event_context *ctx = event->ctx;
 
+	if (unlikely(!ctx))
+		return 0;
+
 	if (is_cgroup_event(event))
 		return perf_cgroup_event_time(event);
 
-	return ctx ? ctx->time : 0;
+	return ctx->time;
+}
+
+static u64 perf_event_time_now(struct perf_event *event, u64 now)
+{
+	struct perf_event_context *ctx = event->ctx;
+
+	if (unlikely(!ctx))
+		return 0;
+
+	if (is_cgroup_event(event))
+		return perf_cgroup_event_time_now(event, now);
+
+	if (!(__load_acquire(&ctx->is_active) & EVENT_TIME))
+		return ctx->time;
+
+	now += READ_ONCE(ctx->timeoffset);
+	return now;
 }
 
 static enum event_type_t get_event_type(struct perf_event *event)
@@ -2346,7 +2413,7 @@ __perf_remove_from_context(struct perf_e
 
 	if (ctx->is_active & EVENT_TIME) {
 		update_context_time(ctx);
-		update_cgrp_time_from_cpuctx(cpuctx);
+		update_cgrp_time_from_cpuctx(cpuctx, false);
 	}
 
 	event_sched_out(event, cpuctx, ctx);
@@ -2357,6 +2424,9 @@ __perf_remove_from_context(struct perf_e
 	list_del_event(event, ctx);
 
 	if (!ctx->nr_events && ctx->is_active) {
+		if (ctx == &cpuctx->ctx)
+			update_cgrp_time_from_cpuctx(cpuctx, true);
+
 		ctx->is_active = 0;
 		ctx->rotate_necessary = 0;
 		if (ctx->task) {
@@ -2478,40 +2548,6 @@ void perf_event_disable_inatomic(struct
 	irq_work_queue(&event->pending);
 }
 
-static void perf_set_shadow_time(struct perf_event *event,
-				 struct perf_event_context *ctx)
-{
-	/*
-	 * use the correct time source for the time snapshot
-	 *
-	 * We could get by without this by leveraging the
-	 * fact that to get to this function, the caller
-	 * has most likely already called update_context_time()
-	 * and update_cgrp_time_xx() and thus both timestamp
-	 * are identical (or very close). Given that tstamp is,
-	 * already adjusted for cgroup, we could say that:
-	 *    tstamp - ctx->timestamp
-	 * is equivalent to
-	 *    tstamp - cgrp->timestamp.
-	 *
-	 * Then, in perf_output_read(), the calculation would
-	 * work with no changes because:
-	 * - event is guaranteed scheduled in
-	 * - no scheduled out in between
-	 * - thus the timestamp would be the same
-	 *
-	 * But this is a bit hairy.
-	 *
-	 * So instead, we have an explicit cgroup call to remain
-	 * within the time source all along. We believe it
-	 * is cleaner and simpler to understand.
-	 */
-	if (is_cgroup_event(event))
-		perf_cgroup_set_shadow_time(event, event->tstamp);
-	else
-		event->shadow_ctx_time = event->tstamp - ctx->timestamp;
-}
-
 #define MAX_INTERRUPTS (~0ULL)
 
 static void perf_log_throttle(struct perf_event *event, int enable);
@@ -2552,8 +2588,6 @@ event_sched_in(struct perf_event *event,
 
 	perf_pmu_disable(event->pmu);
 
-	perf_set_shadow_time(event, ctx);
-
 	perf_log_itrace_start(event);
 
 	if (event->pmu->add(event, PERF_EF_START)) {
@@ -3247,16 +3281,6 @@ static void ctx_sched_out(struct perf_ev
 		return;
 	}
 
-	ctx->is_active &= ~event_type;
-	if (!(ctx->is_active & EVENT_ALL))
-		ctx->is_active = 0;
-
-	if (ctx->task) {
-		WARN_ON_ONCE(cpuctx->task_ctx != ctx);
-		if (!ctx->is_active)
-			cpuctx->task_ctx = NULL;
-	}
-
 	/*
 	 * Always update time if it was set; not only when it changes.
 	 * Otherwise we can 'forget' to update time for any but the last
@@ -3270,7 +3294,22 @@ static void ctx_sched_out(struct perf_ev
 	if (is_active & EVENT_TIME) {
 		/* update (and stop) ctx time */
 		update_context_time(ctx);
-		update_cgrp_time_from_cpuctx(cpuctx);
+		update_cgrp_time_from_cpuctx(cpuctx, ctx == &cpuctx->ctx);
+		/*
+		 * CPU-release for the below ->is_active store,
+		 * see __load_acquire() in perf_event_time_now()
+		 */
+		barrier();
+	}
+
+	ctx->is_active &= ~event_type;
+	if (!(ctx->is_active & EVENT_ALL))
+		ctx->is_active = 0;
+
+	if (ctx->task) {
+		WARN_ON_ONCE(cpuctx->task_ctx != ctx);
+		if (!ctx->is_active)
+			cpuctx->task_ctx = NULL;
 	}
 
 	is_active ^= ctx->is_active; /* changed bits */
@@ -3707,13 +3746,19 @@ static noinline int visit_groups_merge(s
 	return 0;
 }
 
+/*
+ * Because the userpage is strictly per-event (there is no concept of context,
+ * so there cannot be a context indirection), every userpage must be updated
+ * when context time starts :-(
+ *
+ * IOW, we must not miss EVENT_TIME edges.
+ */
 static inline bool event_update_userpage(struct perf_event *event)
 {
 	if (likely(!atomic_read(&event->mmap_count)))
 		return false;
 
 	perf_event_update_time(event);
-	perf_set_shadow_time(event, event->ctx);
 	perf_event_update_userpage(event);
 
 	return true;
@@ -3797,13 +3842,23 @@ ctx_sched_in(struct perf_event_context *
 	     struct task_struct *task)
 {
 	int is_active = ctx->is_active;
-	u64 now;
 
 	lockdep_assert_held(&ctx->lock);
 
 	if (likely(!ctx->nr_events))
 		return;
 
+	if (is_active ^ EVENT_TIME) {
+		/* start ctx time */
+		__update_context_time(ctx, false);
+		perf_cgroup_set_timestamp(task, ctx);
+		/*
+		 * CPU-release for the below ->is_active store,
+		 * see __load_acquire() in perf_event_time_now()
+		 */
+		barrier();
+	}
+
 	ctx->is_active |= (event_type | EVENT_TIME);
 	if (ctx->task) {
 		if (!is_active)
@@ -3814,13 +3869,6 @@ ctx_sched_in(struct perf_event_context *
 
 	is_active ^= ctx->is_active; /* changed bits */
 
-	if (is_active & EVENT_TIME) {
-		/* start ctx time */
-		now = perf_clock();
-		ctx->timestamp = now;
-		perf_cgroup_set_timestamp(task, ctx);
-	}
-
 	/*
 	 * First go through the list and put on any pinned groups
 	 * in order to give them the best chance of going on.
@@ -4473,10 +4521,9 @@ int perf_event_read_local(struct perf_ev
 
 	*value = local64_read(&event->count);
 	if (enabled || running) {
-		u64 now = event->shadow_ctx_time + perf_clock();
-		u64 __enabled, __running;
+		u64 __enabled, __running, __now;
 
-		__perf_update_times(event, now, &__enabled, &__running);
+		calc_timer_values(event, &__now, &__enabled, &__running);
 		if (enabled)
 			*enabled = __enabled;
 		if (running)
@@ -5806,7 +5853,7 @@ static void calc_timer_values(struct per
 	u64 ctx_time;
 
 	*now = perf_clock();
-	ctx_time = event->shadow_ctx_time + *now;
+	ctx_time = perf_event_time_now(event, *now);
 	__perf_update_times(event, ctx_time, enabled, running);
 }
 
@@ -6349,7 +6396,6 @@ static int perf_mmap(struct file *file,
 		ring_buffer_attach(event, rb);
 
 		perf_event_update_time(event);
-		perf_set_shadow_time(event, event->ctx);
 		perf_event_init_userpage(event);
 		perf_event_update_userpage(event);
 	} else {

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-20 12:19         ` Peter Zijlstra
@ 2021-12-21  5:54           ` Namhyung Kim
  2021-12-21  7:23           ` Song Liu
  2022-01-18 11:17           ` [tip: perf/urgent] perf: Fix perf_event_read_local() time tip-bot2 for Peter Zijlstra
  2 siblings, 0 replies; 24+ messages in thread
From: Namhyung Kim @ 2021-12-21  5:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa, Mark Rutland,
	Alexander Shishkin, LKML, Stephane Eranian, Andi Kleen,
	Ian Rogers, Song Liu

On Mon, Dec 20, 2021 at 4:20 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
>
> How's this then?
>
> ---
> Subject: perf: Fix perf_event_read_local() time
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon Dec 20 09:59:47 CET 2021
>
> Time readers that cannot take locks (due to NMI etc..) currently make
> use of perf_event::shadow_ctx_time, which, for that event gives:
>
>         time' = now + (time - timestamp)
>
> or, alternatively arranged:
>
>         time' = time + (now - timestamp)
>
> IOW, the progression of time since the last time the shadow_ctx_time
> was updated.
>
> There are problems with this:
>
>  A) the shadow_ctx_time is per-event, even though the ctx_time it
>     reflects is obviously per context. The direct consequence of this
>     is that the context needs to iterate all events all the time to
>     keep the shadow_ctx_time in sync.
>
>  B) even with the prior point, the context itself might not be active
>     meaning its time should not advance to begin with.
>
>  C) shadow_ctx_time isn't consistently updated when ctx_time is
>
> There are 3 users of this stuff, that suffer differently from this:
>
>  - calc_timer_values()
>    - perf_output_read()
>    - perf_event_update_userpage()       /* A */
>
>  - perf_event_read_local()              /* A,B */
>
> In particular, perf_output_read() doesn't suffer at all, because it's
> sample driven and hence only relevant when the event is actually
> running.
>
> The same was supposed to be true for perf_event_update_userpage(),
> after all self-monitoring implies the context is active *HOWEVER*, as
> per commit f79256532682 ("perf/core: fix userpage->time_enabled of
> inactive events") this goes wrong when combined with counter
> overcommit, in that case those events that do not get scheduled when
> the context becomes active (task events typically) miss out on the
> EVENT_TIME update and ENABLED time is inflated (for a little while)
> with the time the context was inactive. Once the event gets rotated
> in, this gets corrected, leading to a non-monotonic timeflow.
>
> perf_event_read_local() made things even worse, it can request time at
> any point, suffering all the problems perf_event_update_userpage()
> does and more. Because while perf_event_update_userpage() is limited
> by the context being active, perf_event_read_local() users have no
> such constraint.
>
> Therefore, completely overhaul things and do away with
> perf_event::shadow_ctx_time. Instead have regular context time updates
> keep track of this offset directly and provide perf_event_time_now()
> to complement perf_event_time().
>
> perf_event_time_now() will, in addition to being context wide, also
> take into account if the context is active. For inactive context, it
> will not advance time.
>
> This latter property means the cgroup perf_cgroup_info context needs
> to grow additional state to track this.
>
> Additionally, since all this is strictly per-cpu, we can use barrier()
> to order context activity vs context time.
>
> Fixes: 7d9285e82db5 ("perf/bpf: Extend the perf_event_read_local() interface, a.k.a. "bpf: perf event change needed for subsequent bpf helpers"")
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

It worked well for my test.

Tested-by: Namhyung Kim <namhyung@kernel.org>

Thanks,
Namhyung


[...]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-20 12:19         ` Peter Zijlstra
  2021-12-21  5:54           ` Namhyung Kim
@ 2021-12-21  7:23           ` Song Liu
  2021-12-21 11:21             ` Peter Zijlstra
  2022-01-18 11:17           ` [tip: perf/urgent] perf: Fix perf_event_read_local() time tip-bot2 for Peter Zijlstra
  2 siblings, 1 reply; 24+ messages in thread
From: Song Liu @ 2021-12-21  7:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Namhyung Kim, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Andi Kleen, Ian Rogers



> On Dec 20, 2021, at 4:19 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> 
> How's this then?
> 
> ---
> Subject: perf: Fix perf_event_read_local() time
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon Dec 20 09:59:47 CET 2021
> 
> Time readers that cannot take locks (due to NMI etc..) currently make
> use of perf_event::shadow_ctx_time, which, for that event gives:
> 
> 	time' = now + (time - timestamp)
> 
> or, alternatively arranged:
> 
> 	time' = time + (now - timestamp)
> 
> IOW, the progression of time since the last time the shadow_ctx_time
> was updated.
> 
> There are problems with this:
> 
> A) the shadow_ctx_time is per-event, even though the ctx_time it
>    reflects is obviously per context. The direct consequence of this
>    is that the context needs to iterate all events all the time to
>    keep the shadow_ctx_time in sync.
> 
> B) even with the prior point, the context itself might not be active
>    meaning its time should not advance to begin with.
> 
> C) shadow_ctx_time isn't consistently updated when ctx_time is
> 
> There are 3 users of this stuff, that suffer differently from this:
> 
> - calc_timer_values()
>   - perf_output_read()
>   - perf_event_update_userpage()	/* A */
> 
> - perf_event_read_local()		/* A,B */
> 
> In particular, perf_output_read() doesn't suffer at all, because it's
> sample driven and hence only relevant when the event is actually
> running.
> 
> The same was supposed to be true for perf_event_update_userpage(),
> after all self-monitoring implies the context is active *HOWEVER*, as
> per commit f79256532682 ("perf/core: fix userpage->time_enabled of
> inactive events") this goes wrong when combined with counter
> overcommit, in that case those events that do not get scheduled when
> the context becomes active (task events typically) miss out on the
> EVENT_TIME update and ENABLED time is inflated (for a little while)
> with the time the context was inactive. Once the event gets rotated
> in, this gets corrected, leading to a non-monotonic timeflow.
> 
> perf_event_read_local() made things even worse, it can request time at
> any point, suffering all the problems perf_event_update_userpage()
> does and more. Because while perf_event_update_userpage() is limited
> by the context being active, perf_event_read_local() users have no
> such constraint.
> 
> Therefore, completely overhaul things and do away with
> perf_event::shadow_ctx_time. Instead have regular context time updates
> keep track of this offset directly and provide perf_event_time_now()
> to complement perf_event_time().
> 
> perf_event_time_now() will, in addition to being context wide, also
> take into account if the context is active. For inactive context, it
> will not advance time.
> 
> This latter property means the cgroup perf_cgroup_info context needs
> to grow additional state to track this.
> 
> Additionally, since all this is strictly per-cpu, we can use barrier()
> to order context activity vs context time.
> 
> Fixes: 7d9285e82db5 ("perf/bpf: Extend the perf_event_read_local() interface, a.k.a. "bpf: perf event change needed for subsequent bpf helpers"")
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

I think we need a forward declaration of calc_timer_values?
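
Something along these lines above perf_event_read_local(), matching the
definition further down in core.c (a sketch of what that declaration would
look like):

	static void calc_timer_values(struct perf_event *event,
				      u64 *now, u64 *enabled, u64 *running);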

With that fixed, this passes the test for enabled time. 

Tested-by: Song Liu <song@kernel.org>

Thanks,
Song

> ---
> include/linux/perf_event.h |   15 ---
> kernel/events/core.c       |  224 +++++++++++++++++++++++++++------------------
> 2 files changed, 138 insertions(+), 101 deletions(-)
> 
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -683,18 +683,6 @@ struct perf_event {
> 	u64				total_time_running;
> 	u64				tstamp;
> 
> -	/*
> -	 * timestamp shadows the actual context timing but it can
> -	 * be safely used in NMI interrupt context. It reflects the
> -	 * context time as it was when the event was last scheduled in,
> -	 * or when ctx_sched_in failed to schedule the event because we
> -	 * run out of PMC.
> -	 *
> -	 * ctx_time already accounts for ctx->timestamp. Therefore to
> -	 * compute ctx_time for a sample, simply add perf_clock().
> -	 */
> -	u64				shadow_ctx_time;
> -
> 	struct perf_event_attr		attr;
> 	u16				header_size;
> 	u16				id_header_size;
> @@ -841,6 +829,7 @@ struct perf_event_context {
> 	 */
> 	u64				time;
> 	u64				timestamp;
> +	u64				timeoffset;
> 
> 	/*
> 	 * These fields let us detect when two contexts have both
> @@ -923,6 +912,8 @@ struct bpf_perf_event_data_kern {
> struct perf_cgroup_info {
> 	u64				time;
> 	u64				timestamp;
> +	u64				timeoffset;
> +	int				active;
> };
> 
> struct perf_cgroup {
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -674,6 +674,23 @@ perf_event_set_state(struct perf_event *
> 	WRITE_ONCE(event->state, state);
> }
> 
> +/*
> + * UP store-release, load-acquire
> + */
> +
> +#define __store_release(ptr, val)					\
> +do {									\
> +	barrier();							\
> +	WRITE_ONCE(*(ptr), (val));					\
> +} while (0)
> +
> +#define __load_acquire(ptr)						\
> +({									\
> +	__unqual_scalar_typeof(*(ptr)) ___p = READ_ONCE(*(ptr));	\
> +	barrier();							\
> +	___p;								\
> +})
> +
> #ifdef CONFIG_CGROUP_PERF
> 
> static inline bool
> @@ -719,34 +736,51 @@ static inline u64 perf_cgroup_event_time
> 	return t->time;
> }
> 
> -static inline void __update_cgrp_time(struct perf_cgroup *cgrp)
> +static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
> {
> -	struct perf_cgroup_info *info;
> -	u64 now;
> -
> -	now = perf_clock();
> +	struct perf_cgroup_info *t;
> 
> -	info = this_cpu_ptr(cgrp->info);
> +	t = per_cpu_ptr(event->cgrp->info, event->cpu);
> +	if (!__load_acquire(&t->active))
> +		return t->time;
> +	now += READ_ONCE(t->timeoffset);
> +	return now;
> +}
> 
> -	info->time += now - info->timestamp;
> +static inline void __update_cgrp_time(struct perf_cgroup_info *info, u64 now, bool adv)
> +{
> +	if (adv)
> +		info->time += now - info->timestamp;
> 	info->timestamp = now;
> +	/*
> +	 * see update_context_time()
> +	 */
> +	WRITE_ONCE(info->timeoffset, info->time - info->timestamp);
> }
> 
> -static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx)
> +static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx, bool final)
> {
> 	struct perf_cgroup *cgrp = cpuctx->cgrp;
> 	struct cgroup_subsys_state *css;
> +	struct perf_cgroup_info *info;
> 
> 	if (cgrp) {
> +		u64 now = perf_clock();
> +
> 		for (css = &cgrp->css; css; css = css->parent) {
> 			cgrp = container_of(css, struct perf_cgroup, css);
> -			__update_cgrp_time(cgrp);
> +			info = this_cpu_ptr(cgrp->info);
> +
> +			__update_cgrp_time(info, now, true);
> +			if (final)
> +				__store_release(&info->active, 0);
> 		}
> 	}
> }
> 
> static inline void update_cgrp_time_from_event(struct perf_event *event)
> {
> +	struct perf_cgroup_info *info;
> 	struct perf_cgroup *cgrp;
> 
> 	/*
> @@ -760,8 +794,10 @@ static inline void update_cgrp_time_from
> 	/*
> 	 * Do not update time when cgroup is not active
> 	 */
> -	if (cgroup_is_descendant(cgrp->css.cgroup, event->cgrp->css.cgroup))
> -		__update_cgrp_time(event->cgrp);
> +	if (cgroup_is_descendant(cgrp->css.cgroup, event->cgrp->css.cgroup)) {
> +		info = this_cpu_ptr(event->cgrp->info);
> +		__update_cgrp_time(info, perf_clock(), true);
> +	}
> }
> 
> static inline void
> @@ -785,7 +821,8 @@ perf_cgroup_set_timestamp(struct task_st
> 	for (css = &cgrp->css; css; css = css->parent) {
> 		cgrp = container_of(css, struct perf_cgroup, css);
> 		info = this_cpu_ptr(cgrp->info);
> -		info->timestamp = ctx->timestamp;
> +		__update_cgrp_time(info, ctx->timestamp, false);
> +		__store_release(&info->active, 1);
> 	}
> }
> 
> @@ -982,14 +1019,6 @@ static inline int perf_cgroup_connect(in
> }
> 
> static inline void
> -perf_cgroup_set_shadow_time(struct perf_event *event, u64 now)
> -{
> -	struct perf_cgroup_info *t;
> -	t = per_cpu_ptr(event->cgrp->info, event->cpu);
> -	event->shadow_ctx_time = now - t->timestamp;
> -}
> -
> -static inline void
> perf_cgroup_event_enable(struct perf_event *event, struct perf_event_context *ctx)
> {
> 	struct perf_cpu_context *cpuctx;
> @@ -1066,7 +1095,8 @@ static inline void update_cgrp_time_from
> {
> }
> 
> -static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx)
> +static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx,
> +						bool final)
> {
> }
> 
> @@ -1098,12 +1128,12 @@ perf_cgroup_switch(struct task_struct *t
> {
> }
> 
> -static inline void
> -perf_cgroup_set_shadow_time(struct perf_event *event, u64 now)
> +static inline u64 perf_cgroup_event_time(struct perf_event *event)
> {
> +	return 0;
> }
> 
> -static inline u64 perf_cgroup_event_time(struct perf_event *event)
> +static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
> {
> 	return 0;
> }
> @@ -1525,22 +1555,59 @@ static void perf_unpin_context(struct pe
> /*
>  * Update the record of the current time in a context.
>  */
> -static void update_context_time(struct perf_event_context *ctx)
> +static void __update_context_time(struct perf_event_context *ctx, bool adv)
> {
> 	u64 now = perf_clock();
> 
> -	ctx->time += now - ctx->timestamp;
> +	if (adv)
> +		ctx->time += now - ctx->timestamp;
> 	ctx->timestamp = now;
> +
> +	/*
> +	 * The above: time' = time + (now - timestamp), can be re-arranged
> +	 * into: time` = now + (time - timestamp), which gives a single value
> +	 * offset to compute future time without locks on.
> +	 *
> +	 * See perf_event_time_now(), which can be used from NMI context where
> +	 * it's (obviously) not possible to acquire ctx->lock in order to read
> +	 * both the above values in a consistent manner.
> +	 */
> +	WRITE_ONCE(ctx->timeoffset, ctx->time - ctx->timestamp);
> +}
> +
> +static void update_context_time(struct perf_event_context *ctx)
> +{
> +	__update_context_time(ctx, true);
> }
> 
> static u64 perf_event_time(struct perf_event *event)
> {
> 	struct perf_event_context *ctx = event->ctx;
> 
> +	if (unlikely(!ctx))
> +		return 0;
> +
> 	if (is_cgroup_event(event))
> 		return perf_cgroup_event_time(event);
> 
> -	return ctx ? ctx->time : 0;
> +	return ctx->time;
> +}
> +
> +static u64 perf_event_time_now(struct perf_event *event, u64 now)
> +{
> +	struct perf_event_context *ctx = event->ctx;
> +
> +	if (unlikely(!ctx))
> +		return 0;
> +
> +	if (is_cgroup_event(event))
> +		return perf_cgroup_event_time_now(event, now);
> +
> +	if (!(__load_acquire(&ctx->is_active) & EVENT_TIME))
> +		return ctx->time;
> +
> +	now += READ_ONCE(ctx->timeoffset);
> +	return now;
> }
> 
> static enum event_type_t get_event_type(struct perf_event *event)
> @@ -2346,7 +2413,7 @@ __perf_remove_from_context(struct perf_e
> 
> 	if (ctx->is_active & EVENT_TIME) {
> 		update_context_time(ctx);
> -		update_cgrp_time_from_cpuctx(cpuctx);
> +		update_cgrp_time_from_cpuctx(cpuctx, false);
> 	}
> 
> 	event_sched_out(event, cpuctx, ctx);
> @@ -2357,6 +2424,9 @@ __perf_remove_from_context(struct perf_e
> 	list_del_event(event, ctx);
> 
> 	if (!ctx->nr_events && ctx->is_active) {
> +		if (ctx == &cpuctx->ctx)
> +			update_cgrp_time_from_cpuctx(cpuctx, true);
> +
> 		ctx->is_active = 0;
> 		ctx->rotate_necessary = 0;
> 		if (ctx->task) {
> @@ -2478,40 +2548,6 @@ void perf_event_disable_inatomic(struct
> 	irq_work_queue(&event->pending);
> }
> 
> -static void perf_set_shadow_time(struct perf_event *event,
> -				 struct perf_event_context *ctx)
> -{
> -	/*
> -	 * use the correct time source for the time snapshot
> -	 *
> -	 * We could get by without this by leveraging the
> -	 * fact that to get to this function, the caller
> -	 * has most likely already called update_context_time()
> -	 * and update_cgrp_time_xx() and thus both timestamp
> -	 * are identical (or very close). Given that tstamp is,
> -	 * already adjusted for cgroup, we could say that:
> -	 *    tstamp - ctx->timestamp
> -	 * is equivalent to
> -	 *    tstamp - cgrp->timestamp.
> -	 *
> -	 * Then, in perf_output_read(), the calculation would
> -	 * work with no changes because:
> -	 * - event is guaranteed scheduled in
> -	 * - no scheduled out in between
> -	 * - thus the timestamp would be the same
> -	 *
> -	 * But this is a bit hairy.
> -	 *
> -	 * So instead, we have an explicit cgroup call to remain
> -	 * within the time source all along. We believe it
> -	 * is cleaner and simpler to understand.
> -	 */
> -	if (is_cgroup_event(event))
> -		perf_cgroup_set_shadow_time(event, event->tstamp);
> -	else
> -		event->shadow_ctx_time = event->tstamp - ctx->timestamp;
> -}
> -
> #define MAX_INTERRUPTS (~0ULL)
> 
> static void perf_log_throttle(struct perf_event *event, int enable);
> @@ -2552,8 +2588,6 @@ event_sched_in(struct perf_event *event,
> 
> 	perf_pmu_disable(event->pmu);
> 
> -	perf_set_shadow_time(event, ctx);
> -
> 	perf_log_itrace_start(event);
> 
> 	if (event->pmu->add(event, PERF_EF_START)) {
> @@ -3247,16 +3281,6 @@ static void ctx_sched_out(struct perf_ev
> 		return;
> 	}
> 
> -	ctx->is_active &= ~event_type;
> -	if (!(ctx->is_active & EVENT_ALL))
> -		ctx->is_active = 0;
> -
> -	if (ctx->task) {
> -		WARN_ON_ONCE(cpuctx->task_ctx != ctx);
> -		if (!ctx->is_active)
> -			cpuctx->task_ctx = NULL;
> -	}
> -
> 	/*
> 	 * Always update time if it was set; not only when it changes.
> 	 * Otherwise we can 'forget' to update time for any but the last
> @@ -3270,7 +3294,22 @@ static void ctx_sched_out(struct perf_ev
> 	if (is_active & EVENT_TIME) {
> 		/* update (and stop) ctx time */
> 		update_context_time(ctx);
> -		update_cgrp_time_from_cpuctx(cpuctx);
> +		update_cgrp_time_from_cpuctx(cpuctx, ctx == &cpuctx->ctx);
> +		/*
> +		 * CPU-release for the below ->is_active store,
> +		 * see __load_acquire() in perf_event_time_now()
> +		 */
> +		barrier();
> +	}
> +
> +	ctx->is_active &= ~event_type;
> +	if (!(ctx->is_active & EVENT_ALL))
> +		ctx->is_active = 0;
> +
> +	if (ctx->task) {
> +		WARN_ON_ONCE(cpuctx->task_ctx != ctx);
> +		if (!ctx->is_active)
> +			cpuctx->task_ctx = NULL;
> 	}
> 
> 	is_active ^= ctx->is_active; /* changed bits */
> @@ -3707,13 +3746,19 @@ static noinline int visit_groups_merge(s
> 	return 0;
> }
> 
> +/*
> + * Because the userpage is strictly per-event (there is no concept of context,
> + * so there cannot be a context indirection), every userpage must be updated
> + * when context time starts :-(
> + *
> + * IOW, we must not miss EVENT_TIME edges.
> + */
> static inline bool event_update_userpage(struct perf_event *event)
> {
> 	if (likely(!atomic_read(&event->mmap_count)))
> 		return false;
> 
> 	perf_event_update_time(event);
> -	perf_set_shadow_time(event, event->ctx);
> 	perf_event_update_userpage(event);
> 
> 	return true;
> @@ -3797,13 +3842,23 @@ ctx_sched_in(struct perf_event_context *
> 	     struct task_struct *task)
> {
> 	int is_active = ctx->is_active;
> -	u64 now;
> 
> 	lockdep_assert_held(&ctx->lock);
> 
> 	if (likely(!ctx->nr_events))
> 		return;
> 
> +	if (is_active ^ EVENT_TIME) {
> +		/* start ctx time */
> +		__update_context_time(ctx, false);
> +		perf_cgroup_set_timestamp(task, ctx);
> +		/*
> +		 * CPU-release for the below ->is_active store,
> +		 * see __load_acquire() in perf_event_time_now()
> +		 */
> +		barrier();
> +	}
> +
> 	ctx->is_active |= (event_type | EVENT_TIME);
> 	if (ctx->task) {
> 		if (!is_active)
> @@ -3814,13 +3869,6 @@ ctx_sched_in(struct perf_event_context *
> 
> 	is_active ^= ctx->is_active; /* changed bits */
> 
> -	if (is_active & EVENT_TIME) {
> -		/* start ctx time */
> -		now = perf_clock();
> -		ctx->timestamp = now;
> -		perf_cgroup_set_timestamp(task, ctx);
> -	}
> -
> 	/*
> 	 * First go through the list and put on any pinned groups
> 	 * in order to give them the best chance of going on.
> @@ -4473,10 +4521,9 @@ int perf_event_read_local(struct perf_ev
> 
> 	*value = local64_read(&event->count);
> 	if (enabled || running) {
> -		u64 now = event->shadow_ctx_time + perf_clock();
> -		u64 __enabled, __running;
> +		u64 __enabled, __running, __now;;
> 
> -		__perf_update_times(event, now, &__enabled, &__running);
> +		calc_timer_values(event, &__now, &__enabled, &__running);
> 		if (enabled)
> 			*enabled = __enabled;
> 		if (running)
> @@ -5806,7 +5853,7 @@ static void calc_timer_values(struct per
> 	u64 ctx_time;
> 
> 	*now = perf_clock();
> -	ctx_time = event->shadow_ctx_time + *now;
> +	ctx_time = perf_event_time_now(event, *now);
> 	__perf_update_times(event, ctx_time, enabled, running);
> }
> 
> @@ -6349,7 +6396,6 @@ static int perf_mmap(struct file *file,
> 		ring_buffer_attach(event, rb);
> 
> 		perf_event_update_time(event);
> -		perf_set_shadow_time(event, event->ctx);
> 		perf_event_init_userpage(event);
> 		perf_event_update_userpage(event);
> 	} else {


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-21  7:23           ` Song Liu
@ 2021-12-21 11:21             ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2021-12-21 11:21 UTC (permalink / raw)
  To: Song Liu
  Cc: Namhyung Kim, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Andi Kleen, Ian Rogers

On Tue, Dec 21, 2021 at 07:23:16AM +0000, Song Liu wrote:

> I think we need to forward the declaration of calc_timer_values?

Curse these last-minute cleanups :-) I've moved the whole function up.

> With that fixed, this passes the test for enabled time. 
> 
> Tested-by: Song Liu <song@kernel.org>

Thanks both!

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3] perf/core: Set event shadow time for inactive events too
  2021-12-20  9:30           ` Peter Zijlstra
  2021-12-20  9:54             ` Peter Zijlstra
@ 2021-12-21 12:39             ` Peter Zijlstra
  1 sibling, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2021-12-21 12:39 UTC (permalink / raw)
  To: Song Liu
  Cc: Namhyung Kim, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	Mark Rutland, Alexander Shishkin, LKML, Stephane Eranian,
	Andi Kleen, Ian Rogers, Vince Weaver

On Mon, Dec 20, 2021 at 10:30:34AM +0100, Peter Zijlstra wrote:

> Ohh.. argh. I think I see why: it loses the context time enable edge,
> and because this is all strictly per-event in the uapi (there is no ctx
> representation) it can't be cured by improving ctx time handling :/
> 
> Bah, I so hate this.
> 

I wonder... could we get away with something like this...


---
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2722,13 +2722,17 @@ void arch_perf_update_userpage(struct pe
 	struct cyc2ns_data data;
 	u64 offset;
 
-	userpg->cap_user_time = 0;
-	userpg->cap_user_time_zero = 0;
-	userpg->cap_user_rdpmc =
-		!!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED);
+	userpg->cap_user_rdpmc = !!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED);
 	userpg->pmc_width = x86_pmu.cntval_bits;
 
-	if (!using_native_sched_clock() || !sched_clock_stable())
+	if (unlikely(!using_native_sched_clock() || !sched_clock_stable())) {
+		userpg->cap_user_time = 0;
+		userpg->cap_user_time_zero = 0;
+		return;
+	}
+
+	/* already set the time fields before */
+	if (likely(userpg->cap_user_time))
 		return;
 
 	cyc2ns_read_begin(&data);
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -612,6 +612,7 @@ struct swevent_hlist {
 #define PERF_ATTACH_ITRACE	0x10
 #define PERF_ATTACH_SCHED_CB	0x20
 #define PERF_ATTACH_CHILD	0x40
+#define PERF_ATTACH_SELF	0x80
 
 struct perf_cgroup;
 struct perf_buffer;
@@ -812,6 +813,7 @@ struct perf_event_context {
 
 	int				nr_events;
 	int				nr_active;
+	int				nr_self;
 	int				is_active;
 	int				nr_stat;
 	int				nr_freq;
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1877,6 +1877,8 @@ list_add_event(struct perf_event *event,
 	ctx->nr_events++;
 	if (event->attr.inherit_stat)
 		ctx->nr_stat++;
+	if (event->attach_state & PERF_ATTACH_SELF)
+		ctx->nr_self++;
 
 	if (event->state > PERF_EVENT_STATE_OFF)
 		perf_cgroup_event_enable(event, ctx);
@@ -2068,6 +2070,8 @@ list_del_event(struct perf_event *event,
 	ctx->nr_events--;
 	if (event->attr.inherit_stat)
 		ctx->nr_stat--;
+	if (event->attach_state & PERF_ATTACH_SELF)
+		ctx->nr_self--;
 
 	list_del_rcu(&event->event_entry);
 
@@ -3755,7 +3759,8 @@ static noinline int visit_groups_merge(s
  */
 static inline bool event_update_userpage(struct perf_event *event)
 {
-	if (likely(!atomic_read(&event->mmap_count)))
+	if (likely(!atomic_read(&event->mmap_count) ||
+		   !(event->attach_state & PERF_ATTACH_SELF)))
 		return false;
 
 	perf_event_update_time(event);
@@ -3800,7 +3805,8 @@ static int merge_sched_in(struct perf_ev
 		} else {
 			ctx->rotate_necessary = 1;
 			perf_mux_hrtimer_restart(cpuctx);
-			group_update_userpage(event);
+			if (ctx->nr_self)
+				group_update_userpage(event);
 		}
 	}
 
@@ -5900,6 +5906,9 @@ void perf_event_update_userpage(struct p
 	if (!rb)
 		goto unlock;
 
+	if (!(event->attach_state & PERF_ATTACH_SELF))
+		goto unlock;
+
 	/*
 	 * compute total_time_enabled, total_time_running
 	 * based on snapshot values taken when the event
@@ -11613,6 +11622,8 @@ perf_event_alloc(struct perf_event_attr
 		 * pmu before we get a ctx.
 		 */
 		event->hw.target = get_task_struct(task);
+		if (event->hw.target == current && !attr->inherit)
+			event->attach_state |= PERF_ATTACH_SELF;
 	}
 
 	event->clock = &local_clock;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [tip: perf/urgent] perf: Fix perf_event_read_local() time
  2021-12-20 12:19         ` Peter Zijlstra
  2021-12-21  5:54           ` Namhyung Kim
  2021-12-21  7:23           ` Song Liu
@ 2022-01-18 11:17           ` tip-bot2 for Peter Zijlstra
  2 siblings, 0 replies; 24+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2022-01-18 11:17 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Song Liu, Namhyung Kim, x86, linux-kernel

The following commit has been merged into the perf/urgent branch of tip:

Commit-ID:     09f5e7dc7ad705289e1b1ec065439aa3c42951c4
Gitweb:        https://git.kernel.org/tip/09f5e7dc7ad705289e1b1ec065439aa3c42951c4
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Mon, 20 Dec 2021 13:19:52 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 18 Jan 2022 12:09:47 +01:00

perf: Fix perf_event_read_local() time

Time readers that cannot take locks (due to NMI etc.) currently make
use of perf_event::shadow_ctx_time, which, for that event, gives:

  time' = now + (time - timestamp)

or, alternatively arranged:

  time' = time + (now - timestamp)

IOW, the progression of time since the last time the shadow_ctx_time
was updated.

There are problems with this:

 A) the shadow_ctx_time is per-event, even though the ctx_time it
    reflects is obviously per context. The direct consequence of this
    is that the context needs to iterate all events all the time to
    keep the shadow_ctx_time in sync.

 B) even with the prior point, the context itself might not be active,
    meaning its time should not advance to begin with.

 C) shadow_ctx_time isn't consistently updated when ctx_time is

There are 3 users of this stuff that suffer differently from this:

 - calc_timer_values()
   - perf_output_read()
   - perf_event_update_userpage()	/* A */

 - perf_event_read_local()		/* A,B */

In particular, perf_output_read() doesn't suffer at all, because it's
sample-driven and hence only relevant when the event is actually
running.

The same was supposed to be true for perf_event_update_userpage();
after all, self-monitoring implies the context is active. *HOWEVER*,
as per commit f79256532682 ("perf/core: fix userpage->time_enabled of
inactive events") this goes wrong when combined with counter
overcommit: events that do not get scheduled when the context becomes
active (typically task events) miss out on the EVENT_TIME update and
their ENABLED time is inflated (for a little while) with the time the
context was inactive. Once the event gets rotated in, this gets
corrected, leading to a non-monotonic timeflow.

perf_event_read_local() made things even worse: it can request time at
any point and so suffers all the problems perf_event_update_userpage()
does, and more, because while perf_event_update_userpage() is limited
by the context being active, perf_event_read_local() users have no
such constraint.
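
Concretely, perf_event_read_local() used to compute the snapshot time
as (see the hunk removed below):

	u64 now = event->shadow_ctx_time + perf_clock();

so an event whose shadow_ctx_time was never set, or not kept in sync,
yields bogus ENABLED/RUNNING values derived from it.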

Therefore, completely overhaul things and do away with
perf_event::shadow_ctx_time. Instead have regular context time updates
keep track of this offset directly and provide perf_event_time_now()
to complement perf_event_time().
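
Roughly, each regular context time update now also publishes that
offset, along the lines of this simplified sketch (the real code is
__update_context_time() in the diff below):

	u64 now = perf_clock();

	ctx->time     += now - ctx->timestamp;
	ctx->timestamp = now;
	/* time' = now + (time - timestamp), readable without ctx->lock */
	WRITE_ONCE(ctx->timeoffset, ctx->time - ctx->timestamp);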

perf_event_time_now() will, in addition to being context-wide, also
take into account whether the context is active. For an inactive
context, it will not advance time.
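
An NMI-safe reader then boils down to (simplified from
perf_event_time_now() in the diff below):

	if (!(__load_acquire(&ctx->is_active) & EVENT_TIME))
		return ctx->time;			/* clock stopped */
	return now + READ_ONCE(ctx->timeoffset);	/* clock advancing */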

This latter property means the cgroup perf_cgroup_info needs to grow
additional state to track this.

Additionally, since all this is strictly per-cpu, we can use barrier()
to order context activity vs context time.

Fixes: 7d9285e82db5 ("perf/bpf: Extend the perf_event_read_local() interface, a.k.a. "bpf: perf event change needed for subsequent bpf helpers"")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Song Liu <song@kernel.org>
Tested-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lkml.kernel.org/r/YcB06DasOBtU0b00@hirez.programming.kicks-ass.net
---
 include/linux/perf_event.h |  15 +--
 kernel/events/core.c       | 246 +++++++++++++++++++++---------------
 2 files changed, 149 insertions(+), 112 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 117f230..7336491 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -693,18 +693,6 @@ struct perf_event {
 	u64				total_time_running;
 	u64				tstamp;
 
-	/*
-	 * timestamp shadows the actual context timing but it can
-	 * be safely used in NMI interrupt context. It reflects the
-	 * context time as it was when the event was last scheduled in,
-	 * or when ctx_sched_in failed to schedule the event because we
-	 * run out of PMC.
-	 *
-	 * ctx_time already accounts for ctx->timestamp. Therefore to
-	 * compute ctx_time for a sample, simply add perf_clock().
-	 */
-	u64				shadow_ctx_time;
-
 	struct perf_event_attr		attr;
 	u16				header_size;
 	u16				id_header_size;
@@ -852,6 +840,7 @@ struct perf_event_context {
 	 */
 	u64				time;
 	u64				timestamp;
+	u64				timeoffset;
 
 	/*
 	 * These fields let us detect when two contexts have both
@@ -934,6 +923,8 @@ struct bpf_perf_event_data_kern {
 struct perf_cgroup_info {
 	u64				time;
 	u64				timestamp;
+	u64				timeoffset;
+	int				active;
 };
 
 struct perf_cgroup {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index fc18664..479c9e6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -674,6 +674,23 @@ perf_event_set_state(struct perf_event *event, enum perf_event_state state)
 	WRITE_ONCE(event->state, state);
 }
 
+/*
+ * UP store-release, load-acquire
+ */
+
+#define __store_release(ptr, val)					\
+do {									\
+	barrier();							\
+	WRITE_ONCE(*(ptr), (val));					\
+} while (0)
+
+#define __load_acquire(ptr)						\
+({									\
+	__unqual_scalar_typeof(*(ptr)) ___p = READ_ONCE(*(ptr));	\
+	barrier();							\
+	___p;								\
+})
+
 #ifdef CONFIG_CGROUP_PERF
 
 static inline bool
@@ -719,34 +736,51 @@ static inline u64 perf_cgroup_event_time(struct perf_event *event)
 	return t->time;
 }
 
-static inline void __update_cgrp_time(struct perf_cgroup *cgrp)
+static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
 {
-	struct perf_cgroup_info *info;
-	u64 now;
-
-	now = perf_clock();
+	struct perf_cgroup_info *t;
 
-	info = this_cpu_ptr(cgrp->info);
+	t = per_cpu_ptr(event->cgrp->info, event->cpu);
+	if (!__load_acquire(&t->active))
+		return t->time;
+	now += READ_ONCE(t->timeoffset);
+	return now;
+}
 
-	info->time += now - info->timestamp;
+static inline void __update_cgrp_time(struct perf_cgroup_info *info, u64 now, bool adv)
+{
+	if (adv)
+		info->time += now - info->timestamp;
 	info->timestamp = now;
+	/*
+	 * see update_context_time()
+	 */
+	WRITE_ONCE(info->timeoffset, info->time - info->timestamp);
 }
 
-static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx)
+static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx, bool final)
 {
 	struct perf_cgroup *cgrp = cpuctx->cgrp;
 	struct cgroup_subsys_state *css;
+	struct perf_cgroup_info *info;
 
 	if (cgrp) {
+		u64 now = perf_clock();
+
 		for (css = &cgrp->css; css; css = css->parent) {
 			cgrp = container_of(css, struct perf_cgroup, css);
-			__update_cgrp_time(cgrp);
+			info = this_cpu_ptr(cgrp->info);
+
+			__update_cgrp_time(info, now, true);
+			if (final)
+				__store_release(&info->active, 0);
 		}
 	}
 }
 
 static inline void update_cgrp_time_from_event(struct perf_event *event)
 {
+	struct perf_cgroup_info *info;
 	struct perf_cgroup *cgrp;
 
 	/*
@@ -760,8 +794,10 @@ static inline void update_cgrp_time_from_event(struct perf_event *event)
 	/*
 	 * Do not update time when cgroup is not active
 	 */
-	if (cgroup_is_descendant(cgrp->css.cgroup, event->cgrp->css.cgroup))
-		__update_cgrp_time(event->cgrp);
+	if (cgroup_is_descendant(cgrp->css.cgroup, event->cgrp->css.cgroup)) {
+		info = this_cpu_ptr(event->cgrp->info);
+		__update_cgrp_time(info, perf_clock(), true);
+	}
 }
 
 static inline void
@@ -785,7 +821,8 @@ perf_cgroup_set_timestamp(struct task_struct *task,
 	for (css = &cgrp->css; css; css = css->parent) {
 		cgrp = container_of(css, struct perf_cgroup, css);
 		info = this_cpu_ptr(cgrp->info);
-		info->timestamp = ctx->timestamp;
+		__update_cgrp_time(info, ctx->timestamp, false);
+		__store_release(&info->active, 1);
 	}
 }
 
@@ -982,14 +1019,6 @@ out:
 }
 
 static inline void
-perf_cgroup_set_shadow_time(struct perf_event *event, u64 now)
-{
-	struct perf_cgroup_info *t;
-	t = per_cpu_ptr(event->cgrp->info, event->cpu);
-	event->shadow_ctx_time = now - t->timestamp;
-}
-
-static inline void
 perf_cgroup_event_enable(struct perf_event *event, struct perf_event_context *ctx)
 {
 	struct perf_cpu_context *cpuctx;
@@ -1066,7 +1095,8 @@ static inline void update_cgrp_time_from_event(struct perf_event *event)
 {
 }
 
-static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx)
+static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx,
+						bool final)
 {
 }
 
@@ -1098,12 +1128,12 @@ perf_cgroup_switch(struct task_struct *task, struct task_struct *next)
 {
 }
 
-static inline void
-perf_cgroup_set_shadow_time(struct perf_event *event, u64 now)
+static inline u64 perf_cgroup_event_time(struct perf_event *event)
 {
+	return 0;
 }
 
-static inline u64 perf_cgroup_event_time(struct perf_event *event)
+static inline u64 perf_cgroup_event_time_now(struct perf_event *event, u64 now)
 {
 	return 0;
 }
@@ -1525,22 +1555,59 @@ static void perf_unpin_context(struct perf_event_context *ctx)
 /*
  * Update the record of the current time in a context.
  */
-static void update_context_time(struct perf_event_context *ctx)
+static void __update_context_time(struct perf_event_context *ctx, bool adv)
 {
 	u64 now = perf_clock();
 
-	ctx->time += now - ctx->timestamp;
+	if (adv)
+		ctx->time += now - ctx->timestamp;
 	ctx->timestamp = now;
+
+	/*
+	 * The above: time' = time + (now - timestamp), can be re-arranged
+	 * into: time` = now + (time - timestamp), which gives a single value
+	 * offset to compute future time without locks on.
+	 *
+	 * See perf_event_time_now(), which can be used from NMI context where
+	 * it's (obviously) not possible to acquire ctx->lock in order to read
+	 * both the above values in a consistent manner.
+	 */
+	WRITE_ONCE(ctx->timeoffset, ctx->time - ctx->timestamp);
+}
+
+static void update_context_time(struct perf_event_context *ctx)
+{
+	__update_context_time(ctx, true);
 }
 
 static u64 perf_event_time(struct perf_event *event)
 {
 	struct perf_event_context *ctx = event->ctx;
 
+	if (unlikely(!ctx))
+		return 0;
+
 	if (is_cgroup_event(event))
 		return perf_cgroup_event_time(event);
 
-	return ctx ? ctx->time : 0;
+	return ctx->time;
+}
+
+static u64 perf_event_time_now(struct perf_event *event, u64 now)
+{
+	struct perf_event_context *ctx = event->ctx;
+
+	if (unlikely(!ctx))
+		return 0;
+
+	if (is_cgroup_event(event))
+		return perf_cgroup_event_time_now(event, now);
+
+	if (!(__load_acquire(&ctx->is_active) & EVENT_TIME))
+		return ctx->time;
+
+	now += READ_ONCE(ctx->timeoffset);
+	return now;
 }
 
 static enum event_type_t get_event_type(struct perf_event *event)
@@ -2350,7 +2417,7 @@ __perf_remove_from_context(struct perf_event *event,
 
 	if (ctx->is_active & EVENT_TIME) {
 		update_context_time(ctx);
-		update_cgrp_time_from_cpuctx(cpuctx);
+		update_cgrp_time_from_cpuctx(cpuctx, false);
 	}
 
 	event_sched_out(event, cpuctx, ctx);
@@ -2361,6 +2428,9 @@ __perf_remove_from_context(struct perf_event *event,
 	list_del_event(event, ctx);
 
 	if (!ctx->nr_events && ctx->is_active) {
+		if (ctx == &cpuctx->ctx)
+			update_cgrp_time_from_cpuctx(cpuctx, true);
+
 		ctx->is_active = 0;
 		ctx->rotate_necessary = 0;
 		if (ctx->task) {
@@ -2482,40 +2552,6 @@ void perf_event_disable_inatomic(struct perf_event *event)
 	irq_work_queue(&event->pending);
 }
 
-static void perf_set_shadow_time(struct perf_event *event,
-				 struct perf_event_context *ctx)
-{
-	/*
-	 * use the correct time source for the time snapshot
-	 *
-	 * We could get by without this by leveraging the
-	 * fact that to get to this function, the caller
-	 * has most likely already called update_context_time()
-	 * and update_cgrp_time_xx() and thus both timestamp
-	 * are identical (or very close). Given that tstamp is,
-	 * already adjusted for cgroup, we could say that:
-	 *    tstamp - ctx->timestamp
-	 * is equivalent to
-	 *    tstamp - cgrp->timestamp.
-	 *
-	 * Then, in perf_output_read(), the calculation would
-	 * work with no changes because:
-	 * - event is guaranteed scheduled in
-	 * - no scheduled out in between
-	 * - thus the timestamp would be the same
-	 *
-	 * But this is a bit hairy.
-	 *
-	 * So instead, we have an explicit cgroup call to remain
-	 * within the time source all along. We believe it
-	 * is cleaner and simpler to understand.
-	 */
-	if (is_cgroup_event(event))
-		perf_cgroup_set_shadow_time(event, event->tstamp);
-	else
-		event->shadow_ctx_time = event->tstamp - ctx->timestamp;
-}
-
 #define MAX_INTERRUPTS (~0ULL)
 
 static void perf_log_throttle(struct perf_event *event, int enable);
@@ -2556,8 +2592,6 @@ event_sched_in(struct perf_event *event,
 
 	perf_pmu_disable(event->pmu);
 
-	perf_set_shadow_time(event, ctx);
-
 	perf_log_itrace_start(event);
 
 	if (event->pmu->add(event, PERF_EF_START)) {
@@ -3251,16 +3285,6 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 		return;
 	}
 
-	ctx->is_active &= ~event_type;
-	if (!(ctx->is_active & EVENT_ALL))
-		ctx->is_active = 0;
-
-	if (ctx->task) {
-		WARN_ON_ONCE(cpuctx->task_ctx != ctx);
-		if (!ctx->is_active)
-			cpuctx->task_ctx = NULL;
-	}
-
 	/*
 	 * Always update time if it was set; not only when it changes.
 	 * Otherwise we can 'forget' to update time for any but the last
@@ -3274,7 +3298,22 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 	if (is_active & EVENT_TIME) {
 		/* update (and stop) ctx time */
 		update_context_time(ctx);
-		update_cgrp_time_from_cpuctx(cpuctx);
+		update_cgrp_time_from_cpuctx(cpuctx, ctx == &cpuctx->ctx);
+		/*
+		 * CPU-release for the below ->is_active store,
+		 * see __load_acquire() in perf_event_time_now()
+		 */
+		barrier();
+	}
+
+	ctx->is_active &= ~event_type;
+	if (!(ctx->is_active & EVENT_ALL))
+		ctx->is_active = 0;
+
+	if (ctx->task) {
+		WARN_ON_ONCE(cpuctx->task_ctx != ctx);
+		if (!ctx->is_active)
+			cpuctx->task_ctx = NULL;
 	}
 
 	is_active ^= ctx->is_active; /* changed bits */
@@ -3711,13 +3750,19 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 	return 0;
 }
 
+/*
+ * Because the userpage is strictly per-event (there is no concept of context,
+ * so there cannot be a context indirection), every userpage must be updated
+ * when context time starts :-(
+ *
+ * IOW, we must not miss EVENT_TIME edges.
+ */
 static inline bool event_update_userpage(struct perf_event *event)
 {
 	if (likely(!atomic_read(&event->mmap_count)))
 		return false;
 
 	perf_event_update_time(event);
-	perf_set_shadow_time(event, event->ctx);
 	perf_event_update_userpage(event);
 
 	return true;
@@ -3801,13 +3846,23 @@ ctx_sched_in(struct perf_event_context *ctx,
 	     struct task_struct *task)
 {
 	int is_active = ctx->is_active;
-	u64 now;
 
 	lockdep_assert_held(&ctx->lock);
 
 	if (likely(!ctx->nr_events))
 		return;
 
+	if (is_active ^ EVENT_TIME) {
+		/* start ctx time */
+		__update_context_time(ctx, false);
+		perf_cgroup_set_timestamp(task, ctx);
+		/*
+		 * CPU-release for the below ->is_active store,
+		 * see __load_acquire() in perf_event_time_now()
+		 */
+		barrier();
+	}
+
 	ctx->is_active |= (event_type | EVENT_TIME);
 	if (ctx->task) {
 		if (!is_active)
@@ -3818,13 +3873,6 @@ ctx_sched_in(struct perf_event_context *ctx,
 
 	is_active ^= ctx->is_active; /* changed bits */
 
-	if (is_active & EVENT_TIME) {
-		/* start ctx time */
-		now = perf_clock();
-		ctx->timestamp = now;
-		perf_cgroup_set_timestamp(task, ctx);
-	}
-
 	/*
 	 * First go through the list and put on any pinned groups
 	 * in order to give them the best chance of going on.
@@ -4418,6 +4466,18 @@ static inline u64 perf_event_count(struct perf_event *event)
 	return local64_read(&event->count) + atomic64_read(&event->child_count);
 }
 
+static void calc_timer_values(struct perf_event *event,
+				u64 *now,
+				u64 *enabled,
+				u64 *running)
+{
+	u64 ctx_time;
+
+	*now = perf_clock();
+	ctx_time = perf_event_time_now(event, *now);
+	__perf_update_times(event, ctx_time, enabled, running);
+}
+
 /*
  * NMI-safe method to read a local event, that is an event that
  * is:
@@ -4477,10 +4537,9 @@ int perf_event_read_local(struct perf_event *event, u64 *value,
 
 	*value = local64_read(&event->count);
 	if (enabled || running) {
-		u64 now = event->shadow_ctx_time + perf_clock();
-		u64 __enabled, __running;
+		u64 __enabled, __running, __now;;
 
-		__perf_update_times(event, now, &__enabled, &__running);
+		calc_timer_values(event, &__now, &__enabled, &__running);
 		if (enabled)
 			*enabled = __enabled;
 		if (running)
@@ -5802,18 +5861,6 @@ static int perf_event_index(struct perf_event *event)
 	return event->pmu->event_idx(event);
 }
 
-static void calc_timer_values(struct perf_event *event,
-				u64 *now,
-				u64 *enabled,
-				u64 *running)
-{
-	u64 ctx_time;
-
-	*now = perf_clock();
-	ctx_time = event->shadow_ctx_time + *now;
-	__perf_update_times(event, ctx_time, enabled, running);
-}
-
 static void perf_event_init_userpage(struct perf_event *event)
 {
 	struct perf_event_mmap_page *userpg;
@@ -6353,7 +6400,6 @@ accounting:
 		ring_buffer_attach(event, rb);
 
 		perf_event_update_time(event);
-		perf_set_shadow_time(event, event->ctx);
 		perf_event_init_userpage(event);
 		perf_event_update_userpage(event);
 	} else {

^ permalink raw reply related	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2022-01-18 11:18 UTC | newest]

Thread overview: 24+ messages
2021-12-05 22:48 [PATCH v3] perf/core: Set event shadow time for inactive events too Namhyung Kim
2021-12-06 23:11 ` Song Liu
2021-12-08 23:22 ` Peter Zijlstra
2021-12-09  5:52   ` Namhyung Kim
2021-12-09  8:21     ` Peter Zijlstra
2021-12-09 11:26 ` Peter Zijlstra
2021-12-09 11:34   ` Peter Zijlstra
2021-12-09 21:51     ` Namhyung Kim
2021-12-10 10:19       ` Peter Zijlstra
2021-12-10 18:59         ` Namhyung Kim
2021-12-20  9:39           ` Peter Zijlstra
2021-12-09 21:35   ` Namhyung Kim
2021-12-10 10:33     ` Peter Zijlstra
2021-12-10 23:19       ` Namhyung Kim
2021-12-17 16:35       ` Peter Zijlstra
2021-12-18  9:09         ` Song Liu
2021-12-20  9:30           ` Peter Zijlstra
2021-12-20  9:54             ` Peter Zijlstra
2021-12-21 12:39             ` Peter Zijlstra
2021-12-20 12:19         ` Peter Zijlstra
2021-12-21  5:54           ` Namhyung Kim
2021-12-21  7:23           ` Song Liu
2021-12-21 11:21             ` Peter Zijlstra
2022-01-18 11:17           ` [tip: perf/urgent] perf: Fix perf_event_read_local() time tip-bot2 for Peter Zijlstra
