* [PATCH] perf/core: fix userpage->time_enabled of inactive events
@ 2021-09-22  1:17 Song Liu
From: Song Liu @ 2021-09-22  1:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: acme, peterz, mingo, kernel-team, eranian, Song Liu, Lucian Grijincu

Users of rdpmc rely on the mmapped user page to calculate accurate
time_enabled. Currently, userpage->time_enabled is only updated when the
event is added to the PMU. As a result, an inactive event (scheduled out
due to counter multiplexing) does not have an accurate
userpage->time_enabled. This can be reproduced with something like:

   /* open 20 task perf_event "cycles", to create multiplexing */

   fd = perf_event_open();  /* open task perf_event "cycles" */
   userpage = mmap(fd);     /* use mmap and rdpmc */

   while (true) {
     time_enabled_mmap = xxx; /* use logic in perf_event_mmap_page */
     time_enabled_read = read(fd).time_enabled;
     if (time_enabled_mmap > time_enabled_read)
         BUG();
   }
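
For reference, the "xxx" above stands for the usual mmap-page read
sequence. A minimal sketch of that userspace logic, following the
algorithm documented in include/uapi/linux/perf_event.h (rdtsc() is an
assumed helper wrapping the RDTSC instruction), might look like:

#include <stdint.h>
#include <linux/perf_event.h>

/* Lock-free read of time_enabled from the mmapped user page. The
 * seqlock-style retry loop and the TSC extrapolation follow the
 * algorithm documented in include/uapi/linux/perf_event.h. */
static uint64_t read_time_enabled(volatile struct perf_event_mmap_page *pc)
{
	uint32_t seq;
	uint64_t enabled, running, cyc, quot, rem, delta;

	do {
		seq = pc->lock;
		__sync_synchronize();		/* barrier() */

		enabled = pc->time_enabled;
		running = pc->time_running;

		if (pc->cap_user_time && enabled != running) {
			/* The event has not been running the whole time;
			 * extrapolate time_enabled up to "now" via TSC. */
			cyc   = rdtsc();
			quot  = cyc >> pc->time_shift;
			rem   = cyc & (((uint64_t)1 << pc->time_shift) - 1);
			delta = pc->time_offset + quot * pc->time_mult +
				((rem * pc->time_mult) >> pc->time_shift);
			enabled += delta;
		}

		__sync_synchronize();
	} while (pc->lock != seq);

	return enabled;
}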

Fix this by updating the userpage for inactive events in ctx_sched_in().

Reported-and-tested-by: Lucian Grijincu <lucian@fb.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/perf_event.h |  4 +++-
 kernel/events/core.c       | 26 ++++++++++++++++++++++++++
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2d510ad750edc..4aa52f7a48c16 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -683,7 +683,9 @@ struct perf_event {
 	/*
 	 * timestamp shadows the actual context timing but it can
 	 * be safely used in NMI interrupt context. It reflects the
-	 * context time as it was when the event was last scheduled in.
+	 * context time as it was when the event was last scheduled in,
+	 * or when ctx_sched_in failed to schedule the event because we
+	 * ran out of PMCs.
 	 *
 	 * ctx_time already accounts for ctx->timestamp. Therefore to
 	 * compute ctx_time for a sample, simply add perf_clock().
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1cb1f9b8392e2..549ce22df7bc7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3766,6 +3766,15 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
 			   merge_sched_in, &can_add_hw);
 }
 
+static inline void
+perf_event_update_inactive_userpage(struct perf_event *event,
+				    struct perf_event_context *ctx)
+{
+	perf_event_update_time(event);
+	perf_set_shadow_time(event, ctx);
+	perf_event_update_userpage(event);
+}
+
 static void
 ctx_sched_in(struct perf_event_context *ctx,
 	     struct perf_cpu_context *cpuctx,
@@ -3807,6 +3816,23 @@ ctx_sched_in(struct perf_event_context *ctx,
 	/* Then walk through the lower prio flexible groups */
 	if (is_active & EVENT_FLEXIBLE)
 		ctx_flexible_sched_in(ctx, cpuctx);
+
+	/*
+	 * Update userpage for inactive events. This is needed for accurate
+	 * time_enabled.
+	 */
+	if (unlikely(ctx->rotate_necessary)) {
+		struct perf_event *event;
+
+		perf_event_groups_for_each(event, &ctx->pinned_groups) {
+			if (event->state == PERF_EVENT_STATE_INACTIVE)
+				perf_event_update_inactive_userpage(event, ctx);
+		}
+		perf_event_groups_for_each(event, &ctx->flexible_groups) {
+			if (event->state == PERF_EVENT_STATE_INACTIVE)
+				perf_event_update_inactive_userpage(event, ctx);
+		}
+	}
 }
 
 static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
-- 
2.30.2



* Re: [PATCH] perf/core: fix userpage->time_enabled of inactive events
From: Song Liu @ 2021-09-23  6:42 UTC (permalink / raw)
  To: open list, Peter Zijlstra
  Cc: Arnaldo Carvalho de Melo, Ingo Molnar, Kernel Team,
	Stephane Eranian, Lucian Grijincu

Hi Peter, 

> On Sep 21, 2021, at 6:17 PM, Song Liu <songliubraving@fb.com> wrote:
> 
> Users of rdpmc rely on the mmapped user page to calculate accurate
> time_enabled. Currently, userpage->time_enabled is only updated when the
> event is added to the PMU. As a result, an inactive event (scheduled out
> due to counter multiplexing) does not have an accurate
> userpage->time_enabled. This can be reproduced with something like:
> 
>   /* open 20 task perf_event "cycles", to create multiplexing */
> 
>   fd = perf_event_open();  /* open task perf_event "cycles" */
>   userpage = mmap(fd);     /* use mmap and rdpmc */
> 
>   while (true) {
>     time_enabled_mmap = xxx; /* use logic in perf_event_mmap_page */
>     time_enabled_read = read(fd).time_enabled;
>     if (time_enabled_mmap > time_enabled_read)
>         BUG();
>   }
> 
> Fix this by updating the userpage for inactive events in ctx_sched_in().
> 
> Reported-and-tested-by: Lucian Grijincu <lucian@fb.com>
> Signed-off-by: Song Liu <songliubraving@fb.com>

Could you please share your thoughts on this? It works well in our 
tests, but we would like to know your opinion before shipping it to
production. 

Thanks,
Song


> ---
> include/linux/perf_event.h |  4 +++-
> kernel/events/core.c       | 26 ++++++++++++++++++++++++++
> 2 files changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 2d510ad750edc..4aa52f7a48c16 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -683,7 +683,9 @@ struct perf_event {
> 	/*
> 	 * timestamp shadows the actual context timing but it can
> 	 * be safely used in NMI interrupt context. It reflects the
> -	 * context time as it was when the event was last scheduled in.
> +	 * context time as it was when the event was last scheduled in,
> +	 * or when ctx_sched_in failed to schedule the event because we
> +	 * ran out of PMCs.
> 	 *
> 	 * ctx_time already accounts for ctx->timestamp. Therefore to
> 	 * compute ctx_time for a sample, simply add perf_clock().
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 1cb1f9b8392e2..549ce22df7bc7 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3766,6 +3766,15 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
> 			   merge_sched_in, &can_add_hw);
> }
> 
> +static inline void
> +perf_event_update_inactive_userpage(struct perf_event *event,
> +				    struct perf_event_context *ctx)
> +{
> +	perf_event_update_time(event);
> +	perf_set_shadow_time(event, ctx);
> +	perf_event_update_userpage(event);
> +}
> +
> static void
> ctx_sched_in(struct perf_event_context *ctx,
> 	     struct perf_cpu_context *cpuctx,
> @@ -3807,6 +3816,23 @@ ctx_sched_in(struct perf_event_context *ctx,
> 	/* Then walk through the lower prio flexible groups */
> 	if (is_active & EVENT_FLEXIBLE)
> 		ctx_flexible_sched_in(ctx, cpuctx);
> +
> +	/*
> +	 * Update userpage for inactive events. This is needed for accurate
> +	 * time_enabled.
> +	 */
> +	if (unlikely(ctx->rotate_necessary)) {
> +		struct perf_event *event;
> +
> +		perf_event_groups_for_each(event, &ctx->pinned_groups) {
> +			if (event->state == PERF_EVENT_STATE_INACTIVE)
> +				perf_event_update_inactive_userpage(event, ctx);
> +		}
> +		perf_event_groups_for_each(event, &ctx->flexible_groups) {
> +			if (event->state == PERF_EVENT_STATE_INACTIVE)
> +				perf_event_update_inactive_userpage(event, ctx);
> +		}
> +	}
> }
> 
> static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
> -- 
> 2.30.2
> 



* Re: [PATCH] perf/core: fix userpage->time_enabled of inactive events
From: Peter Zijlstra @ 2021-09-23 10:08 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-kernel, acme, mingo, kernel-team, eranian, Lucian Grijincu

On Tue, Sep 21, 2021 at 06:17:15PM -0700, Song Liu wrote:
> Users of rdpmc rely on the mmapped user page to calculate accurate
> time_enabled. Currently, userpage->time_enabled is only updated when the
> event is added to the PMU. As a result, an inactive event (scheduled out
> due to counter multiplexing) does not have an accurate
> userpage->time_enabled. This can be reproduced with something like:
> 
>    /* open 20 task perf_event "cycles", to create multiplexing */
> 
>    fd = perf_event_open();  /* open task perf_event "cycles" */
>    userpage = mmap(fd);     /* use mmap and rdpmc */
> 
>    while (true) {
>      time_enabled_mmap = xxx; /* use logic in perf_event_mmap_page */
>      time_enabled_read = read(fd).time_enabled;
>      if (time_enabled_mmap > time_enabled_read)
>          BUG();
>    }

*groan*, yes I fear you're right.

> @@ -3807,6 +3816,23 @@ ctx_sched_in(struct perf_event_context *ctx,
>  	/* Then walk through the lower prio flexible groups */
>  	if (is_active & EVENT_FLEXIBLE)
>  		ctx_flexible_sched_in(ctx, cpuctx);
> +
> +	/*
> +	 * Update userpage for inactive events. This is needed for accurate
> +	 * time_enabled.
> +	 */
> +	if (unlikely(ctx->rotate_necessary)) {
> +		struct perf_event *event;
> +
> +		perf_event_groups_for_each(event, &ctx->pinned_groups) {
> +			if (event->state == PERF_EVENT_STATE_INACTIVE)
> +				perf_event_update_inactive_userpage(event, ctx);
> +		}

That's a straight-up error: if a pinned event doesn't get scheduled,
it's game over.

> +		perf_event_groups_for_each(event, &ctx->flexible_groups) {
> +			if (event->state == PERF_EVENT_STATE_INACTIVE)
> +				perf_event_update_inactive_userpage(event, ctx);
> +		}
> +	}
>  }

That's terrible though, and also wrong I think.

It's wrong because:

 - you should only do this for (is_active & EVENT_TIME)
 - you should only consider the events visit_groups_merge() would have visited
 - you miss updating group-sibling events

(I also think it's possible to try harder to avoid the work)

Now, looking at visit_groups_merge() we only terminate the iteration
when func returns !0, which is merge_sched_in(), and that *never*
returns !0.

So we already iterate all the right events. So I'm thinking we can do
something like the below, hmm?


diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0c000cb01eeb..4d1e962c2ebe 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3707,6 +3712,28 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 	return 0;
 }
 
+static inline bool event_update_userpage(struct perf_event *event)
+{
+	if (!atomic_read(&event->mmap_count))
+		return false;
+
+	perf_event_update_time(event);
+	perf_set_shadow_time(event, event->ctx);
+	perf_event_update_userpage(event);
+	return true;
+}
+
+static inline void group_update_userpage(struct perf_event *group_event)
+{
+	struct perf_event *event;
+
+	if (!event_update_userpage(group_event))
+		return;
+
+	for_each_sibling_event(event, group_event)
+		event_update_userpage(event);
+}
+
 static int merge_sched_in(struct perf_event *event, void *data)
 {
 	struct perf_event_context *ctx = event->ctx;
@@ -3725,14 +3752,18 @@ static int merge_sched_in(struct perf_event *event, void *data)
 	}
 
 	if (event->state == PERF_EVENT_STATE_INACTIVE) {
+		*can_add_hw = 0;
+
 		if (event->attr.pinned) {
 			perf_cgroup_event_disable(event, ctx);
 			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
-		}
 
-		*can_add_hw = 0;
-		ctx->rotate_necessary = 1;
-		perf_mux_hrtimer_restart(cpuctx);
+		} else {
+			ctx->rotate_necessary = 1;
+			perf_mux_hrtimer_restart(cpuctx);
+
+			group_update_userpage(event);
+		}
 	}
 
 	return 0;


* Re: [PATCH] perf/core: fix userpage->time_enabled of inactive events
From: Peter Zijlstra @ 2021-09-23 10:13 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-kernel, acme, mingo, kernel-team, eranian, Lucian Grijincu

On Thu, Sep 23, 2021 at 12:08:50PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 21, 2021 at 06:17:15PM -0700, Song Liu wrote:

> It's wrong because:
> 
>  - you should only do this for (is_active & EVENT_TIME)

Note I also didn't do that :/ I actually did at first, but then figured
it wasn't worth the effort/mess it creates.


* Re: [PATCH] perf/core: fix userpage->time_enabled of inactive events
From: Song Liu @ 2021-09-23 17:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: open list, Arnaldo Carvalho de Melo, Ingo Molnar, Kernel Team,
	Stephane Eranian, Lucian Grijincu



> On Sep 23, 2021, at 3:08 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Tue, Sep 21, 2021 at 06:17:15PM -0700, Song Liu wrote:
>> Users of rdpmc rely on the mmapped user page to calculate accurate
>> time_enabled. Currently, userpage->time_enabled is only updated when the
>> event is added to the PMU. As a result, an inactive event (scheduled out
>> due to counter multiplexing) does not have an accurate
>> userpage->time_enabled. This can be reproduced with something like:
>> 
>>   /* open 20 task perf_event "cycles", to create multiplexing */
>> 
>>   fd = perf_event_open();  /* open task perf_event "cycles" */
>>   userpage = mmap(fd);     /* use mmap and rdpmc */
>> 
>>   while (true) {
>>     time_enabled_mmap = xxx; /* use logic in perf_event_mmap_page */
>>     time_enabled_read = read(fd).time_enabled;
>>     if (time_enabled_mmap > time_enabled_read)
>>         BUG();
>>   }
> 
> *groan*, yes I fear you're right.
> 
>> @@ -3807,6 +3816,23 @@ ctx_sched_in(struct perf_event_context *ctx,
>> 	/* Then walk through the lower prio flexible groups */
>> 	if (is_active & EVENT_FLEXIBLE)
>> 		ctx_flexible_sched_in(ctx, cpuctx);
>> +
>> +	/*
>> +	 * Update userpage for inactive events. This is needed for accurate
>> +	 * time_enabled.
>> +	 */
>> +	if (unlikely(ctx->rotate_necessary)) {
>> +		struct perf_event *event;
>> +
>> +		perf_event_groups_for_each(event, &ctx->pinned_groups) {
>> +			if (event->state == PERF_EVENT_STATE_INACTIVE)
>> +				perf_event_update_inactive_userpage(event, ctx);
>> +		}
> 
> That's a straight-up error: if a pinned event doesn't get scheduled,
> it's game over.
> 
>> +		perf_event_groups_for_each(event, &ctx->flexible_groups) {
>> +			if (event->state == PERF_EVENT_STATE_INACTIVE)
>> +				perf_event_update_inactive_userpage(event, ctx);
>> +		}
>> +	}
>> }
> 
> That's terrible though, and also wrong I think.
> 
> It's wrong because:
> 
> - you should only do this for (is_active & EVENT_TIME)
> - you should only consider the events visit_groups_merge() would have visited
> - you miss updating group-sibling events
> 
> (I also think it's possible to try harder to avoid the work)
> 
> Now, looking at visit_groups_merge() we only terminate the iteration
> when func returns !0, which is merge_sched_in(), and that *never*
> returns !0.
> 
> So we already iterate all the right events. So I'm thinking we can do
> something like the below, hmm?
> 
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 0c000cb01eeb..4d1e962c2ebe 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3707,6 +3712,28 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
> 	return 0;
> }
> 
> +static inline bool event_update_userpage(struct perf_event *event)
> +{
> +	if (!atomic_read(&event->mmap_count))
> +		return false;

Technically, the user could mmap a sibling but not the group leader, right?
It is weird though. 

There is also a corner case we didn't cover: if the user enables the event
before mmapping it, the event is first scheduled in via:

__perf_event_enable (or __perf_install_in_context)
   -> ctx_resched
      -> perf_event_sched_in
          -> ctx_sched_in

but it doesn't have an mmap_count yet, and we won't call
perf_event_update_userpage() for it until the first rotation after the
mmap. As a result, the user page will contain garbage data before the
first rotation.

I haven't figured out a good fix for this. Maybe we should just require the
user to mmap the event before enabling it? If not, the user should expect
some garbage data before the first rotation.
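
A minimal sketch of that problematic ordering (hypothetical reproducer;
error handling elided, and perf_event_open is issued via syscall() since
glibc has no wrapper):

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	struct perf_event_attr attr = {
		.type     = PERF_TYPE_HARDWARE,
		.size     = sizeof(attr),
		.config   = PERF_COUNT_HW_CPU_CYCLES,
		.disabled = 1,
	};
	int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);

	/* Enable before mmap: the event can be scheduled in (and out again,
	 * under multiplexing) while no user page exists, so nothing updates
	 * userpage->time_enabled. */
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

	/* mmap afterwards: until the first rotation after this point, the
	 * user page may hold stale/garbage timing data. */
	struct perf_event_mmap_page *pc =
		mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);

	(void)pc;
	return 0;
}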

Other than this corner case, this version works well in my tests. Shall I
send a v2 based on this version?

Thanks,
Song




> +
> +	perf_event_update_time(event);
> +	perf_set_shadow_time(event, event->ctx);
> +	perf_event_update_userpage(event);
> +	return true;
> +}
> +
> +static inline void group_update_userpage(struct perf_event *group_event)
> +{
> +	struct perf_event *event;
> +
> +	if (!event_update_userpage(group_event))
> +		return;
> +
> +	for_each_sibling_event(event, group_event)
> +		event_update_userpage(event);
> +}
> +
> static int merge_sched_in(struct perf_event *event, void *data)
> {
> 	struct perf_event_context *ctx = event->ctx;
> @@ -3725,14 +3752,18 @@ static int merge_sched_in(struct perf_event *event, void *data)
> 	}
> 
> 	if (event->state == PERF_EVENT_STATE_INACTIVE) {
> +		*can_add_hw = 0;
> +
> 		if (event->attr.pinned) {
> 			perf_cgroup_event_disable(event, ctx);
> 			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
> -		}
> 
> -		*can_add_hw = 0;
> -		ctx->rotate_necessary = 1;
> -		perf_mux_hrtimer_restart(cpuctx);
> +		} else {
> +			ctx->rotate_necessary = 1;
> +			perf_mux_hrtimer_restart(cpuctx);
> +
> +			group_update_userpage(event);
> +		}
> 	}
> 
> 	return 0;




* Re: [PATCH] perf/core: fix userpage->time_enabled of inactive events
From: Peter Zijlstra @ 2021-09-24 13:21 UTC (permalink / raw)
  To: Song Liu
  Cc: open list, Arnaldo Carvalho de Melo, Ingo Molnar, Kernel Team,
	Stephane Eranian, Lucian Grijincu

On Thu, Sep 23, 2021 at 05:47:40PM +0000, Song Liu wrote:

> > 
> > So we already iterate all the right events. So I'm thinking we can do
> > something like the below, hmm?
> > 
> > 
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 0c000cb01eeb..4d1e962c2ebe 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -3707,6 +3712,28 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
> > 	return 0;
> > }
> > 
> > +static inline bool event_update_userpage(struct perf_event *event)
> > +{
> > +	if (!atomic_read(&event->mmap_count))
> > +		return false;
> 
> Technically, the user could mmap a sibling but not the group leader, right?
> It is weird though. 

Yeah, I went with the same logic that a disabled leader disables the
whole group; I suppose we can revisit if anybody ever gets a sane
use-case for this.

> There is also a corner case we didn't cover: if the user enables the event
> before mmapping it, the event is first scheduled in via:
> 
> __perf_event_enable (or __perf_install_in_context)
>    -> ctx_resched
>       -> perf_event_sched_in
>           -> ctx_sched_in
> 
> but it doesn't have an mmap_count yet, and we won't call
> perf_event_update_userpage() for it until the first rotation after the
> mmap. As a result, the user page will contain garbage data before the
> first rotation.

That was already a problem, because without the mmap, the event->rb will
be NULL and perf_event_update_userpage() will not actually do anything.

Should not the mmap() itself update the state before populating those
fields for the first time?
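
Read purely as an illustration (a sketch reusing the helpers from the diff
above; locking and the exact placement in the mmap path are elided, and
this is not necessarily what the eventual patch will look like):

/*
 * Hypothetical: have perf_mmap() seed the freshly created user page with
 * the current timing state, so an already-enabled event does not expose
 * garbage before the first rotation. Assumes the event's ctx lock is
 * held by the caller.
 */
static void perf_mmap_seed_userpage(struct perf_event *event)
{
	perf_event_update_time(event);
	perf_set_shadow_time(event, event->ctx);
	perf_event_update_userpage(event);
}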

> Other than this corner case, this version works well in my tests. Shall I
> send a v2 based on this version?

Please,

