perf/core: fix multiplexing event scheduling issue

Message ID 20191018002746.149200-1-eranian@google.com
State New
Series
  • perf/core: fix multiplexing event scheduling issue

Commit Message

Stephane Eranian Oct. 18, 2019, 12:27 a.m. UTC
This patch complements the following commit:
7fa343b7fdc4 ("perf/core: Fix corner case in perf_rotate_context()")

The fix from Song addresses the consequences of the problem but
not the cause. This patch fixes the causes and can sit on top of
Song's patch.

This patch fixes a scheduling problem in the core functions of
perf_events. Under certain conditions, some events would not be
scheduled even though many counters would be available. This
is related to multiplexing and is architecture agnostic and
PMU agnostic (i.e., core or uncore).

This problem can easily be reproduced when you have two perf
stat sessions. The first session does not cause multiplexing,
let's say it is measuring 1 event, E1. While it is measuring,
a second session starts and causes multiplexing. Let's say it
adds 6 events, B1-B6. Now, 7 events compete and are multiplexed.
When the second session terminates, all 6 (B1-B6) events are
removed. Normally, you'd expect the E1 event to continue to run
with no multiplexing. However, the problem is that depending on
the state of E1 when B1-B6 are removed, it may never be scheduled
again. If E1 was inactive at the time of removal, despite the
multiplexing hrtimer still firing, it will not find any active
events and will not try to reschedule. This is what Song's patch
fixes. It forces the multiplexing code to consider non-active events.
However, the cause is not addressed. The kernel should not rely on
the multiplexing hrtimer to unblock inactive events. That timer
can have an arbitrary duration in the milliseconds. Until the timer
fires, counters are available, but no measurable events are using
them. We do not want to introduce blind spots of arbitrary durations.

This patch addresses the cause of the problem by checking that,
when an event is disabled or removed and the context was multiplexing
events, inactive events immediately get a chance to be scheduled by
calling ctx_resched(). The rescheduling is done for events of equal
or lower priority types. With that in place, as soon as a counter
is freed, schedulable inactive events may run, thereby eliminating
a blind spot.

This can be illustrated as follows using Skylake uncore CHA here:

$ perf stat --no-merge -a -I 1000 -C 28 -e uncore_cha_0/event=0x0/
    42.007856531      2,000,291,322      uncore_cha_0/event=0x0/
    43.008062166      2,000,399,526      uncore_cha_0/event=0x0/
    44.008293244      2,000,473,720      uncore_cha_0/event=0x0/
    45.008501847      2,000,423,420      uncore_cha_0/event=0x0/
    46.008706558      2,000,411,132      uncore_cha_0/event=0x0/
    47.008928543      2,000,441,660      uncore_cha_0/event=0x0/

Adding a second session with 4 events for 4s

$ perf stat -a -I 1000 -C 28 --no-merge -e uncore_cha_0/event=0x0/,uncore_cha_0/event=0x0/,uncore_cha_0/event=0x0/,uncore_cha_0/event=0x0/ sleep 4
    48.009114643      1,983,129,830      uncore_cha_0/event=0x0/                                       (99.71%)
    49.009279616      1,976,067,751      uncore_cha_0/event=0x0/                                       (99.30%)
    50.009428660      1,974,448,006      uncore_cha_0/event=0x0/                                       (98.92%)
    51.009524309      1,973,083,237      uncore_cha_0/event=0x0/                                       (98.55%)
    52.009673467      1,972,097,678      uncore_cha_0/event=0x0/                                       (97.11%)

End of 4s, second session is removed. But the first event does not schedule and never will
unless new events force multiplexing again.

    53.009815999      <not counted>      uncore_cha_0/event=0x0/                                       (95.28%)
    54.009961809      <not counted>      uncore_cha_0/event=0x0/                                       (93.52%)
    55.010110972      <not counted>      uncore_cha_0/event=0x0/                                       (91.82%)
    56.010217579      <not counted>      uncore_cha_0/event=0x0/                                       (90.18%)

Signed-off-by: Stephane Eranian <eranian@google.com>
---
 kernel/events/core.c | 67 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

Comments

Song Liu Oct. 18, 2019, 6:13 a.m. UTC | #1
> On Oct 17, 2019, at 5:27 PM, Stephane Eranian <eranian@google.com> wrote:
> 
> This patch complements the following commit:
> 7fa343b7fdc4 ("perf/core: Fix corner case in perf_rotate_context()")
> 
> The fix from Song addresses the consequences of the problem but
> not the cause. This patch fixes the causes and can sit on top of
> Song's patch.
> 
> This patch fixes a scheduling problem in the core functions of
> perf_events. Under certain conditions, some events would not be
> scheduled even though many counters would be available. This
> is related to multiplexing and is architecture agnostic and
> PMU agnostic (i.e., core or uncore).
> 
> This problem can easily be reproduced when you have two perf
> stat sessions. The first session does not cause multiplexing,
> let's say it is measuring 1 event, E1. While it is measuring,
> a second session starts and causes multiplexing. Let's say it
> adds 6 events, B1-B6. Now, 7 events compete and are multiplexed.
> When the second session terminates, all 6 (B1-B6) events are
> removed. Normally, you'd expect the E1 event to continue to run
> with no multiplexing. However, the problem is that depending on
> the state of E1 when B1-B6 are removed, it may never be scheduled
> again. If E1 was inactive at the time of removal, despite the
> multiplexing hrtimer still firing, it will not find any active
> events and will not try to reschedule. This is what Song's patch
> fixes. It forces the multiplexing code to consider non-active events.

Good analysis! I kinda knew the example I had (with pinned event)
was not the only way to trigger this issue. But I didn't think 
about the event remove path.

> However, the cause is not addressed. The kernel should not rely on
> the multiplexing hrtimer to unblock inactive events. That timer
> can have an arbitrary duration in the milliseconds. Until the timer
> fires, counters are available, but no measurable events are using
> them. We do not want to introduce blind spots of arbitrary durations.
> 
> This patch addresses the cause of the problem, by checking that,
> when an event is disabled or removed and the context was multiplexing
> events, inactive events immediately get a chance to be scheduled by
> calling ctx_resched(). The rescheduling is done for events of equal
> or lower priority types. With that in place, as soon as a counter
> is freed, schedulable inactive events may run, thereby eliminating
> a blind spot.
> 
> This can be illustrated as follows using Skylake uncore CHA here:
> 
> $ perf stat --no-merge -a -I 1000 -C 28 -e uncore_cha_0/event=0x0/
>    42.007856531      2,000,291,322      uncore_cha_0/event=0x0/
>    43.008062166      2,000,399,526      uncore_cha_0/event=0x0/
>    44.008293244      2,000,473,720      uncore_cha_0/event=0x0/
>    45.008501847      2,000,423,420      uncore_cha_0/event=0x0/
>    46.008706558      2,000,411,132      uncore_cha_0/event=0x0/
>    47.008928543      2,000,441,660      uncore_cha_0/event=0x0/
> 
> Adding a second session with 4 events for 4s
> 
> $ perf stat -a -I 1000 -C 28 --no-merge -e uncore_cha_0/event=0x0/,uncore_cha_0/event=0x0/,uncore_cha_0/event=0x0/,uncore_cha_0/event=0x0/ sleep 4
>    48.009114643      1,983,129,830      uncore_cha_0/event=0x0/                                       (99.71%)
>    49.009279616      1,976,067,751      uncore_cha_0/event=0x0/                                       (99.30%)
>    50.009428660      1,974,448,006      uncore_cha_0/event=0x0/                                       (98.92%)
>    51.009524309      1,973,083,237      uncore_cha_0/event=0x0/                                       (98.55%)
>    52.009673467      1,972,097,678      uncore_cha_0/event=0x0/                                       (97.11%)
> 
> End of 4s, second session is removed. But the first event does not schedule and never will
> unless new events force multiplexing again.
> 
>    53.009815999      <not counted>      uncore_cha_0/event=0x0/                                       (95.28%)
>    54.009961809      <not counted>      uncore_cha_0/event=0x0/                                       (93.52%)
>    55.010110972      <not counted>      uncore_cha_0/event=0x0/                                       (91.82%)
>    56.010217579      <not counted>      uncore_cha_0/event=0x0/                                       (90.18%)

Does this still happen after my patch? I was not able to repro this. 

> Signed-off-by: Stephane Eranian <eranian@google.com>

Maybe add:
Fixes: 8d5bce0c37fa ("perf/core: Optimize perf_rotate_context() event scheduling")

Thanks,
Song
Stephane Eranian Oct. 18, 2019, 6:19 a.m. UTC | #2
On Thu, Oct 17, 2019 at 11:13 PM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Oct 17, 2019, at 5:27 PM, Stephane Eranian <eranian@google.com> wrote:
> >
> > This patch complements the following commit:
> > 7fa343b7fdc4 ("perf/core: Fix corner case in perf_rotate_context()")
> >
> > The fix from Song addresses the consequences of the problem but
> > not the cause. This patch fixes the causes and can sit on top of
> > Song's patch.
> >
> > This patch fixes a scheduling problem in the core functions of
> > perf_events. Under certain conditions, some events would not be
> > scheduled even though many counters would be available. This
> > is related to multiplexing and is architecture agnostic and
> > PMU agnostic (i.e., core or uncore).
> >
> > This problem can easily be reproduced when you have two perf
> > stat sessions. The first session does not cause multiplexing,
> > let's say it is measuring 1 event, E1. While it is measuring,
> > a second session starts and causes multiplexing. Let's say it
> > adds 6 events, B1-B6. Now, 7 events compete and are multiplexed.
> > When the second session terminates, all 6 (B1-B6) events are
> > removed. Normally, you'd expect the E1 event to continue to run
> > with no multiplexing. However, the problem is that depending on
> > the state of E1 when B1-B6 are removed, it may never be scheduled
> > again. If E1 was inactive at the time of removal, despite the
> > multiplexing hrtimer still firing, it will not find any active
> > events and will not try to reschedule. This is what Song's patch
> > fixes. It forces the multiplexing code to consider non-active events.
>
> Good analysis! I kinda knew the example I had (with pinned event)
> was not the only way to trigger this issue. But I didn't think
> about the event remove path.
>
I was pursuing this bug without knowledge of your patch. Your patch
makes it harder to see. Clearly in my test case, it disappears, but that is
just because of the multiplexing interval. If we get to the rotate code
and we have no active events yet some inactive, there is something
wrong because we are wasting counters. When I tracked the bug,
I started from the remove_context code, then realized there was also
the disable case. I fixed both and then I discovered your patch, which
is fixing it at the receiving end. Hopefully, there aren't any other
code paths that can lead to this situation.


> > However, the cause is not addressed. The kernel should not rely on
> > the multiplexing hrtimer to unblock inactive events. That timer
> > can have an arbitrary duration in the milliseconds. Until the timer
> > fires, counters are available, but no measurable events are using
> > them. We do not want to introduce blind spots of arbitrary durations.
> >
> > This patch addresses the cause of the problem, by checking that,
> > when an event is disabled or removed and the context was multiplexing
> > events, inactive events immediately get a chance to be scheduled by
> > calling ctx_resched(). The rescheduling is done for events of equal
> > or lower priority types. With that in place, as soon as a counter
> > is freed, schedulable inactive events may run, thereby eliminating
> > a blind spot.
> >
> > This can be illustrated as follows using Skylake uncore CHA here:
> >
> > $ perf stat --no-merge -a -I 1000 -C 28 -e uncore_cha_0/event=0x0/
> >    42.007856531      2,000,291,322      uncore_cha_0/event=0x0/
> >    43.008062166      2,000,399,526      uncore_cha_0/event=0x0/
> >    44.008293244      2,000,473,720      uncore_cha_0/event=0x0/
> >    45.008501847      2,000,423,420      uncore_cha_0/event=0x0/
> >    46.008706558      2,000,411,132      uncore_cha_0/event=0x0/
> >    47.008928543      2,000,441,660      uncore_cha_0/event=0x0/
> >
> > Adding a second session with 4 events for 4s
> >
> > $ perf stat -a -I 1000 -C 28 --no-merge -e uncore_cha_0/event=0x0/,uncore_cha_0/event=0x0/,uncore_cha_0/event=0x0/,uncore_cha_0/event=0x0/ sleep 4
> >    48.009114643      1,983,129,830      uncore_cha_0/event=0x0/                                       (99.71%)
> >    49.009279616      1,976,067,751      uncore_cha_0/event=0x0/                                       (99.30%)
> >    50.009428660      1,974,448,006      uncore_cha_0/event=0x0/                                       (98.92%)
> >    51.009524309      1,973,083,237      uncore_cha_0/event=0x0/                                       (98.55%)
> >    52.009673467      1,972,097,678      uncore_cha_0/event=0x0/                                       (97.11%)
> >
> > End of 4s, second session is removed. But the first event does not schedule and never will
> > unless new events force multiplexing again.
> >
> >    53.009815999      <not counted>      uncore_cha_0/event=0x0/                                       (95.28%)
> >    54.009961809      <not counted>      uncore_cha_0/event=0x0/                                       (93.52%)
> >    55.010110972      <not counted>      uncore_cha_0/event=0x0/                                       (91.82%)
> >    56.010217579      <not counted>      uncore_cha_0/event=0x0/                                       (90.18%)
>
> Does this still happen after my patch? I was not able to repro this.
>
> > Signed-off-by: Stephane Eranian <eranian@google.com>
>
> Maybe add:
> Fixes: 8d5bce0c37fa ("perf/core: Optimize perf_rotate_context() event scheduling")
>
It does not really fix your patch; I think we can keep it as a double
precaution. It fixes the causes. I think it is useful to check beyond
the active events in the rotate code as well.



> Thanks,
> Song
Song Liu Oct. 18, 2019, 6:55 a.m. UTC | #3
> On Oct 17, 2019, at 11:19 PM, Stephane Eranian <eranian@google.com> wrote:
> 
> On Thu, Oct 17, 2019 at 11:13 PM Song Liu <songliubraving@fb.com> wrote:
>> 
>> 
>> 
>>> On Oct 17, 2019, at 5:27 PM, Stephane Eranian <eranian@google.com> wrote:
>>> 
>>> This patch complements the following commit:
>>> 7fa343b7fdc4 ("perf/core: Fix corner case in perf_rotate_context()")
>>> 
>>> The fix from Song addresses the consequences of the problem but
>>> not the cause. This patch fixes the causes and can sit on top of
>>> Song's patch.
>>> 
>>> This patch fixes a scheduling problem in the core functions of
>>> perf_events. Under certain conditions, some events would not be
>>> scheduled even though many counters would be available. This
>>> is related to multiplexing and is architecture agnostic and
>>> PMU agnostic (i.e., core or uncore).
>>> 
>>> This problem can easily be reproduced when you have two perf
>>> stat sessions. The first session does not cause multiplexing,
>>> let's say it is measuring 1 event, E1. While it is measuring,
>>> a second session starts and causes multiplexing. Let's say it
>>> adds 6 events, B1-B6. Now, 7 events compete and are multiplexed.
>>> When the second session terminates, all 6 (B1-B6) events are
>>> removed. Normally, you'd expect the E1 event to continue to run
>>> with no multiplexing. However, the problem is that depending on
>>> the state of E1 when B1-B6 are removed, it may never be scheduled
>>> again. If E1 was inactive at the time of removal, despite the
>>> multiplexing hrtimer still firing, it will not find any active
>>> events and will not try to reschedule. This is what Song's patch
>>> fixes. It forces the multiplexing code to consider non-active events.
>> 
>> Good analysis! I kinda knew the example I had (with pinned event)
>> was not the only way to trigger this issue. But I didn't think
>> about the event remove path.
>> 
> I was pursuing this bug without knowledge of your patch. Your patch
> makes it harder to see. Clearly in my test case, it disappears, but that is
> just because of the multiplexing interval. If we get to the rotate code
> and we have no active events yet some inactive, there is something
> wrong because we are wasting counters. When I tracked the bug,
> I started from the remove_context code, then realized there was also
> the disable case. I fixed both and then I discovered your patch, which
> is fixing it at the receiving end. Hopefully, there aren't any other
> code paths that can lead to this situation.

Thanks for the explanation. Agreed that the blind spot has a bigger impact
with a longer rotation interval.

[...]

>>> Signed-off-by: Stephane Eranian <eranian@google.com>
>> 
>> Maybe add:
>> Fixes: 8d5bce0c37fa ("perf/core: Optimize perf_rotate_context() event scheduling")
>> 
> It does not really fix your patch, I think we can keep it as a double
> precaution. It fixes
> the causes. I think it is useful to check beyond the active in the
> rotate code as well.

Also agreed, this is not really fixing that specific commit. 

Acked-by: Song Liu <songliubraving@fb.com>

Thanks,
Song
Peter Zijlstra Oct. 21, 2019, 10:05 a.m. UTC | #4
On Thu, Oct 17, 2019 at 05:27:46PM -0700, Stephane Eranian wrote:
> @@ -2153,6 +2157,7 @@ __perf_remove_from_context(struct perf_event *event,
>  			   void *info)
>  {
>  	unsigned long flags = (unsigned long)info;
> +	int was_necessary = ctx->rotate_necessary;
>  
>  	if (ctx->is_active & EVENT_TIME) {
>  		update_context_time(ctx);
> @@ -2171,6 +2176,37 @@ __perf_remove_from_context(struct perf_event *event,
>  			cpuctx->task_ctx = NULL;
>  		}
>  	}
> +
> +	/*
> +	 * sanity check that event_sched_out() does not and will not
> +	 * change the state of ctx->rotate_necessary
> +	 */
> +	WARN_ON(was_necessary != event->ctx->rotate_necessary);

It doesn't... why is this important to check?

> +	/*
> +	 * if we remove an event AND we were multiplexing then, that means
> +	 * we had more events than we have counters for, and thus, at least,
> +	 * one event was in INACTIVE state. Now, that we removed an event,
> +	 * we need to resched to give a chance to all events to get scheduled,
> +	 * otherwise some may get stuck.
> +	 *
> +	 * By the time this function is called the event is usually in the OFF
> +	 * state.
> +	 * Note that this is not a duplicate of the same code in _perf_event_disable()
> +	 * because the call path are different. Some events may be simply disabled

It is the exact same code twice though; IIRC this C language has a
feature to help with that.

> +	 * others removed. There is a way to get removed and not be disabled first.
> +	 */
> +	if (ctx->rotate_necessary && ctx->nr_events) {
> +		int type = get_event_type(event);
> +		/*
> +		 * In case we removed a pinned event, then we need to
> +		 * resched for both pinned and flexible events. The
> +		 * opposite is not true. A pinned event can never be
> +		 * inactive due to multiplexing.
> +		 */
> +		if (type & EVENT_PINNED)
> +			type |= EVENT_FLEXIBLE;
> +		ctx_resched(cpuctx, cpuctx->task_ctx, type);
> +	}

What you're relying on is that ->rotate_necessary implies ->is_active
and there's pending events. And if we tighten ->rotate_necessary you can
remove the && ->nr_events.

> @@ -2232,6 +2270,35 @@ static void __perf_event_disable(struct perf_event *event,
>  		event_sched_out(event, cpuctx, ctx);
>  
>  	perf_event_set_state(event, PERF_EVENT_STATE_OFF);
> +	/*
> +	 * sanity check that event_sched_out() does not and will not
> +	 * change the state of ctx->rotate_necessary
> +	 */
> +	WARN_ON_ONCE(was_necessary != event->ctx->rotate_necessary);
> +
> +	/*
> +	 * if we disable an event AND we were multiplexing then, that means
> +	 * we had more events than we have counters for, and thus, at least,
> +	 * one event was in INACTIVE state. Now, that we disabled an event,
> +	 * we need to resched to give a chance to all events to be scheduled,
> +	 * otherwise some may get stuck.
> +	 *
> +	 * Note that this is not a duplicate of the same code in
> +	 * __perf_remove_from_context()
> +	 * because events can be disabled without being removed.

It _IS_ a duplicate, it is the _exact_ same code twice. What you're
trying to say is that we need it in both places, but that's something
else entirely.

> +	 */
> +	if (ctx->rotate_necessary && ctx->nr_events) {
> +		int type = get_event_type(event);
> +		/*
> +		 * In case we removed a pinned event, then we need to
> +		 * resched for both pinned and flexible events. The
> +		 * opposite is not true. A pinned event can never be
> +		 * inactive due to multiplexing.
> +		 */
> +		if (type & EVENT_PINNED)
> +			type |= EVENT_FLEXIBLE;
> +		ctx_resched(cpuctx, cpuctx->task_ctx, type);
> +	}
>  }
Peter Zijlstra Oct. 21, 2019, 10:20 a.m. UTC | #5
On Thu, Oct 17, 2019 at 05:27:46PM -0700, Stephane Eranian wrote:
> This patch complements the following commit:
> 7fa343b7fdc4 ("perf/core: Fix corner case in perf_rotate_context()")
> 
> The fix from Song addresses the consequences of the problem but
> not the cause. This patch fixes the causes and can sit on top of
> Song's patch.

I'm tempted to say the other way around.

Consider the case where you claim fixed2 with a pinned event and then
have another fixed2 in the flexible list. At that point you're _never_
going to run any other flexible events (without Song's patch).

This patch isn't going to help with that. Similarly, Song's patch helps
with your situation where it will allow rotation to resume after you
disable/remove all active events (while you still have pending events).

> This patch fixes a scheduling problem in the core functions of
> perf_events. Under certain conditions, some events would not be
> scheduled even though many counters would be available. This
> is related to multiplexing and is architecture agnostic and
> PMU agnostic (i.e., core or uncore).
> 
> This problem can easily be reproduced when you have two perf
> stat sessions. The first session does not cause multiplexing,
> let's say it is measuring 1 event, E1. While it is measuring,
> a second session starts and causes multiplexing. Let's say it
> adds 6 events, B1-B6. Now, 7 events compete and are multiplexed.
> When the second session terminates, all 6 (B1-B6) events are
> removed. Normally, you'd expect the E1 event to continue to run
> with no multiplexing. However, the problem is that depending on
> the state of E1 when B1-B6 are removed, it may never be scheduled
> again. If E1 was inactive at the time of removal, despite the
> multiplexing hrtimer still firing, it will not find any active
> events and will not try to reschedule. This is what Song's patch
> fixes. It forces the multiplexing code to consider non-active events.

This; so Song's patch fixes the fundamental problem of the rotation not
working right under certain conditions.

> However, the cause is not addressed. The kernel should not rely on
> the multiplexing hrtimer to unblock inactive events. That timer
> can have an arbitrary duration in the milliseconds. Until the timer
> fires, counters are available, but no measurable events are using
> them. We do not want to introduce blind spots of arbitrary durations.

This I disagree with -- you don't get a guarantee other than
timer_period/n when you multiplex, and idling the counters until the
next tick doesn't violate that at all.

> This patch addresses the cause of the problem, by checking that,
> when an event is disabled or removed and the context was multiplexing
> events, inactive events immediately get a chance to be scheduled by
> calling ctx_resched(). The rescheduling is done for events of equal
> or lower priority types. With that in place, as soon as a counter
> is freed, schedulable inactive events may run, thereby eliminating
> a blind spot.

Disagreed, Song's patch removed the fundamental blind spot of rotation
completely failing.

This just slightly optimizes counter usage -- at a cost of having to
reprogram the counters more often.

Not saying we shouldn't do this, but this justification is just all
sorts of wrong.
Stephane Eranian Oct. 23, 2019, 7:06 a.m. UTC | #6
On Mon, Oct 21, 2019 at 3:06 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Oct 17, 2019 at 05:27:46PM -0700, Stephane Eranian wrote:
> > @@ -2153,6 +2157,7 @@ __perf_remove_from_context(struct perf_event *event,
> >                          void *info)
> >  {
> >       unsigned long flags = (unsigned long)info;
> > +     int was_necessary = ctx->rotate_necessary;
> >
> >       if (ctx->is_active & EVENT_TIME) {
> >               update_context_time(ctx);
> > @@ -2171,6 +2176,37 @@ __perf_remove_from_context(struct perf_event *event,
> >                       cpuctx->task_ctx = NULL;
> >               }
> >       }
> > +
> > +     /*
> > +      * sanity check that event_sched_out() does not and will not
> > +      * change the state of ctx->rotate_necessary
> > +      */
> > +     WARN_ON(was_necessary != event->ctx->rotate_necessary);
>
> It doesn't... why is this important to check?
>
I can remove that. It is leftover from debugging. It is okay to look at
the situation after event_sched_out(). Today, it does not change
rotate_necessary.

> > +     /*
> > +      * if we remove an event AND we were multiplexing then, that means
> > +      * we had more events than we have counters for, and thus, at least,
> > +      * one event was in INACTIVE state. Now, that we removed an event,
> > +      * we need to resched to give a chance to all events to get scheduled,
> > +      * otherwise some may get stuck.
> > +      *
> > +      * By the time this function is called the event is usually in the OFF
> > +      * state.
> > +      * Note that this is not a duplicate of the same code in _perf_event_disable()
> > +      * because the call path are different. Some events may be simply disabled
>
> It is the exact same code twice though; IIRC this C language has a
> feature to help with that.

Sure! I will make a function to check on the condition.

>
> > +      * others removed. There is a way to get removed and not be disabled first.
> > +      */
> > +     if (ctx->rotate_necessary && ctx->nr_events) {
> > +             int type = get_event_type(event);
> > +             /*
> > +              * In case we removed a pinned event, then we need to
> > +              * resched for both pinned and flexible events. The
> > +              * opposite is not true. A pinned event can never be
> > +              * inactive due to multiplexing.
> > +              */
> > +             if (type & EVENT_PINNED)
> > +                     type |= EVENT_FLEXIBLE;
> > +             ctx_resched(cpuctx, cpuctx->task_ctx, type);
> > +     }
>
> What you're relying on is that ->rotate_necessary implies ->is_active
> and there's pending events. And if we tighten ->rotate_necessary you can
> remove the && ->nr_events.
>
Imagine I have 6 events and 4 counters and I delete them all before
the timer expires. Then I can be in a situation where rotate_necessary
is still true and yet there are no more events in the context. That is
because only ctx_sched_out() clears rotate_necessary, IIRC. So that
is why there is the && nr_events. Now, calling ctx_resched() with no
events probably wouldn't cause any harm, just wasted work. So by
tightening, I am guessing you mean clearing rotate_necessary earlier.
But that would be tricky because the only reliable way of clearing it
is when you know you are about to reschedule everything. Removing
an event by itself may not be enough to eliminate multiplexing.


> > @@ -2232,6 +2270,35 @@ static void __perf_event_disable(struct perf_event *event,
> >               event_sched_out(event, cpuctx, ctx);
> >
> >       perf_event_set_state(event, PERF_EVENT_STATE_OFF);
> > +     /*
> > +      * sanity check that event_sched_out() does not and will not
> > +      * change the state of ctx->rotate_necessary
> > +      */
> > +     WARN_ON_ONCE(was_necessary != event->ctx->rotate_necessary);
> > +
> > +     /*
> > +      * if we disable an event AND we were multiplexing then, that means
> > +      * we had more events than we have counters for, and thus, at least,
> > +      * one event was in INACTIVE state. Now, that we disabled an event,
> > +      * we need to resched to give a chance to all events to be scheduled,
> > +      * otherwise some may get stuck.
> > +      *
> > +      * Note that this is not a duplicate of the same code in
> > +      * __perf_remove_from_context()
> > +      * because events can be disabled without being removed.
>
> It _IS_ a duplicate, it is the _exact_ same code twice. What you're
> trying to say is that we need it in both places, but that's something
> else entirely.
>

Will refactor.

> > +      */
> > +     if (ctx->rotate_necessary && ctx->nr_events) {
> > +             int type = get_event_type(event);
> > +             /*
> > +              * In case we removed a pinned event, then we need to
> > +              * resched for both pinned and flexible events. The
> > +              * opposite is not true. A pinned event can never be
> > +              * inactive due to multiplexing.
> > +              */
> > +             if (type & EVENT_PINNED)
> > +                     type |= EVENT_FLEXIBLE;
> > +             ctx_resched(cpuctx, cpuctx->task_ctx, type);
> > +     }
> >  }
>
>
Stephane Eranian Oct. 23, 2019, 7:30 a.m. UTC | #7
On Mon, Oct 21, 2019 at 3:21 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Oct 17, 2019 at 05:27:46PM -0700, Stephane Eranian wrote:
> > This patch complements the following commit:
> > 7fa343b7fdc4 ("perf/core: Fix corner case in perf_rotate_context()")
> >
> > The fix from Song addresses the consequences of the problem but
> > not the cause. This patch fixes the causes and can sit on top of
> > Song's patch.
>
> I'm tempted to say the other way around.
>
> Consider the case where you claim fixed2 with a pinned event and then
> have another fixed2 in the flexible list. At that point you're _never_
> going to run any other flexible events (without Song's patch).
>
In that case, there is no deactivation or removal of events, so yes, my patch
will not help that case. I said his patch is still useful. You gave one example,
even though in this case the rotate will not yield a reschedule of that flexible
event because fixed2 is used by a pinned event. So checking for it will not
really help.

> This patch isn't going to help with that. Similarly, Song's patch helps
> with your situation where it will allow rotation to resume after you
> disable/remove all active events (while you still have pending events).
>
Yes, it will unblock the case where active events are deactivated or
removed. But it will delay the unblocking until the next mux timer
expires. And I am saying this is too far away in many cases. For instance,
we do not run with the 1ms timer for uncore; that is way too much overhead.
Imagine this timer is set to 10ms or even 100ms; just with Song's patch, the
inactive events would have to wait for up to 100ms to be scheduled again.
This is not acceptable for us.

> > This patch fixes a scheduling problem in the core functions of
> > perf_events. Under certain conditions, some events would not be
> > scheduled even though many counters would be available. This
> > is related to multiplexing and is architecture agnostic and
> > PMU agnostic (i.e., core or uncore).
> >
> > This problem can easily be reproduced when you have two perf
> > stat sessions. The first session does not cause multiplexing,
> > let's say it is measuring 1 event, E1. While it is measuring,
> > a second session starts and causes multiplexing. Let's say it
> > adds 6 events, B1-B6. Now, 7 events compete and are multiplexed.
> > When the second session terminates, all 6 (B1-B6) events are
> > removed. Normally, you'd expect the E1 event to continue to run
> > with no multiplexing. However, the problem is that depending on
> > the state of E1 when B1-B6 are removed, it may never be scheduled
> > again. If E1 was inactive at the time of removal, despite the
> > multiplexing hrtimer still firing, it will not find any active
> > events and will not try to reschedule. This is what Song's patch
> > fixes. It forces the multiplexing code to consider non-active events.
>
> This; so Song's patch fixes the fundamental problem of the rotation not
> working right under certain conditions.
>
> > However, the cause is not addressed. The kernel should not rely on
> > the multiplexing hrtimer to unblock inactive events. That timer
> > can have arbitrary duration in the milliseconds. Until the timer
> > fires, counters are available, but no measurable events are using
> > them. We do not want to introduce blind spots of arbitrary durations.
>
> This I disagree with -- you don't get a guarantee other than
> timer_period/n when you multiplex, and idling the counters until the
> next tick doesn't violate that at all.

My take is that if you have free counters and "idling" events, the kernel
should make every effort to schedule them as soon as counters become
available. In the situation I described in the patch, once the active events
are removed, there is no more reason for multiplexing: all the counters are
free (ignoring the watchdog). Now you may argue that it could take more time
to ctx_resched() than to wait for the timer to expire, but I am not sure I
buy that. Similarly, I am not sure there is code to cancel an active mux
hrtimer when we clear rotate_necessary. Maybe we just let it lapse and clear
itself via ctx_sched_out() in the rotation code.

>
> > This patch addresses the cause of the problem, by checking that,
> > when an event is disabled or removed and the context was multiplexing
> > events, inactive events immediately get a chance to be scheduled by
> > calling ctx_resched(). The rescheduling is done on events of equal
> > or lower priority types.  With that in place, as soon as a counter
> > is freed, schedulable inactive events may run, thereby eliminating
> > a blind spot.
>
> Disagreed, Song's patch removed the fundamental blind spot of rotation
> completely failing.
Sure, it removed the infinite blocking of schedulable events. My patch
addresses the issue of having free counters after a deactivation/removal and
not scheduling the idling events on them, which creates a blind spot where
no event is monitoring.

>
> This just slightly optimizes counter usage -- at a cost of having to
> reprogram the counters more often.
>
Only on deactivation/removal AND multiplexing, so it is not every time but
only when there is an opportunity to keep the counters busy.

> Not saying we shouldn't do this, but this justification is just all
> sorts of wrong.

I think the patches are not mutually exclusive; they sort of complement each
other. My point was that if you have free counters and idling events, you
should not need to wait for one more rotation to resume active collection,
especially when the timer is set to a value larger than the default. In my
case, I debugged my issue without Song's patch; I discovered his patch when
I was ready to submit mine to LKML. My patch by itself solved the problem I
was seeing. Song's patch is needed to solve the problem of the first
flexible event not being schedulable while others could be; my patch does
not address that. It would address the case where that pinned ref-cycles
event is removed and the flexible ref-cycles event could take the counter
immediately. So I am okay with you calling this an optimization, and I think
it is useful.
Peter Zijlstra Oct. 23, 2019, 9:37 a.m. UTC | #8
On Wed, Oct 23, 2019 at 12:06:43AM -0700, Stephane Eranian wrote:
> On Mon, Oct 21, 2019 at 3:06 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, Oct 17, 2019 at 05:27:46PM -0700, Stephane Eranian wrote:

> > > +      * others removed. There is a way to get removed and not be disabled first.
> > > +      */
> > > +     if (ctx->rotate_necessary && ctx->nr_events) {
> > > +             int type = get_event_type(event);
> > > +             /*
> > > +              * In case we removed a pinned event, then we need to
> > > +              * resched for both pinned and flexible events. The
> > > +              * opposite is not true. A pinned event can never be
> > > +              * inactive due to multiplexing.
> > > +              */
> > > +             if (type & EVENT_PINNED)
> > > +                     type |= EVENT_FLEXIBLE;
> > > +             ctx_resched(cpuctx, cpuctx->task_ctx, type);
> > > +     }
> >
> > What you're relying on is that ->rotate_necessary implies ->is_active
> > and there's pending events. And if we tighten ->rotate_necessary you can
> > remove the && ->nr_events.
> >
> Imagine I have 6 events and 4 counters and I delete them all before
> the timer expires. Then I can be in a situation where
> rotate_necessary is still true and yet there are no more events in the
> context. That is because only ctx_sched_out() clears rotate_necessary,
> IIRC. So that is why there is the && nr_events. Now, calling
> ctx_resched() with no events probably wouldn't cause any harm, just
> wasted work.

> So if by tightening, I am guessing you mean clearing rotate_necessary
> earlier. But that would be tricky because the only reliable way of
> clearing it is when you know you are about to reschedule everything.
> Removing an event by itself may not be enough to eliminate
> multiplexing.

I think you're over-thinking things. The thing you test 'ctx->nr_events'
has a clear place where it drops to 0.

If we add

	ctx->nr_events--;
+	if (!ctx->nr_events && ctx->rotate_necessary)
+		ctx->rotate_necessary = 0;

to list_del_event(), we can get rid of that.

Further, since we set it on reschedule, I propose you change the above
like:

	if (ctx->rotate_necessary) {
		int type = get_event_type(event);
		/*
		 * comment..
		 */
		if (type & EVENT_PINNED)
			type |= EVENT_FLEXIBLE;
+		/*
+		 * Will be reset by ctx_resched()'s flexible_sched_in().
+		 */
+		ctx->rotate_necessary = 0;
		ctx_resched(cpuctx, cpuctx->task_ctx, type);
	}

Then rotate_necessary will be tight.
Peter Zijlstra Oct. 23, 2019, 11:02 a.m. UTC | #9
On Wed, Oct 23, 2019 at 12:30:03AM -0700, Stephane Eranian wrote:
> On Mon, Oct 21, 2019 at 3:21 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Oct 17, 2019 at 05:27:46PM -0700, Stephane Eranian wrote:
> > > This patch complements the following commit:
> > > 7fa343b7fdc4 ("perf/core: Fix corner case in perf_rotate_context()")
> > >
> > > The fix from Song addresses the consequences of the problem but
> > > not the cause. This patch fixes the causes and can sit on top of
> > > Song's patch.
> >
> > I'm tempted to say the other way around.
> >
> > Consider the case where you claim fixed2 with a pinned event and then
> > have another fixed2 in the flexible list. At that point you're _never_
> > going to run any other flexible events (without Song's patch).
> >
> In that case, there is no deactivation or removal of events, so yes, my patch
> will not help that case. I said his patch is still useful. You gave one example,
> even though in this case the rotate will not yield a reschedule of that flexible
> event because fixed2 is used by a pinned event. So checking for it, will not
> really help.

Stick 10 cycle events after the fixed2 flexible event. Without Song's
patch you'll never see those 10 cycle events get scheduled.

> > This patch isn't going to help with that. Similarly, Song's patch helps
> > with your situation where it will allow rotation to resume after you
> > disable/remove all active events (while you still have pending events).
> >
> Yes, it will unblock the case where active events are deactivated or
> removed. But it will delay the unblocking until the next mux timer
> expires. And I am saying this is too far away in many cases. For instance,
> we do not run with the 1ms timer for uncore, this is way too much overhead.
> Imagine this timer is set to 10ms or even 100ms, just with Song's patch, the
> inactive events would have to wait for up to 100ms to be scheduled again.
> This is not acceptable for us.

Then how was it acceptable to mux in the first place? And if
multiplexing wasn't acceptable, then why were you doing it?

> > > However, the cause is not addressed. The kernel should not rely on
> > > the multiplexing hrtimer to unblock inactive events. That timer
> > > can have arbitrary duration in the milliseconds. Until the timer
> > > fires, counters are available, but no measurable events are using
> > > them. We do not want to introduce blind spots of arbitrary durations.
> >
> > This I disagree with -- you don't get a guarantee other than
> > timer_period/n when you multiplex, and idling the counters until the
> > next tick doesn't violate that at all.
>
> My take is that if you have free counters and "idling" events, the kernel
> should take every effort to schedule them as soon as they become available.
> In the situation I described in the patch, once I remove the active
> events, there is no more reason for multiplexing; all the counters are
> free (ignore the watchdog).

That's fine; all I'm arguing is that the current behaviour doesn't
violate the guarantees given. Now you want to improve counter
utilization (at a cost) and that is fine. Just don't argue that there's
something broken -- there is not.

Your patch also does not fix something more fundamental than Song's
patch did. Quite the reverse. Yours is purely a utilization efficiency
thing, while Song's addressed a correctness issue.

> Now you may be arguing that it may take more time to ctx_resched() than to
> wait for the timer to expire. But I am not sure I buy that.

I'm not arguing that. All I'm saying is that fairness is not affected.

> Similarly, I am not sure there is code to cancel an active mux hrtimer
> when we clear rotate_necessary.  Maybe we just let it lapse and clear
> itself via a ctx_sched_out() in the rotation code.

Yes, we let it lapse and disable itself, I don't see the problem with
that -- also remember that the timer services two contexts.

> > > This patch addresses the cause of the problem, by checking that,
> > > when an event is disabled or removed and the context was multiplexing
> > > events, inactive events gets immediately a chance to be scheduled by
> > > calling ctx_resched(). The rescheduling is done on  event of equal
> > > or lower priority types.  With that in place, as soon as a counter
> > > is freed, schedulable inactive events may run, thereby eliminating
> > > a blind spot.
> >
> > Disagreed, Song's patch removed the fundamental blind spot of rotation
> > completely failing.

> Sure it removed the infinite blocking of schedulable events. My patch
> addresses the issue of having free counters following a
> deactivation/removal and not scheduling the idling events on them,
> thereby creating a blind spot where no event is monitoring.

If the counters were removed later (like a us before their slice
expired) there would not have been much idle time.

Either way, fairness does not mandate we schedule immediately.

> > This just slightly optimizes counter usage -- at a cost of having to
> > reprogram the counters more often.
> >
> Only on deactivation/removal AND multiplexing so it is not every time but
> only where there is an opportunity to keep the counters busy.

Sure, but it's still non-zero :-)

> > Not saying we shouldn't do this, but this justification is just all
> > sorts of wrong.
>
> I think the patches are not mutually exclusive.

I never said they were. And I'm not opposed to your patch. What I
objected to was the mischaracterization of it in the Changelog.

Present it as an optimization and all should be well.
Peter Zijlstra Oct. 23, 2019, 3:29 p.m. UTC | #10
On Wed, Oct 23, 2019 at 11:37:57AM +0200, Peter Zijlstra wrote:
> Further, since we set it on reschedule, I propose you change the above
> like:
> 
> 	if (ctx->rotate_necessary) {
> 		int type = get_event_type(event);
> 		/*
> 		 * comment..
> 		 */
> 		if (type & EVENT_PINNED)
> 			type |= EVENT_FLEXIBLE;
> +		/*
> +		 * Will be reset by ctx_resched()'s flexible_sched_in().
> +		 */
> +		ctx->rotate_necessary = 0;
> 		ctx_resched(cpuctx, cpuctx->task_ctx, type);
> 	}

n/m, that is already done through ctx_sched_out().
Stephane Eranian Oct. 23, 2019, 5:44 p.m. UTC | #11
On Wed, Oct 23, 2019 at 4:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Oct 23, 2019 at 12:30:03AM -0700, Stephane Eranian wrote:
> > On Mon, Oct 21, 2019 at 3:21 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Thu, Oct 17, 2019 at 05:27:46PM -0700, Stephane Eranian wrote:
> > > > This patch complements the following commit:
> > > > 7fa343b7fdc4 ("perf/core: Fix corner case in perf_rotate_context()")
> > > >
> > > > The fix from Song addresses the consequences of the problem but
> > > > not the cause. This patch fixes the causes and can sit on top of
> > > > Song's patch.
> > >
> > > I'm tempted to say the other way around.
> > >
> > > Consider the case where you claim fixed2 with a pinned event and then
> > > have another fixed2 in the flexible list. At that point you're _never_
> > > going to run any other flexible events (without Song's patch).
> > >
> > In that case, there is no deactivation or removal of events, so yes, my patch
> > will not help that case. I said his patch is still useful. You gave one example,
> > even though in this case the rotate will not yield a reschedule of that flexible
> > event because fixed2 is used by a pinned event. So checking for it will not
> > really help.
>
> Stick 10 cycle events after the fixed2 flexible event. Without Song's
> patch you'll never see those 10 cycle events get scheduled.
>
> > > This patch isn't going to help with that. Similarly, Song's patch helps
> > > with your situation where it will allow rotation to resume after you
> > > disable/remove all active events (while you still have pending events).
> > >
> > Yes, it will unblock the case where active events are deactivated or
> > removed. But it will delay the unblocking until the next mux timer
> > expires. And I am saying this is too far away in many cases. For instance,
> > we do not run with the 1ms timer for uncore, this is way too much overhead.
> > Imagine this timer is set to 10ms or even 100ms, just with Song's patch, the
> > inactive events would have to wait for up to 100ms to be scheduled again.
> > This is not acceptable for us.
>
> Then how was it acceptable to mux in the first place? And if
> multiplexing wasn't acceptable, then why were you doing it?
>
> > > > However, the cause is not addressed. The kernel should not rely on
> > > > the multiplexing hrtimer to unblock inactive events. That timer
> > > > can have arbitrary duration in the milliseconds. Until the timer
> > > > fires, counters are available, but no measurable events are using
> > > > them. We do not want to introduce blind spots of arbitrary durations.
> > >
> > > This I disagree with -- you don't get a guarantee other than
> > > timer_period/n when you multiplex, and idling the counters until the
> > > next tick doesn't violate that at all.
> >
> > My take is that if you have free counters and "idling" events, the kernel
> > should take every effort to schedule them as soon as they become available.
> > In the situation I described in the patch, once I remove the active
> > events, there
> > is no more reasons for multiplexing, all the counters are free (ignore
> > watchdog).
>
> That's fine; all I'm arguing is that the current behaviour doesn't
> violate the guarantees given. Now you want to improve counter
> utilization (at a cost) and that is fine. Just don't argue that there's
> something broken -- there is not.
>
> Your patch also does not fix something more fundamental than Song's
> patch did. Quite the reverse. Yours is purely a utilization efficiency
> thing, while Song's addressed a correctness issue.
>
Going back to Song's patch and his test case: it exposes a problem that was
introduced with the RB tree and multiple event list changes. In the event
scheduler, there was a guarantee that each event would get a chance to be
scheduled, because each would eventually reach the head of the list and thus
be scheduled as the first event of its priority class, assuming at least one
compatible counter was still left over from the higher priority classes.
The scheduler also still stops at the first error. In Song's case,
ref-cycles:D,ref-cycles,cycles, the pinned event is commandeering fixed2,
but I believe the rotation code was not rotating the list *even* when it
could not schedule the first event. That explains why the cycles event could
never be scheduled; it is a violation of the guarantee. At each timer, the
list must rotate. I think his patch addresses this.

> > Now you may be arguing that it may take more time to ctx_resched() than to
> > wait for the timer to expire. But I am not sure I buy that.
>
> I'm not arguing that. All I'm saying is that fairness is not affected.
>
> > Similarly, I am not sure there is code to cancel an active mux hrtimer
> > when we clear rotate_necessary.  Maybe we just let it lapse and clear
> > itself via a ctx_sched_out() in the rotation code.
>
> Yes, we let it lapse and disable itself, I don't see the problem with
> that -- also remember that the timer services two contexts.
>
> > > > This patch addresses the cause of the problem, by checking that,
> > > > when an event is disabled or removed and the context was multiplexing
> > > > events, inactive events immediately get a chance to be scheduled by
> > > > calling ctx_resched(). The rescheduling is done on events of equal
> > > > or lower priority types.  With that in place, as soon as a counter
> > > > is freed, schedulable inactive events may run, thereby eliminating
> > > > a blind spot.
> > >
> > > Disagreed, Song's patch removed the fundamental blind spot of rotation
> > > completely failing.
>
> > Sure it removed the infinite blocking of schedulable events. My patch
> > addresses the issue of having free counters following a
> > deactivation/removal and not scheduling the idling events on them,
> > thereby creating a blind spot where no event is monitoring.
>
> If the counters were removed later (like a us before their slice
> expired) there would not have been much idle time.
>
> Either way, fairness does not mandate we schedule immediately.
>
> > > This just slightly optimizes counter usage -- at a cost of having to
> > > reprogram the counters more often.
> > >
> > Only on deactivation/removal AND multiplexing so it is not every time but
> > only where there is an opportunity to keep the counters busy.
>
> Sure, but it's still non-zero :-)
>
> > > Not saying we shouldn't do this, but this justification is just all
> > > sorts of wrong.
> >
> > I think the patches are not mutually exclusive.
>
> I never said they were. And I'm not opposed to your patch. What I
> objected to was the mischaracterization of it in the Changelog.
>
> Present it as an optimization and all should be well.

I will respin it as an optimization then.
Thanks.

Patch
diff mbox series

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9ec0b0bfddbd..578587246ffb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2140,6 +2140,10 @@  group_sched_out(struct perf_event *group_event,
 
 #define DETACH_GROUP	0x01UL
 
+static void ctx_resched(struct perf_cpu_context *cpuctx,
+			struct perf_event_context *task_ctx,
+			enum event_type_t event_type);
+
 /*
  * Cross CPU call to remove a performance event
  *
@@ -2153,6 +2157,7 @@  __perf_remove_from_context(struct perf_event *event,
 			   void *info)
 {
 	unsigned long flags = (unsigned long)info;
+	int was_necessary = ctx->rotate_necessary;
 
 	if (ctx->is_active & EVENT_TIME) {
 		update_context_time(ctx);
@@ -2171,6 +2176,37 @@  __perf_remove_from_context(struct perf_event *event,
 			cpuctx->task_ctx = NULL;
 		}
 	}
+
+	/*
+	 * sanity check that event_sched_out() does not and will not
+	 * change the state of ctx->rotate_necessary
+	 */
+	WARN_ON(was_necessary != event->ctx->rotate_necessary);
+	/*
+	 * if we remove an event AND we were multiplexing then, that means
+	 * we had more events than we have counters for, and thus, at least,
+	 * one event was in INACTIVE state. Now that we removed an event,
+	 * we need to resched to give a chance to all events to get scheduled,
+	 * otherwise some may get stuck.
+	 *
+	 * By the time this function is called the event is usually in the OFF
+	 * state.
+	 * Note that this is not a duplicate of the same code in _perf_event_disable()
+	 * because the call paths are different. Some events may be simply disabled,
+	 * others removed. There is a way to get removed without being disabled first.
+	 */
+	if (ctx->rotate_necessary && ctx->nr_events) {
+		int type = get_event_type(event);
+		/*
+		 * In case we removed a pinned event, then we need to
+		 * resched for both pinned and flexible events. The
+		 * opposite is not true. A pinned event can never be
+		 * inactive due to multiplexing.
+		 */
+		if (type & EVENT_PINNED)
+			type |= EVENT_FLEXIBLE;
+		ctx_resched(cpuctx, cpuctx->task_ctx, type);
+	}
 }
 
 /*
@@ -2218,6 +2254,8 @@  static void __perf_event_disable(struct perf_event *event,
 				 struct perf_event_context *ctx,
 				 void *info)
 {
+	int was_necessary = ctx->rotate_necessary;
+
 	if (event->state < PERF_EVENT_STATE_INACTIVE)
 		return;
 
@@ -2232,6 +2270,35 @@  static void __perf_event_disable(struct perf_event *event,
 		event_sched_out(event, cpuctx, ctx);
 
 	perf_event_set_state(event, PERF_EVENT_STATE_OFF);
+	/*
+	 * sanity check that event_sched_out() does not and will not
+	 * change the state of ctx->rotate_necessary
+	 */
+	WARN_ON_ONCE(was_necessary != event->ctx->rotate_necessary);
+
+	/*
+	 * if we disable an event AND we were multiplexing then, that means
+	 * we had more events than we have counters for, and thus, at least,
+	 * one event was in INACTIVE state. Now that we disabled an event,
+	 * we need to resched to give a chance to all events to be scheduled,
+	 * otherwise some may get stuck.
+	 *
+	 * Note that this is not a duplicate of the same code in
+	 * __perf_remove_from_context()
+	 * because events can be disabled without being removed.
+	 */
+	if (ctx->rotate_necessary && ctx->nr_events) {
+		int type = get_event_type(event);
+		/*
+		 * In case we disabled a pinned event, then we need to
+		 * resched for both pinned and flexible events. The
+		 * opposite is not true. A pinned event can never be
+		 * inactive due to multiplexing.
+		 */
+		if (type & EVENT_PINNED)
+			type |= EVENT_FLEXIBLE;
+		ctx_resched(cpuctx, cpuctx->task_ctx, type);
+	}
 }
 
 /*