* Re: WARN_ON_ONCE() hit at kernel/events/core.c:330
2019-04-04 13:03 ` Peter Zijlstra
@ 2019-04-04 13:21 ` Thomas-Mich Richter
2019-04-05 10:18 ` Thomas-Mich Richter
2019-04-08 7:12 ` Thomas-Mich Richter
2 siblings, 0 replies; 22+ messages in thread
From: Thomas-Mich Richter @ 2019-04-04 13:21 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Kees Cook, acme, Linux Kernel Mailing List, Heiko Carstens,
Hendrik Brueckner, Martin Schwidefsky
On 4/4/19 3:03 PM, Peter Zijlstra wrote:
> On Thu, Apr 04, 2019 at 01:09:09PM +0200, Peter Zijlstra wrote:
>
>> That is not entirely the scenario I talked about, but *groan*.
>>
>> So what I meant was:
>>
>>   CPU-0				CPU-n
>>
>>   __schedule()
>>     local_irq_disable()
>>
>>     ...
>>     deactivate_task(prev);
>>
>> 				try_to_wake_up(@p)
>> 				  ...
>> 				  smp_cond_load_acquire(&p->on_cpu, !VAL);
>>
>>     <PMI>
>>       ..
>>       perf_event_disable_inatomic()
>>         event->pending_disable = 1;
>>         irq_work_queue() /* self-IPI */
>>     </PMI>
>>
>>     context_switch()
>>       prepare_task_switch()
>>         perf_event_task_sched_out()
>>           // the above chain that clears pending_disable
>>
>>       finish_task_switch()
>>         finish_task()
>>           smp_store_release(prev->on_cpu, 0);
>> 				  /* finally.... */
>> 				  // take woken
>> 				  // context_switch to @p
>>         finish_lock_switch()
>>           raw_spin_unlock_irq()
>>     /* w00t, IRQs enabled, self-IPI time */
>>     <self-IPI>
>>       perf_pending_event()
>>         // event->pending_disable == 0
>>     </self-IPI>
>>
>>
>> What you're suggesting, is that the time between:
>>
>> smp_store_release(prev->on_cpu, 0);
>>
>> and
>>
>> <self-IPI>
>>
>> on CPU-0 is sufficient for CPU-n to context switch to the task, enable
>> the event there, trigger a PMI that calls perf_event_disable_inatomic()
>> _again_ (this would mean irq_work_queue() failing, which we don't check)
>> (and schedule out again, although that's not required).
>>
>> This being virt that might actually be possible if (v)CPU-0 takes a nap
>> I suppose.
>>
>> Let me think about this a little more...
>
> Does the below cure things? It's not exactly pretty, but it could just
> do the trick.
>
> ---
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index dfc4bab0b02b..d496e6911442 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2009,8 +2009,8 @@ event_sched_out(struct perf_event *event,
>  	event->pmu->del(event, 0);
>  	event->oncpu = -1;
>  
> -	if (event->pending_disable) {
> -		event->pending_disable = 0;
> +	if (event->pending_disable == smp_processor_id()) {
> +		event->pending_disable = -1;
>  		state = PERF_EVENT_STATE_OFF;
>  	}
>  	perf_event_set_state(event, state);
> @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL_GPL(perf_event_disable);
>  
>  void perf_event_disable_inatomic(struct perf_event *event)
>  {
> -	event->pending_disable = 1;
> +	event->pending_disable = smp_processor_id();
>  	irq_work_queue(&event->pending);
>  }
>  
> @@ -5822,8 +5822,8 @@ static void perf_pending_event(struct irq_work *entry)
>  	 * and we won't recurse 'further'.
>  	 */
>  
> -	if (event->pending_disable) {
> -		event->pending_disable = 0;
> +	if (event->pending_disable == smp_processor_id()) {
> +		event->pending_disable = -1;
>  		perf_event_disable_local(event);
>  	}
>  
> @@ -10236,6 +10236,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>  
>  
>  	init_waitqueue_head(&event->waitq);
> +	event->pending_disable = -1;
>  	init_irq_work(&event->pending, perf_pending_event);
>  
>  	mutex_init(&event->mmap_mutex);
>
Thanks for the quick reply, will give it a try...
--
Thomas Richter, Dept 3252, IBM s390 Linux Development, Boeblingen, Germany
--
Vorsitzender des Aufsichtsrats: Matthias Hartmann
Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294
* Re: WARN_ON_ONCE() hit at kernel/events/core.c:330
2019-04-04 13:03 ` Peter Zijlstra
2019-04-04 13:21 ` Thomas-Mich Richter
@ 2019-04-05 10:18 ` Thomas-Mich Richter
2019-04-05 11:46 ` Peter Zijlstra
2019-04-08 7:12 ` Thomas-Mich Richter
2 siblings, 1 reply; 22+ messages in thread
From: Thomas-Mich Richter @ 2019-04-05 10:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Kees Cook, acme, Linux Kernel Mailing List, Heiko Carstens,
Hendrik Brueckner, Martin Schwidefsky
On 4/4/19 3:03 PM, Peter Zijlstra wrote:
> On Thu, Apr 04, 2019 at 01:09:09PM +0200, Peter Zijlstra wrote:
>
>> That is not entirely the scenario I talked about, but *groan*.
>>
>> So what I meant was:
>>
>>   CPU-0				CPU-n
>>
>>   __schedule()
>>     local_irq_disable()
>>
>>     ...
>>     deactivate_task(prev);
>>
>> 				try_to_wake_up(@p)
>> 				  ...
>> 				  smp_cond_load_acquire(&p->on_cpu, !VAL);
>>
>>     <PMI>
>>       ..
>>       perf_event_disable_inatomic()
>>         event->pending_disable = 1;
>>         irq_work_queue() /* self-IPI */
>>     </PMI>
>>
>>     context_switch()
>>       prepare_task_switch()
>>         perf_event_task_sched_out()
>>           // the above chain that clears pending_disable
>>
>>       finish_task_switch()
>>         finish_task()
>>           smp_store_release(prev->on_cpu, 0);
>> 				  /* finally.... */
>> 				  // take woken
>> 				  // context_switch to @p
>>         finish_lock_switch()
>>           raw_spin_unlock_irq()
>>     /* w00t, IRQs enabled, self-IPI time */
>>     <self-IPI>
>>       perf_pending_event()
>>         // event->pending_disable == 0
>>     </self-IPI>
>>
>>
>> What you're suggesting, is that the time between:
>>
>> smp_store_release(prev->on_cpu, 0);
>>
>> and
>>
>> <self-IPI>
>>
>> on CPU-0 is sufficient for CPU-n to context switch to the task, enable
>> the event there, trigger a PMI that calls perf_event_disable_inatomic()
>> _again_ (this would mean irq_work_queue() failing, which we don't check)
>> (and schedule out again, although that's not required).
>>
>> This being virt that might actually be possible if (v)CPU-0 takes a nap
>> I suppose.
>>
>> Let me think about this a little more...
>
> Does the below cure things? It's not exactly pretty, but it could just
> do the trick.
>
Thanks a lot for the patch; I have built a new kernel and will let it run over the weekend.
s390 does not have a PMI; all interrupts (including the measurement interrupts from
the PMU) are normal, maskable interrupts.
--
Thomas Richter, Dept 3252, IBM s390 Linux Development, Boeblingen, Germany
* Re: WARN_ON_ONCE() hit at kernel/events/core.c:330
2019-04-05 10:18 ` Thomas-Mich Richter
@ 2019-04-05 11:46 ` Peter Zijlstra
0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2019-04-05 11:46 UTC (permalink / raw)
To: Thomas-Mich Richter
Cc: Kees Cook, acme, Linux Kernel Mailing List, Heiko Carstens,
Hendrik Brueckner, Martin Schwidefsky
On Fri, Apr 05, 2019 at 12:18:54PM +0200, Thomas-Mich Richter wrote:
> On 4/4/19 3:03 PM, Peter Zijlstra wrote:
> > On Thu, Apr 04, 2019 at 01:09:09PM +0200, Peter Zijlstra wrote:
> >
> >> That is not entirely the scenario I talked about, but *groan*.
> >>
> >> So what I meant was:
> >>
> >>   CPU-0				CPU-n
> >>
> >>   __schedule()
> >>     local_irq_disable()
> >>
> >>     ...
> >>     deactivate_task(prev);
> >>
> >> 				try_to_wake_up(@p)
> >> 				  ...
> >> 				  smp_cond_load_acquire(&p->on_cpu, !VAL);
> >>
> >>     <PMI>
> >>       ..
> >>       perf_event_disable_inatomic()
> >>         event->pending_disable = 1;
> >>         irq_work_queue() /* self-IPI */
> >>     </PMI>
> >>
> >>     context_switch()
> >>       prepare_task_switch()
> >>         perf_event_task_sched_out()
> >>           // the above chain that clears pending_disable
> >>
> >>       finish_task_switch()
> >>         finish_task()
> >>           smp_store_release(prev->on_cpu, 0);
> >> 				  /* finally.... */
> >> 				  // take woken
> >> 				  // context_switch to @p
> >>         finish_lock_switch()
> >>           raw_spin_unlock_irq()
> >>     /* w00t, IRQs enabled, self-IPI time */
> >>     <self-IPI>
> >>       perf_pending_event()
> >>         // event->pending_disable == 0
> >>     </self-IPI>
> >>
> >>
> >> What you're suggesting, is that the time between:
> >>
> >> smp_store_release(prev->on_cpu, 0);
> >>
> >> and
> >>
> >> <self-IPI>
> >>
> >> on CPU-0 is sufficient for CPU-n to context switch to the task, enable
> >> the event there, trigger a PMI that calls perf_event_disable_inatomic()
> >> _again_ (this would mean irq_work_queue() failing, which we don't check)
> >> (and schedule out again, although that's not required).
> >>
> >> This being virt that might actually be possible if (v)CPU-0 takes a nap
> >> I suppose.
> >>
> >> Let me think about this a little more...
> >
> > Does the below cure things? It's not exactly pretty, but it could just
> > do the trick.
> >
>
> Thanks a lot for the patch; I have built a new kernel and will let it run over the weekend.
>
> s390 does not have a PMI; all interrupts (including the measurement interrupts from
> the PMU) are normal, maskable interrupts.
Please implement arch_irq_work_raise() for s390 though; traditionally
it was only required if you had NMI-like PMIs, and IRQ-based PMIs
would have to run irq_work_run() at the end of their handler.
The s390-sf handler does not do this, but even if it were to do that, it
wouldn't be sufficient, since you also drain the buffer from
pmu::stop().
Not doing this will get you some 'weird' behaviour. As in, your signals
could be delivered very late, etc.
* Re: WARN_ON_ONCE() hit at kernel/events/core.c:330
2019-04-04 13:03 ` Peter Zijlstra
2019-04-04 13:21 ` Thomas-Mich Richter
2019-04-05 10:18 ` Thomas-Mich Richter
@ 2019-04-08 7:12 ` Thomas-Mich Richter
2019-04-08 8:22 ` Peter Zijlstra
2 siblings, 1 reply; 22+ messages in thread
From: Thomas-Mich Richter @ 2019-04-08 7:12 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Kees Cook, acme, Linux Kernel Mailing List, Heiko Carstens,
Hendrik Brueckner, Martin Schwidefsky
On 4/4/19 3:03 PM, Peter Zijlstra wrote:
> On Thu, Apr 04, 2019 at 01:09:09PM +0200, Peter Zijlstra wrote:
>
>> That is not entirely the scenario I talked about, but *groan*.
>>
>> So what I meant was:
>>
>>   CPU-0				CPU-n
>>
>>   __schedule()
>>     local_irq_disable()
>>
>>     ...
>>     deactivate_task(prev);
>>
>> 				try_to_wake_up(@p)
>> 				  ...
>> 				  smp_cond_load_acquire(&p->on_cpu, !VAL);
>>
>>     <PMI>
>>       ..
>>       perf_event_disable_inatomic()
>>         event->pending_disable = 1;
>>         irq_work_queue() /* self-IPI */
>>     </PMI>
>>
>>     context_switch()
>>       prepare_task_switch()
>>         perf_event_task_sched_out()
>>           // the above chain that clears pending_disable
>>
>>       finish_task_switch()
>>         finish_task()
>>           smp_store_release(prev->on_cpu, 0);
>> 				  /* finally.... */
>> 				  // take woken
>> 				  // context_switch to @p
>>         finish_lock_switch()
>>           raw_spin_unlock_irq()
>>     /* w00t, IRQs enabled, self-IPI time */
>>     <self-IPI>
>>       perf_pending_event()
>>         // event->pending_disable == 0
>>     </self-IPI>
>>
>>
>> What you're suggesting, is that the time between:
>>
>> smp_store_release(prev->on_cpu, 0);
>>
>> and
>>
>> <self-IPI>
>>
>> on CPU-0 is sufficient for CPU-n to context switch to the task, enable
>> the event there, trigger a PMI that calls perf_event_disable_inatomic()
>> _again_ (this would mean irq_work_queue() failing, which we don't check)
>> (and schedule out again, although that's not required).
>>
>> This being virt that might actually be possible if (v)CPU-0 takes a nap
>> I suppose.
>>
>> Let me think about this a little more...
>
> Does the below cure things? It's not exactly pretty, but it could just
> do the trick.
>
> ---
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index dfc4bab0b02b..d496e6911442 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2009,8 +2009,8 @@ event_sched_out(struct perf_event *event,
>  	event->pmu->del(event, 0);
>  	event->oncpu = -1;
>  
> -	if (event->pending_disable) {
> -		event->pending_disable = 0;
> +	if (event->pending_disable == smp_processor_id()) {
> +		event->pending_disable = -1;
>  		state = PERF_EVENT_STATE_OFF;
>  	}
>  	perf_event_set_state(event, state);
> @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL_GPL(perf_event_disable);
>  
>  void perf_event_disable_inatomic(struct perf_event *event)
>  {
> -	event->pending_disable = 1;
> +	event->pending_disable = smp_processor_id();
>  	irq_work_queue(&event->pending);
>  }
>  
> @@ -5822,8 +5822,8 @@ static void perf_pending_event(struct irq_work *entry)
>  	 * and we won't recurse 'further'.
>  	 */
>  
> -	if (event->pending_disable) {
> -		event->pending_disable = 0;
> +	if (event->pending_disable == smp_processor_id()) {
> +		event->pending_disable = -1;
>  		perf_event_disable_local(event);
>  	}
>  
> @@ -10236,6 +10236,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>  
>  
>  	init_waitqueue_head(&event->waitq);
> +	event->pending_disable = -1;
>  	init_irq_work(&event->pending, perf_pending_event);
>  
>  	mutex_init(&event->mmap_mutex);
>
Peter,
very good news, your fix ran over the weekend without any hit!!!
Thanks very much for your help. Do you submit this patch to the kernel mailing list?
--
Thomas Richter, Dept 3252, IBM s390 Linux Development, Boeblingen, Germany
* Re: WARN_ON_ONCE() hit at kernel/events/core.c:330
2019-04-08 7:12 ` Thomas-Mich Richter
@ 2019-04-08 8:22 ` Peter Zijlstra
2019-04-08 8:47 ` Thomas-Mich Richter
2019-04-08 9:50 ` Peter Zijlstra
0 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2019-04-08 8:22 UTC (permalink / raw)
To: Thomas-Mich Richter
Cc: Kees Cook, acme, Linux Kernel Mailing List, Heiko Carstens,
Hendrik Brueckner, Martin Schwidefsky
On Mon, Apr 08, 2019 at 09:12:28AM +0200, Thomas-Mich Richter wrote:
> > Does the below cure things? It's not exactly pretty, but it could just
> > do the trick.
> >
> > ---
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index dfc4bab0b02b..d496e6911442 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -2009,8 +2009,8 @@ event_sched_out(struct perf_event *event,
> >  	event->pmu->del(event, 0);
> >  	event->oncpu = -1;
> >  
> > -	if (event->pending_disable) {
> > -		event->pending_disable = 0;
> > +	if (event->pending_disable == smp_processor_id()) {
> > +		event->pending_disable = -1;
> >  		state = PERF_EVENT_STATE_OFF;
> >  	}
> >  	perf_event_set_state(event, state);
> > @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL_GPL(perf_event_disable);
> >  
> >  void perf_event_disable_inatomic(struct perf_event *event)
> >  {
> > -	event->pending_disable = 1;
> > +	event->pending_disable = smp_processor_id();
> >  	irq_work_queue(&event->pending);
> >  }
> >  
> > @@ -5822,8 +5822,8 @@ static void perf_pending_event(struct irq_work *entry)
> >  	 * and we won't recurse 'further'.
> >  	 */
> >  
> > -	if (event->pending_disable) {
> > -		event->pending_disable = 0;
> > +	if (event->pending_disable == smp_processor_id()) {
> > +		event->pending_disable = -1;
> >  		perf_event_disable_local(event);
> >  	}
> >  
> > @@ -10236,6 +10236,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> >  
> >  
> >  	init_waitqueue_head(&event->waitq);
> > +	event->pending_disable = -1;
> >  	init_irq_work(&event->pending, perf_pending_event);
> >  
> >  	mutex_init(&event->mmap_mutex);
> >
>
> Peter,
>
> very good news, your fix ran over the weekend without any hit!!!
>
> Thanks very much for your help. Do you submit this patch to the kernel mailing list?
Most excellent, let me go write a Changelog.
Could I convince you to implement arch_irq_work_raise() for s390?
* Re: WARN_ON_ONCE() hit at kernel/events/core.c:330
2019-04-08 8:22 ` Peter Zijlstra
@ 2019-04-08 8:47 ` Thomas-Mich Richter
2019-04-08 9:50 ` Peter Zijlstra
1 sibling, 0 replies; 22+ messages in thread
From: Thomas-Mich Richter @ 2019-04-08 8:47 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Kees Cook, acme, Linux Kernel Mailing List, Heiko Carstens,
Hendrik Brueckner, Martin Schwidefsky
On 4/8/19 10:22 AM, Peter Zijlstra wrote:
> On Mon, Apr 08, 2019 at 09:12:28AM +0200, Thomas-Mich Richter wrote:
>>> Does the below cure things? It's not exactly pretty, but it could just
>>> do the trick.
>>>
>>> ---
>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>> index dfc4bab0b02b..d496e6911442 100644
>>> --- a/kernel/events/core.c
>>> +++ b/kernel/events/core.c
>>> @@ -2009,8 +2009,8 @@ event_sched_out(struct perf_event *event,
>>>  	event->pmu->del(event, 0);
>>>  	event->oncpu = -1;
>>>  
>>> -	if (event->pending_disable) {
>>> -		event->pending_disable = 0;
>>> +	if (event->pending_disable == smp_processor_id()) {
>>> +		event->pending_disable = -1;
>>>  		state = PERF_EVENT_STATE_OFF;
>>>  	}
>>>  	perf_event_set_state(event, state);
>>> @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL_GPL(perf_event_disable);
>>>  
>>>  void perf_event_disable_inatomic(struct perf_event *event)
>>>  {
>>> -	event->pending_disable = 1;
>>> +	event->pending_disable = smp_processor_id();
>>>  	irq_work_queue(&event->pending);
>>>  }
>>>  
>>> @@ -5822,8 +5822,8 @@ static void perf_pending_event(struct irq_work *entry)
>>>  	 * and we won't recurse 'further'.
>>>  	 */
>>>  
>>> -	if (event->pending_disable) {
>>> -		event->pending_disable = 0;
>>> +	if (event->pending_disable == smp_processor_id()) {
>>> +		event->pending_disable = -1;
>>>  		perf_event_disable_local(event);
>>>  	}
>>>  
>>> @@ -10236,6 +10236,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>>>  
>>>  
>>>  	init_waitqueue_head(&event->waitq);
>>> +	event->pending_disable = -1;
>>>  	init_irq_work(&event->pending, perf_pending_event);
>>>  
>>>  	mutex_init(&event->mmap_mutex);
>>>
>>
>> Peter,
>>
>> very good news, your fix ran over the weekend without any hit!!!
>>
>> Thanks very much for your help. Do you submit this patch to the kernel mailing list?
>
> Most excellent, let me go write a Changelog.
>
> Could I convince you to implement arch_irq_work_raise() for s390?
>
Yes, I am convinced; however, I need to discuss this with the s390 maintainers,
Martin Schwidefsky and Heiko Carstens, first.
--
Thomas Richter, Dept 3252, IBM s390 Linux Development, Boeblingen, Germany
* Re: WARN_ON_ONCE() hit at kernel/events/core.c:330
2019-04-08 8:22 ` Peter Zijlstra
2019-04-08 8:47 ` Thomas-Mich Richter
@ 2019-04-08 9:50 ` Peter Zijlstra
2019-04-08 13:28 ` Thomas-Mich Richter
` (2 more replies)
1 sibling, 3 replies; 22+ messages in thread
From: Peter Zijlstra @ 2019-04-08 9:50 UTC (permalink / raw)
To: Thomas-Mich Richter
Cc: Kees Cook, acme, Linux Kernel Mailing List, Heiko Carstens,
Hendrik Brueckner, Martin Schwidefsky, mark.rutland, jolsa
On Mon, Apr 08, 2019 at 10:22:29AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 08, 2019 at 09:12:28AM +0200, Thomas-Mich Richter wrote:
> > very good news, your fix ran over the weekend without any hit!!!
> >
> > Thanks very much for your help. Do you submit this patch to the kernel mailing list?
>
> Most excellent, let me go write a Changelog.
Hi Thomas, find below.
Sadly, while writing the Changelog I ended up with a 'completely'
different patch again, could I bother you to test this one too?
---
Subject: perf: Fix perf_event_disable_inatomic()
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu, 4 Apr 2019 15:03:00 +0200
Thomas-Mich Richter reported he triggered a WARN from event_function_local()
on his s390. The problem boils down to:
  CPU-A				CPU-B

  perf_event_overflow()
    perf_event_disable_inatomic()
      @pending_disable = 1
      irq_work_queue();

  sched-out
    event_sched_out()
      @pending_disable = 0

				sched-in
				perf_event_overflow()
				  perf_event_disable_inatomic()
				    @pending_disable = 1;
				    irq_work_queue(); // FAILS

  irq_work_run()
    perf_pending_event()
      if (@pending_disable)
	perf_event_disable_local(); // WHOOPS
The problem exists in generic code, but s390 is particularly sensitive
because it doesn't implement arch_irq_work_raise(), nor does it call
irq_work_run() from its PMU interrupt handler (nor would that be
sufficient in this case, because s390 also generates
perf_event_overflow() from pmu::stop). Add to that the fact that s390
is a virtual architecture and (virtual) CPU-A can stall long enough
for the above race to happen, even if it would self-IPI.
Adding an irq_work_syn() to event_sched_in() would work for all hardware
PMUs that properly use irq_work_run() but fails for software PMUs.
Instead encode the CPU number in @pending_disable, such that we can
tell which CPU requested the disable. This then allows us to detect
the above scenario and even redirect the IPI to make up for the failed
queue.
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Hendrik Brueckner <brueckner@linux.ibm.com>
Cc: acme@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Reported-by: Thomas-Mich Richter <tmricht@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/events/core.c | 52 ++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 43 insertions(+), 9 deletions(-)
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2009,8 +2009,8 @@ event_sched_out(struct perf_event *event
 	event->pmu->del(event, 0);
 	event->oncpu = -1;
 
-	if (event->pending_disable) {
-		event->pending_disable = 0;
+	if (READ_ONCE(event->pending_disable) >= 0) {
+		WRITE_ONCE(event->pending_disable, -1);
 		state = PERF_EVENT_STATE_OFF;
 	}
 	perf_event_set_state(event, state);
@@ -2198,7 +2198,8 @@ EXPORT_SYMBOL_GPL(perf_event_disable);
 
 void perf_event_disable_inatomic(struct perf_event *event)
 {
-	event->pending_disable = 1;
+	WRITE_ONCE(event->pending_disable, smp_processor_id());
+	/* can fail, see perf_pending_event_disable() */
 	irq_work_queue(&event->pending);
 }
 
@@ -5810,10 +5811,45 @@ void perf_event_wakeup(struct perf_event
 	}
 }
 
+static void perf_pending_event_disable(struct perf_event *event)
+{
+	int cpu = READ_ONCE(event->pending_disable);
+
+	if (cpu < 0)
+		return;
+
+	if (cpu == smp_processor_id()) {
+		WRITE_ONCE(event->pending_disable, -1);
+		perf_event_disable_local(event);
+		return;
+	}
+
+	/*
+	 * CPU-A			CPU-B
+	 *
+	 * perf_event_disable_inatomic()
+	 *   @pending_disable = CPU-A;
+	 *   irq_work_queue();
+	 *
+	 * sched-out
+	 *   @pending_disable = -1;
+	 *
+	 *				sched-in
+	 *				perf_event_disable_inatomic()
+	 *				  @pending_disable = CPU-B;
+	 *				  irq_work_queue(); // FAILS
+	 *
+	 * irq_work_run()
+	 *   perf_pending_event()
+	 *
+	 * But the event runs on CPU-B and wants disabling there.
+	 */
+	irq_work_queue_on(&event->pending, cpu);
+}
+
 static void perf_pending_event(struct irq_work *entry)
 {
-	struct perf_event *event = container_of(entry,
-						struct perf_event, pending);
+	struct perf_event *event = container_of(entry, struct perf_event, pending);
 	int rctx;
 
 	rctx = perf_swevent_get_recursion_context();
@@ -5822,10 +5858,7 @@ static void perf_pending_event(struct ir
 	 * and we won't recurse 'further'.
 	 */
 
-	if (event->pending_disable) {
-		event->pending_disable = 0;
-		perf_event_disable_local(event);
-	}
+	perf_pending_event_disable(event);
 
 	if (event->pending_wakeup) {
 		event->pending_wakeup = 0;
@@ -10236,6 +10269,7 @@ perf_event_alloc(struct perf_event_attr
 
 
 	init_waitqueue_head(&event->waitq);
+	event->pending_disable = -1;
 	init_irq_work(&event->pending, perf_pending_event);
 
 	mutex_init(&event->mmap_mutex);
* Re: WARN_ON_ONCE() hit at kernel/events/core.c:330
2019-04-08 9:50 ` Peter Zijlstra
@ 2019-04-08 13:28 ` Thomas-Mich Richter
2019-04-09 6:07 ` Thomas-Mich Richter
2019-04-09 8:53 ` Mark Rutland
2 siblings, 0 replies; 22+ messages in thread
From: Thomas-Mich Richter @ 2019-04-08 13:28 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Kees Cook, acme, Linux Kernel Mailing List, Heiko Carstens,
Hendrik Brueckner, Martin Schwidefsky, mark.rutland, jolsa
On 4/8/19 11:50 AM, Peter Zijlstra wrote:
> On Mon, Apr 08, 2019 at 10:22:29AM +0200, Peter Zijlstra wrote:
>> On Mon, Apr 08, 2019 at 09:12:28AM +0200, Thomas-Mich Richter wrote:
>
>>> very good news, your fix ran over the weekend without any hit!!!
>>>
>>> Thanks very much for your help. Do you submit this patch to the kernel mailing list?
>>
>> Most excellent, let me go write a Changelog.
>
> Hi Thomas, find below.
>
> Sadly, while writing the Changelog I ended up with a 'completely'
> different patch again, could I bother you to test this one too?
>
Hi Peter, no problem, the test is now running overnight.
I'll let you know the outcome tomorrow...
--
Thomas Richter, Dept 3252, IBM s390 Linux Development, Boeblingen, Germany
* Re: WARN_ON_ONCE() hit at kernel/events/core.c:330
2019-04-08 9:50 ` Peter Zijlstra
2019-04-08 13:28 ` Thomas-Mich Richter
@ 2019-04-09 6:07 ` Thomas-Mich Richter
2019-04-09 8:29 ` Peter Zijlstra
2019-04-09 8:53 ` Mark Rutland
2 siblings, 1 reply; 22+ messages in thread
From: Thomas-Mich Richter @ 2019-04-09 6:07 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Kees Cook, acme, Linux Kernel Mailing List, Heiko Carstens,
Hendrik Brueckner, Martin Schwidefsky, mark.rutland, jolsa
On 4/8/19 11:50 AM, Peter Zijlstra wrote:
> On Mon, Apr 08, 2019 at 10:22:29AM +0200, Peter Zijlstra wrote:
>> On Mon, Apr 08, 2019 at 09:12:28AM +0200, Thomas-Mich Richter wrote:
>
>>> very good news, your fix ran over the weekend without any hit!!!
>>>
>>> Thanks very much for your help. Do you submit this patch to the kernel mailing list?
>>
>> Most excellent, let me go write a Changelog.
>
> Hi Thomas, find below.
>
> Sadly, while writing the Changelog I ended up with a 'completely'
> different patch again, could I bother you to test this one too?
>
Hi Peter,
I verified your second patch; it ran overnight without any message in the
kernel log. Thanks again.
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
--
Thomas Richter, Dept 3252, IBM s390 Linux Development, Boeblingen, Germany
* Re: WARN_ON_ONCE() hit at kernel/events/core.c:330
2019-04-09 6:07 ` Thomas-Mich Richter
@ 2019-04-09 8:29 ` Peter Zijlstra
0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2019-04-09 8:29 UTC (permalink / raw)
To: Thomas-Mich Richter
Cc: Kees Cook, acme, Linux Kernel Mailing List, Heiko Carstens,
Hendrik Brueckner, Martin Schwidefsky, mark.rutland, jolsa
On Tue, Apr 09, 2019 at 08:07:49AM +0200, Thomas-Mich Richter wrote:
> On 4/8/19 11:50 AM, Peter Zijlstra wrote:
> > On Mon, Apr 08, 2019 at 10:22:29AM +0200, Peter Zijlstra wrote:
> >> On Mon, Apr 08, 2019 at 09:12:28AM +0200, Thomas-Mich Richter wrote:
> >
> >>> very good news, your fix ran over the weekend without any hit!!!
> >>>
> >>> Thanks very much for your help. Do you submit this patch to the kernel mailing list?
> >>
> >> Most excellent, let me go write a Changelog.
> >
> > Hi Thomas, find below.
> >
> > Sadly, while writing the Changelog I ended up with a 'completely'
> > different patch again, could I bother you to test this one too?
> >
>
> Hi Peter,
>
> I verified your second patch, it ran overnight without any message in the
> kernel log. Thanks again.
>
> Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Most excellent, thanks for that!
* Re: WARN_ON_ONCE() hit at kernel/events/core.c:330
2019-04-08 9:50 ` Peter Zijlstra
2019-04-08 13:28 ` Thomas-Mich Richter
2019-04-09 6:07 ` Thomas-Mich Richter
@ 2019-04-09 8:53 ` Mark Rutland
2019-04-10 13:51 ` Thomas-Mich Richter
2 siblings, 1 reply; 22+ messages in thread
From: Mark Rutland @ 2019-04-09 8:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas-Mich Richter, Kees Cook, acme, Linux Kernel Mailing List,
Heiko Carstens, Hendrik Brueckner, Martin Schwidefsky, jolsa
On Mon, Apr 08, 2019 at 11:50:31AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 08, 2019 at 10:22:29AM +0200, Peter Zijlstra wrote:
> > On Mon, Apr 08, 2019 at 09:12:28AM +0200, Thomas-Mich Richter wrote:
>
> > > very good news, your fix ran over the weekend without any hit!!!
> > >
> > > Thanks very much for your help. Do you submit this patch to the kernel mailing list?
> >
> > Most excellent, let me go write a Changelog.
>
> Hi Thomas, find below.
>
> Sadly, while writing the Changelog I ended up with a 'completely'
> different patch again, could I bother you to test this one too?
>
> ---
> Subject: perf: Fix perf_event_disable_inatomic()
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Thu, 4 Apr 2019 15:03:00 +0200
>
> Thomas-Mich Richter reported he triggered a WARN from event_function_local()
> on his s390. The problem boils down to:
>
>   CPU-A				CPU-B
> 
>   perf_event_overflow()
>     perf_event_disable_inatomic()
>       @pending_disable = 1
>       irq_work_queue();
> 
>   sched-out
>     event_sched_out()
>       @pending_disable = 0
> 
> 				sched-in
> 				perf_event_overflow()
> 				  perf_event_disable_inatomic()
> 				    @pending_disable = 1;
> 				    irq_work_queue(); // FAILS
> 
>   irq_work_run()
>     perf_pending_event()
>       if (@pending_disable)
> 	perf_event_disable_local(); // WHOOPS
>
> The problem exists in generic code, but s390 is particularly sensitive
> because it doesn't implement arch_irq_work_raise(), nor does it call
> irq_work_run() from its PMU interrupt handler (nor would that be
> sufficient in this case, because s390 also generates
> perf_event_overflow() from pmu::stop). Add to that the fact that s390
> is a virtual architecture and (virtual) CPU-A can stall long enough
> for the above race to happen, even if it would self-IPI.
> 
> Adding an irq_work_syn() to event_sched_in() would work for all hardware
> PMUs that properly use irq_work_run() but fails for software PMUs.
Typo: s/syn/sync/
>
> Instead encode the CPU number in @pending_disable, such that we can
> tell which CPU requested the disable. This then allows us to detect
> the above scenario and even redirect the IPI to make up for the failed
> queue.
>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Hendrik Brueckner <brueckner@linux.ibm.com>
> Cc: acme@redhat.com
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Jiri Olsa <jolsa@redhat.com>
> Reported-by: Thomas-Mich Richter <tmricht@linux.ibm.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
I can't think of a nicer way of handling this, so FWIW:
Acked-by: Mark Rutland <mark.rutland@arm.com>
Mark.
> ---
> kernel/events/core.c | 52 ++++++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 43 insertions(+), 9 deletions(-)
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2009,8 +2009,8 @@ event_sched_out(struct perf_event *event
>  	event->pmu->del(event, 0);
>  	event->oncpu = -1;
>  
> -	if (event->pending_disable) {
> -		event->pending_disable = 0;
> +	if (READ_ONCE(event->pending_disable) >= 0) {
> +		WRITE_ONCE(event->pending_disable, -1);
>  		state = PERF_EVENT_STATE_OFF;
>  	}
>  	perf_event_set_state(event, state);
> @@ -2198,7 +2198,8 @@ EXPORT_SYMBOL_GPL(perf_event_disable);
>  
>  void perf_event_disable_inatomic(struct perf_event *event)
>  {
> -	event->pending_disable = 1;
> +	WRITE_ONCE(event->pending_disable, smp_processor_id());
> +	/* can fail, see perf_pending_event_disable() */
>  	irq_work_queue(&event->pending);
>  }
>  
> @@ -5810,10 +5811,45 @@ void perf_event_wakeup(struct perf_event
>  	}
>  }
>  
> +static void perf_pending_event_disable(struct perf_event *event)
> +{
> +	int cpu = READ_ONCE(event->pending_disable);
> +
> +	if (cpu < 0)
> +		return;
> +
> +	if (cpu == smp_processor_id()) {
> +		WRITE_ONCE(event->pending_disable, -1);
> +		perf_event_disable_local(event);
> +		return;
> +	}
> +
> +	/*
> +	 * CPU-A			CPU-B
> +	 *
> +	 * perf_event_disable_inatomic()
> +	 *   @pending_disable = CPU-A;
> +	 *   irq_work_queue();
> +	 *
> +	 * sched-out
> +	 *   @pending_disable = -1;
> +	 *
> +	 *				sched-in
> +	 *				perf_event_disable_inatomic()
> +	 *				  @pending_disable = CPU-B;
> +	 *				  irq_work_queue(); // FAILS
> +	 *
> +	 * irq_work_run()
> +	 *   perf_pending_event()
> +	 *
> +	 * But the event runs on CPU-B and wants disabling there.
> +	 */
> +	irq_work_queue_on(&event->pending, cpu);
> +}
> +
>  static void perf_pending_event(struct irq_work *entry)
>  {
> -	struct perf_event *event = container_of(entry,
> -						struct perf_event, pending);
> +	struct perf_event *event = container_of(entry, struct perf_event, pending);
>  	int rctx;
>  
>  	rctx = perf_swevent_get_recursion_context();
> @@ -5822,10 +5858,7 @@ static void perf_pending_event(struct ir
>  	 * and we won't recurse 'further'.
>  	 */
>  
> -	if (event->pending_disable) {
> -		event->pending_disable = 0;
> -		perf_event_disable_local(event);
> -	}
> +	perf_pending_event_disable(event);
>  
>  	if (event->pending_wakeup) {
>  		event->pending_wakeup = 0;
> @@ -10236,6 +10269,7 @@ perf_event_alloc(struct perf_event_attr
>  
>  
>  	init_waitqueue_head(&event->waitq);
> +	event->pending_disable = -1;
>  	init_irq_work(&event->pending, perf_pending_event);
>  
>  	mutex_init(&event->mmap_mutex);
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: WARN_ON_ONCE() hit at kernel/events/core.c:330
2019-04-09 8:53 ` Mark Rutland
@ 2019-04-10 13:51 ` Thomas-Mich Richter
2019-04-10 14:33 ` Peter Zijlstra
0 siblings, 1 reply; 22+ messages in thread
From: Thomas-Mich Richter @ 2019-04-10 13:51 UTC (permalink / raw)
To: Mark Rutland, Peter Zijlstra
Cc: Kees Cook, acme, Linux Kernel Mailing List, Heiko Carstens,
Hendrik Brueckner, Martin Schwidefsky, jolsa
On 4/9/19 10:53 AM, Mark Rutland wrote:
> On Mon, Apr 08, 2019 at 11:50:31AM +0200, Peter Zijlstra wrote:
>> On Mon, Apr 08, 2019 at 10:22:29AM +0200, Peter Zijlstra wrote:
>>> On Mon, Apr 08, 2019 at 09:12:28AM +0200, Thomas-Mich Richter wrote:
>>
.....
>>
>> Instead encode the CPU number in @pending_disable, such that we can
>> tell which CPU requested the disable. This then allows us to detect
>> the above scenario and even redirect the IPI to make up for the failed
>> queue.
>>
>> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: Hendrik Brueckner <brueckner@linux.ibm.com>
>> Cc: acme@redhat.com
>> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
>> Cc: Mark Rutland <mark.rutland@arm.com>
>> Cc: Jiri Olsa <jolsa@redhat.com>
>> Reported-by: Thomas-Mich Richter <tmricht@linux.ibm.com>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>
> I can't think of a nicer way of handling this, so FWIW:
>
> Acked-by: Mark Rutland <mark.rutland@arm.com>
>
> Mark.
>
>> ---
Thanks for the fix with commit id 86071b11317550d994b55ce5e31aa06bcad783b5.
However, doing an fgrep on the pending_disable member of struct perf_event
reveals two more hits in file kernel/events/ring_buffer.c when events
have auxiliary data.
Not sure if this needs to be addressed too, just wanted to let you know.
--
Thomas Richter, Dept 3252, IBM s390 Linux Development, Boeblingen, Germany
--
Chairman of the Supervisory Board: Matthias Hartmann
Management: Dirk Wittkopp
Registered office: Böblingen / Registration court: Amtsgericht Stuttgart, HRB 243294
* Re: WARN_ON_ONCE() hit at kernel/events/core.c:330
2019-04-10 13:51 ` Thomas-Mich Richter
@ 2019-04-10 14:33 ` Peter Zijlstra
2019-04-11 12:06 ` Alexander Shishkin
0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2019-04-10 14:33 UTC (permalink / raw)
To: Thomas-Mich Richter
Cc: Mark Rutland, Kees Cook, acme, Linux Kernel Mailing List,
Heiko Carstens, Hendrik Brueckner, Martin Schwidefsky, jolsa,
Alexander Shishkin
On Wed, Apr 10, 2019 at 03:51:24PM +0200, Thomas-Mich Richter wrote:
> Thanks for the fix with commit id 86071b11317550d994b55ce5e31aa06bcad783b5.
>
> However, doing an fgrep on the pending_disable member of struct perf_event
> reveals two more hits in file kernel/events/ring_buffer.c when events
> have auxiliary data.
>
> Not sure if this needs to be addressed too, just wanted to let you know.
*groan* indeed, and yes that would need fixing too. I completely missed
the AUX stuff using that too.
I think we can simply do the below on top, Alexander?
---
kernel/events/ring_buffer.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 678ccec60d8f..ab2b7b38adc5 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -392,7 +392,7 @@ void *perf_aux_output_begin(struct perf_output_handle *handle,
* store that will be enabled on successful return
*/
if (!handle->size) { /* A, matches D */
- event->pending_disable = 1;
+ event->pending_disable = smp_processor_id();
perf_output_wakeup(handle);
local_set(&rb->aux_nest, 0);
goto err_put;
@@ -480,7 +480,7 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size)
if (wakeup) {
if (handle->aux_flags & PERF_AUX_FLAG_TRUNCATED)
- handle->event->pending_disable = 1;
+ handle->event->pending_disable = smp_processor_id();
perf_output_wakeup(handle);
}
* Re: WARN_ON_ONCE() hit at kernel/events/core.c:330
2019-04-10 14:33 ` Peter Zijlstra
@ 2019-04-11 12:06 ` Alexander Shishkin
0 siblings, 0 replies; 22+ messages in thread
From: Alexander Shishkin @ 2019-04-11 12:06 UTC (permalink / raw)
To: Peter Zijlstra, Thomas-Mich Richter
Cc: Mark Rutland, Kees Cook, acme, Linux Kernel Mailing List,
Heiko Carstens, Hendrik Brueckner, Martin Schwidefsky, jolsa,
alexander.shishkin
Peter Zijlstra <peterz@infradead.org> writes:
> On Wed, Apr 10, 2019 at 03:51:24PM +0200, Thomas-Mich Richter wrote:
>> Thanks for the fix with commit id 86071b11317550d994b55ce5e31aa06bcad783b5.
>>
>> However, doing an fgrep on the pending_disable member of struct perf_event
>> reveals two more hits in file kernel/events/ring_buffer.c when events
>> have auxiliary data.
>>
>> Not sure if this needs to be addressed too, just wanted to let you know.
>
> *groan* indeed, and yes that would need fixing too. I completely missed
> the AUX stuff using that too.
>
> I think we can simply do the below on top, Alexander?
Looks good. FWIW, renaming pending_disable in the first place to match
the fact that it's a CPU# now would probably have caught all these.
> ---
> kernel/events/ring_buffer.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 678ccec60d8f..ab2b7b38adc5 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -392,7 +392,7 @@ void *perf_aux_output_begin(struct perf_output_handle *handle,
> * store that will be enabled on successful return
> */
> if (!handle->size) { /* A, matches D */
> - event->pending_disable = 1;
> + event->pending_disable = smp_processor_id();
> perf_output_wakeup(handle);
> local_set(&rb->aux_nest, 0);
> goto err_put;
> @@ -480,7 +480,7 @@ void perf_aux_output_end(struct perf_output_handle *handle, unsigned long size)
>
> if (wakeup) {
> if (handle->aux_flags & PERF_AUX_FLAG_TRUNCATED)
> - handle->event->pending_disable = 1;
> + handle->event->pending_disable = smp_processor_id();
> perf_output_wakeup(handle);
> }
Thanks,
--
Alex