linux-kernel.vger.kernel.org archive mirror
* [PATCH] perf/core: fix a possible deadlock scenario
@ 2018-07-16 21:51 Cong Wang
  2018-07-18  8:19 ` Jiri Olsa
  0 siblings, 1 reply; 12+ messages in thread
From: Cong Wang @ 2018-07-16 21:51 UTC (permalink / raw)
  To: linux-kernel
  Cc: Cong Wang, Peter Zijlstra, Ingo Molnar, Linus Torvalds,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, stable

hrtimer_cancel() busy-waits for the hrtimer callback to stop,
pretty much like del_timer_sync(). This creates a possible deadlock
scenario when we hold a spinlock before calling hrtimer_cancel()
while the callback tries to acquire the same spinlock.
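
To illustrate the generic pattern, here is a minimal hypothetical
sketch (the names are made up; this is not code from the kernel tree,
and real code would also need to disable IRQs around the lock):

  #include <linux/hrtimer.h>
  #include <linux/spinlock.h>

  static DEFINE_SPINLOCK(demo_lock);
  static struct hrtimer demo_timer;

  static enum hrtimer_restart demo_timer_fn(struct hrtimer *t)
  {
          spin_lock(&demo_lock);          /* spins if the lock is held ... */
          /* ... do some work ... */
          spin_unlock(&demo_lock);
          return HRTIMER_NORESTART;
  }

  static void demo_teardown(void)
  {
          spin_lock(&demo_lock);          /* ... which it is, here ... */
          hrtimer_cancel(&demo_timer);    /* ... while this busy-waits for
                                           * demo_timer_fn() to finish:
                                           * deadlock */
          spin_unlock(&demo_lock);
  }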

This kind of deadlock is already known and is catchable by lockdep:
as with del_timer_sync(), we could add lockdep annotations, but they
are still missing for hrtimer_cancel(). (I have a WIP patch to add
them for hrtimer_cancel(), but it breaks booting.)

There is such a deadlock scenario in kernel/events/core.c too; in
fact, it is a simpler version: the hrtimer callback waits for itself
to finish on the same CPU! It sounds silly, but it is not obvious at
all; it hides deep in the perf event code:

cpu_clock_event_init():
  perf_swevent_init_hrtimer():
    hwc->hrtimer.function = perf_swevent_hrtimer;

perf_swevent_hrtimer():
  __perf_event_overflow():
    __perf_event_account_interrupt():
      perf_adjust_period():
        pmu->stop():
        cpu_clock_event_stop():
          perf_swevent_cancel_hrtimer():
            hrtimer_cancel()

Getting stuck in a timer doesn't sound very scary; in this case,
however, the consequences are a disaster:

perf_event_overflow(), which calls __perf_event_overflow(), is also
called from the NMI handler, so it races with the hrtimer callback,
as disabling IRQs can't possibly disable NMIs. This means that the
hrtimer callback, once interrupted by an NMI handler, could deadlock
within the NMI!

As a further consequence, other interrupt handling is blocked too,
notably the IPI handler, especially when smp_call_function_*() waits
for its callbacks synchronously. This is why we saw so many soft
lockups in smp_call_function_single(), given how widely synchronous
cross-calls are used in the kernel. Ironically, the perf event code
itself uses synchronous smp_call_function_single() heavily.
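
For example, a synchronous cross-call like the one below (illustrative
only; cpu, some_func and some_arg are placeholders) spins until the
target CPU services the IPI, which never happens while that CPU is
stuck in the hrtimer callback with IRQs disabled:

  /* wait == 1: the caller busy-waits until some_func() completes on 'cpu' */
  smp_call_function_single(cpu, some_func, some_arg, 1);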

The fix is not easy. To minimize the impact, ideally we should just
avoid busy-waiting when hrtimer_cancel() is called from within the
hrtimer callback on the same CPU; there is no reason to wait for
itself to finish anyway. It probably doesn't even need to cancel
itself, since it will be restarted by pmu->start() later. There are
two possible fixes here:

1. Modify the hrtimer API to detect whether an hrtimer callback is
currently running on the same CPU. This does not look pretty though.

2. Pass some information from perf_swevent_hrtimer() down to
perf_swevent_cancel_hrtimer().

So I pick the latter approach; it is simple and straightforward.

Note that currently perf_swevent_hrtimer() still races with
perf_event_overflow() in NMI context on the same CPU anyway, given
that there is no locking around it, and locking probably would not
even help. But this is nothing new, and the race itself is not that
bad either; at most we get some inconsistent updates of the event
sample period.

Fixes: abd50713944c ("perf: Reimplement frequency driven sampling")
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 include/linux/perf_event.h |  3 +++
 kernel/events/core.c       | 43 +++++++++++++++++++++++++++----------------
 2 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1fa12887ec02..aab39b8aa720 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -310,6 +310,9 @@ struct pmu {
 #define PERF_EF_START	0x01		/* start the counter when adding    */
 #define PERF_EF_RELOAD	0x02		/* reload the counter when starting */
 #define PERF_EF_UPDATE	0x04		/* update the counter when stopping */
+#define PERF_EF_NO_WAIT	0x08		/* do not wait when stopping, for
+					 * example, waiting for a timer
+					 */
 
 	/*
 	 * Adds/Removes a counter to/from the PMU, can be done inside a
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8f0434a9951a..f15832346b35 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3555,7 +3555,8 @@ do {					\
 static DEFINE_PER_CPU(int, perf_throttled_count);
 static DEFINE_PER_CPU(u64, perf_throttled_seq);
 
-static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bool disable)
+static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count,
+			       bool disable, bool nowait)
 {
 	struct hw_perf_event *hwc = &event->hw;
 	s64 period, sample_period;
@@ -3574,8 +3575,13 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
 	hwc->sample_period = sample_period;
 
 	if (local64_read(&hwc->period_left) > 8*sample_period) {
-		if (disable)
-			event->pmu->stop(event, PERF_EF_UPDATE);
+		if (disable) {
+			int flags = PERF_EF_UPDATE;
+
+			if (nowait)
+				flags |= PERF_EF_NO_WAIT;
+			event->pmu->stop(event, flags);
+		}
 
 		local64_set(&hwc->period_left, 0);
 
@@ -3645,7 +3651,7 @@ static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx,
 		 * twice.
 		 */
 		if (delta > 0)
-			perf_adjust_period(event, period, delta, false);
+			perf_adjust_period(event, period, delta, false, false);
 
 		event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);
 	next:
@@ -7681,7 +7687,8 @@ static void perf_log_itrace_start(struct perf_event *event)
 }
 
 static int
-__perf_event_account_interrupt(struct perf_event *event, int throttle)
+__perf_event_account_interrupt(struct perf_event *event, int throttle,
+			       bool nowait)
 {
 	struct hw_perf_event *hwc = &event->hw;
 	int ret = 0;
@@ -7710,7 +7717,8 @@ __perf_event_account_interrupt(struct perf_event *event, int throttle)
 		hwc->freq_time_stamp = now;
 
 		if (delta > 0 && delta < 2*TICK_NSEC)
-			perf_adjust_period(event, delta, hwc->last_period, true);
+			perf_adjust_period(event, delta, hwc->last_period, true,
+					   nowait);
 	}
 
 	return ret;
@@ -7718,7 +7726,7 @@ __perf_event_account_interrupt(struct perf_event *event, int throttle)
 
 int perf_event_account_interrupt(struct perf_event *event)
 {
-	return __perf_event_account_interrupt(event, 1);
+	return __perf_event_account_interrupt(event, 1, false);
 }
 
 /*
@@ -7727,7 +7735,7 @@ int perf_event_account_interrupt(struct perf_event *event)
 
 static int __perf_event_overflow(struct perf_event *event,
 				   int throttle, struct perf_sample_data *data,
-				   struct pt_regs *regs)
+				   struct pt_regs *regs, bool nowait)
 {
 	int events = atomic_read(&event->event_limit);
 	int ret = 0;
@@ -7739,7 +7747,7 @@ static int __perf_event_overflow(struct perf_event *event,
 	if (unlikely(!is_sampling_event(event)))
 		return 0;
 
-	ret = __perf_event_account_interrupt(event, throttle);
+	ret = __perf_event_account_interrupt(event, throttle, nowait);
 
 	/*
 	 * XXX event_limit might not quite work as expected on inherited
@@ -7768,7 +7776,7 @@ int perf_event_overflow(struct perf_event *event,
 			  struct perf_sample_data *data,
 			  struct pt_regs *regs)
 {
-	return __perf_event_overflow(event, 1, data, regs);
+	return __perf_event_overflow(event, 1, data, regs, true);
 }
 
 /*
@@ -7831,7 +7839,7 @@ static void perf_swevent_overflow(struct perf_event *event, u64 overflow,
 
 	for (; overflow; overflow--) {
 		if (__perf_event_overflow(event, throttle,
-					    data, regs)) {
+					    data, regs, false)) {
 			/*
 			 * We inhibit the overflow from happening when
 			 * hwc->interrupts == MAX_INTERRUPTS.
@@ -9110,7 +9118,7 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
 
 	if (regs && !perf_exclude_event(event, regs)) {
 		if (!(event->attr.exclude_idle && is_idle_task(current)))
-			if (__perf_event_overflow(event, 1, &data, regs))
+			if (__perf_event_overflow(event, 1, &data, regs, true))
 				ret = HRTIMER_NORESTART;
 	}
 
@@ -9141,7 +9149,7 @@ static void perf_swevent_start_hrtimer(struct perf_event *event)
 		      HRTIMER_MODE_REL_PINNED);
 }
 
-static void perf_swevent_cancel_hrtimer(struct perf_event *event)
+static void perf_swevent_cancel_hrtimer(struct perf_event *event, bool sync)
 {
 	struct hw_perf_event *hwc = &event->hw;
 
@@ -9149,7 +9157,10 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
 		ktime_t remaining = hrtimer_get_remaining(&hwc->hrtimer);
 		local64_set(&hwc->period_left, ktime_to_ns(remaining));
 
-		hrtimer_cancel(&hwc->hrtimer);
+		if (sync)
+			hrtimer_cancel(&hwc->hrtimer);
+		else
+			hrtimer_try_to_cancel(&hwc->hrtimer);
 	}
 }
 
@@ -9200,7 +9211,7 @@ static void cpu_clock_event_start(struct perf_event *event, int flags)
 
 static void cpu_clock_event_stop(struct perf_event *event, int flags)
 {
-	perf_swevent_cancel_hrtimer(event);
+	perf_swevent_cancel_hrtimer(event, flags & PERF_EF_NO_WAIT);
 	cpu_clock_event_update(event);
 }
 
@@ -9277,7 +9288,7 @@ static void task_clock_event_start(struct perf_event *event, int flags)
 
 static void task_clock_event_stop(struct perf_event *event, int flags)
 {
-	perf_swevent_cancel_hrtimer(event);
+	perf_swevent_cancel_hrtimer(event, flags & PERF_EF_NO_WAIT);
 	task_clock_event_update(event, event->ctx->time);
 }
 
-- 
2.14.4



* Re: [PATCH] perf/core: fix a possible deadlock scenario
  2018-07-16 21:51 [PATCH] perf/core: fix a possible deadlock scenario Cong Wang
@ 2018-07-18  8:19 ` Jiri Olsa
  2018-07-18 20:21   ` Cong Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Jiri Olsa @ 2018-07-18  8:19 UTC (permalink / raw)
  To: Cong Wang
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Linus Torvalds,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Namhyung Kim,
	stable

On Mon, Jul 16, 2018 at 02:51:01PM -0700, Cong Wang wrote:
> hrtimer_cancel() busy-waits for the hrtimer callback to stop,
> pretty much like del_timer_sync(). This creates a possible deadlock
> scenario where we hold a spinlock before calling hrtimer_cancel()
> while trying to acquire the same spinlock in the callback.
> 
> This kind of deadlock is already known and is catchable by lockdep,
> like for del_timer_sync(), we can add lockdep annotations. However,
> it is still missing for hrtimer_cancel(). (I have a WIP patch to make
> it complete for hrtimer_cancel() but it breaks booting.)
> 
> And there is such a deadlock scenario in kernel/events/core.c too,
> well actually, it is a simpler version: the hrtimer callback waits
> for itself to finish on the same CPU! It sounds stupid but it is
> not obvious at all, it hides very deeply in the perf event code:
> 
> cpu_clock_event_init():
>   perf_swevent_init_hrtimer():
>     hwc->hrtimer.function = perf_swevent_hrtimer;
> 
> perf_swevent_hrtimer():
>   __perf_event_overflow():
>     __perf_event_account_interrupt():
>       perf_adjust_period():
>         pmu->stop():
>         cpu_clock_event_stop():
>           perf_swevent_cancel():
>             hrtimer_cancel()
> 
> Getting stuck in a timer doesn't sound very scary, however, in this

sounds scary enough to me ;-) were you able to hit it?

> case, its consequences are a disaster:
> 
> perf_event_overflow() which calls __perf_event_overflow() is called
> in NMI handler too, so it is racy with hrtimer callback as disabling
> IRQ can't possibly disable NMI. This means this hrtimer callback
> once interrupted by an NMI handler could deadlock within NMI!

hum, the swevent pmu does not trigger NMIs, so that timer
will never be touched in NMI context

jirka


* Re: [PATCH] perf/core: fix a possible deadlock scenario
  2018-07-18  8:19 ` Jiri Olsa
@ 2018-07-18 20:21   ` Cong Wang
  2018-07-19 13:28     ` Jiri Olsa
  0 siblings, 1 reply; 12+ messages in thread
From: Cong Wang @ 2018-07-18 20:21 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Linus Torvalds,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Namhyung Kim,
	stable

On Wed, Jul 18, 2018 at 1:19 AM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Mon, Jul 16, 2018 at 02:51:01PM -0700, Cong Wang wrote:
> > hrtimer_cancel() busy-waits for the hrtimer callback to stop,
> > pretty much like del_timer_sync(). This creates a possible deadlock
> > scenario where we hold a spinlock before calling hrtimer_cancel()
> > while trying to acquire the same spinlock in the callback.
> >
> > This kind of deadlock is already known and is catchable by lockdep,
> > like for del_timer_sync(), we can add lockdep annotations. However,
> > it is still missing for hrtimer_cancel(). (I have a WIP patch to make
> > it complete for hrtimer_cancel() but it breaks booting.)
> >
> > And there is such a deadlock scenario in kernel/events/core.c too,
> > well actually, it is a simpler version: the hrtimer callback waits
> > for itself to finish on the same CPU! It sounds stupid but it is
> > not obvious at all, it hides very deeply in the perf event code:
> >
> > cpu_clock_event_init():
> >   perf_swevent_init_hrtimer():
> >     hwc->hrtimer.function = perf_swevent_hrtimer;
> >
> > perf_swevent_hrtimer():
> >   __perf_event_overflow():
> >     __perf_event_account_interrupt():
> >       perf_adjust_period():
> >         pmu->stop():
> >         cpu_clock_event_stop():
> >           perf_swevent_cancel():
> >             hrtimer_cancel()
> >
> > Getting stuck in a timer doesn't sound very scary, however, in this
>
> sound scary enough for me ;-) were you able to hit it?

Yeah, we have seen hundreds of soft lockups in smp function calls,
which are probably caused by this bug. It dates back to the 3.10
kernel, seriously. :) Unfortunately we don't have a reproducer, so we
can't verify whether this patch really fixes those soft lockups until
we deploy the fix to probably thousands of machines in the data center.


>
> > case, its consequences are a disaster:
> >
> > perf_event_overflow() which calls __perf_event_overflow() is called
> > in NMI handler too, so it is racy with hrtimer callback as disabling
> > IRQ can't possibly disable NMI. This means this hrtimer callback
> > once interrupted by an NMI handler could deadlock within NMI!
>
> hum, the swevent pmu does not triger NMI, so that timer
> will never be touched in NMI context

Good to know! So I worried too much here.

But still, given that the hrtimer interrupt runs with IRQs disabled,
getting stuck in an hrtimer callback could also block the IPI
handler, therefore causing soft lockups.

Let me know if you want me to adjust the changelog for this.

Thanks!


* Re: [PATCH] perf/core: fix a possible deadlock scenario
  2018-07-18 20:21   ` Cong Wang
@ 2018-07-19 13:28     ` Jiri Olsa
  2018-07-19 19:12       ` [PATCH v2] " Cong Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Jiri Olsa @ 2018-07-19 13:28 UTC (permalink / raw)
  To: Cong Wang
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Linus Torvalds,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Namhyung Kim,
	stable

On Wed, Jul 18, 2018 at 01:21:11PM -0700, Cong Wang wrote:

SNIP

> > hum, the swevent pmu does not trigger NMIs, so that timer
> > will never be touched in NMI context
> 
> Good to know! So I worry too much here.
> 
> But still, given hrtimer interrupt is called with IRQ disabled, getting
> stuck in a hrtimer callback could also block IPI handler, therefore
> causing soft lockups.
> 
> Let me know if you want me to adjust the changelog for this.

yep, it's confusing, I think it should be removed from changelog

jirka


* [PATCH v2] perf/core: fix a possible deadlock scenario
  2018-07-19 13:28     ` Jiri Olsa
@ 2018-07-19 19:12       ` Cong Wang
  2018-07-20 11:52         ` Peter Zijlstra
  0 siblings, 1 reply; 12+ messages in thread
From: Cong Wang @ 2018-07-19 19:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Cong Wang, Peter Zijlstra, Ingo Molnar, Linus Torvalds,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, stable

hrtimer_cancel() busy-waits for the hrtimer callback to stop,
pretty much like del_timer_sync(). This creates a possible deadlock
scenario when we hold a spinlock before calling hrtimer_cancel()
while the callback tries to acquire the same spinlock.

This kind of deadlock is already known and is catchable by lockdep:
as with del_timer_sync(), we could add lockdep annotations, but they
are still missing for hrtimer_cancel(). (I have a WIP patch to add
them for hrtimer_cancel(), but it breaks booting.)

There is such a deadlock scenario in kernel/events/core.c too; in
fact, it is a simpler version: the hrtimer callback waits for itself
to finish on the same CPU! It sounds silly, but it is not obvious at
all; it hides deep in the perf event code:

cpu_clock_event_init():
  perf_swevent_init_hrtimer():
    hwc->hrtimer.function = perf_swevent_hrtimer;

perf_swevent_hrtimer():
  __perf_event_overflow():
    __perf_event_account_interrupt():
      perf_adjust_period():
        pmu->stop():
        cpu_clock_event_stop():
          perf_swevent_cancel_hrtimer():
            hrtimer_cancel()

Getting stuck in an hrtimer callback is a disaster:

Interrupts are disabled across hrtimer_interrupt(), which means that
other interrupt handling is blocked too, notably the IPI handler,
especially when smp_call_function_*() waits for its callbacks
synchronously. This is probably why we saw hundreds of soft lockups
inside smp_call_function_single(), given how widely synchronous
cross-calls are used in the kernel. Ironically, the perf event code
itself uses synchronous smp_call_function_single() heavily.

The fix is not easy. To minimize the impact, ideally we should just
avoid busy-waiting when hrtimer_cancel() is called from within the
hrtimer callback on the same CPU; there is no reason to wait for
itself to finish anyway. It probably doesn't even need to cancel
itself, since it will be restarted by pmu->start() later. There are
two possible fixes here:

1. Modify the hrtimer API to detect whether an hrtimer callback is
currently running on the same CPU (see the sketch below). It does
not look pretty to adjust the API only for this special case, though.

2. Pass some information from perf_swevent_hrtimer() down to
perf_swevent_cancel_hrtimer().

So I pick the latter approach; it is nicer and more straightforward.
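
For reference, a naive sketch of the first option is below. It is
hypothetical and not part of this patch; note that the existing
hrtimer_callback_running() helper only says the callback is running
somewhere, not that we *are* that callback on this CPU, which is why
the hrtimer API itself would need adjusting:

  static void perf_swevent_cancel_hrtimer(struct perf_event *event)
  {
          struct hw_perf_event *hwc = &event->hw;

          if (is_sampling_event(event)) {
                  ktime_t remaining = hrtimer_get_remaining(&hwc->hrtimer);

                  local64_set(&hwc->period_left, ktime_to_ns(remaining));

                  /* Naive: this also skips the wait when the callback
                   * is running on another CPU, which is not what we
                   * want for a normal cross-CPU stop.
                   */
                  if (hrtimer_callback_running(&hwc->hrtimer))
                          hrtimer_try_to_cancel(&hwc->hrtimer);
                  else
                          hrtimer_cancel(&hwc->hrtimer);
          }
  }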

Note: Jiri pointed out that the swevent pmu does not trigger NMIs
and that this timer will never be touched in NMI context, so we do
not race with the NMI handler here.

Fixes: abd50713944c ("perf: Reimplement frequency driven sampling")
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 include/linux/perf_event.h |  3 +++
 kernel/events/core.c       | 43 +++++++++++++++++++++++++++----------------
 2 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1fa12887ec02..aab39b8aa720 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -310,6 +310,9 @@ struct pmu {
 #define PERF_EF_START	0x01		/* start the counter when adding    */
 #define PERF_EF_RELOAD	0x02		/* reload the counter when starting */
 #define PERF_EF_UPDATE	0x04		/* update the counter when stopping */
+#define PERF_EF_NO_WAIT	0x08		/* do not wait when stopping, for
+					 * example, waiting for a timer
+					 */
 
 	/*
 	 * Adds/Removes a counter to/from the PMU, can be done inside a
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8f0434a9951a..f15832346b35 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3555,7 +3555,8 @@ do {					\
 static DEFINE_PER_CPU(int, perf_throttled_count);
 static DEFINE_PER_CPU(u64, perf_throttled_seq);
 
-static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bool disable)
+static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count,
+			       bool disable, bool nowait)
 {
 	struct hw_perf_event *hwc = &event->hw;
 	s64 period, sample_period;
@@ -3574,8 +3575,13 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
 	hwc->sample_period = sample_period;
 
 	if (local64_read(&hwc->period_left) > 8*sample_period) {
-		if (disable)
-			event->pmu->stop(event, PERF_EF_UPDATE);
+		if (disable) {
+			int flags = PERF_EF_UPDATE;
+
+			if (nowait)
+				flags |= PERF_EF_NO_WAIT;
+			event->pmu->stop(event, flags);
+		}
 
 		local64_set(&hwc->period_left, 0);
 
@@ -3645,7 +3651,7 @@ static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx,
 		 * twice.
 		 */
 		if (delta > 0)
-			perf_adjust_period(event, period, delta, false);
+			perf_adjust_period(event, period, delta, false, false);
 
 		event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);
 	next:
@@ -7681,7 +7687,8 @@ static void perf_log_itrace_start(struct perf_event *event)
 }
 
 static int
-__perf_event_account_interrupt(struct perf_event *event, int throttle)
+__perf_event_account_interrupt(struct perf_event *event, int throttle,
+			       bool nowait)
 {
 	struct hw_perf_event *hwc = &event->hw;
 	int ret = 0;
@@ -7710,7 +7717,8 @@ __perf_event_account_interrupt(struct perf_event *event, int throttle)
 		hwc->freq_time_stamp = now;
 
 		if (delta > 0 && delta < 2*TICK_NSEC)
-			perf_adjust_period(event, delta, hwc->last_period, true);
+			perf_adjust_period(event, delta, hwc->last_period, true,
+					   nowait);
 	}
 
 	return ret;
@@ -7718,7 +7726,7 @@ __perf_event_account_interrupt(struct perf_event *event, int throttle)
 
 int perf_event_account_interrupt(struct perf_event *event)
 {
-	return __perf_event_account_interrupt(event, 1);
+	return __perf_event_account_interrupt(event, 1, false);
 }
 
 /*
@@ -7727,7 +7735,7 @@ int perf_event_account_interrupt(struct perf_event *event)
 
 static int __perf_event_overflow(struct perf_event *event,
 				   int throttle, struct perf_sample_data *data,
-				   struct pt_regs *regs)
+				   struct pt_regs *regs, bool nowait)
 {
 	int events = atomic_read(&event->event_limit);
 	int ret = 0;
@@ -7739,7 +7747,7 @@ static int __perf_event_overflow(struct perf_event *event,
 	if (unlikely(!is_sampling_event(event)))
 		return 0;
 
-	ret = __perf_event_account_interrupt(event, throttle);
+	ret = __perf_event_account_interrupt(event, throttle, nowait);
 
 	/*
 	 * XXX event_limit might not quite work as expected on inherited
@@ -7768,7 +7776,7 @@ int perf_event_overflow(struct perf_event *event,
 			  struct perf_sample_data *data,
 			  struct pt_regs *regs)
 {
-	return __perf_event_overflow(event, 1, data, regs);
+	return __perf_event_overflow(event, 1, data, regs, true);
 }
 
 /*
@@ -7831,7 +7839,7 @@ static void perf_swevent_overflow(struct perf_event *event, u64 overflow,
 
 	for (; overflow; overflow--) {
 		if (__perf_event_overflow(event, throttle,
-					    data, regs)) {
+					    data, regs, false)) {
 			/*
 			 * We inhibit the overflow from happening when
 			 * hwc->interrupts == MAX_INTERRUPTS.
@@ -9110,7 +9118,7 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
 
 	if (regs && !perf_exclude_event(event, regs)) {
 		if (!(event->attr.exclude_idle && is_idle_task(current)))
-			if (__perf_event_overflow(event, 1, &data, regs))
+			if (__perf_event_overflow(event, 1, &data, regs, true))
 				ret = HRTIMER_NORESTART;
 	}
 
@@ -9141,7 +9149,7 @@ static void perf_swevent_start_hrtimer(struct perf_event *event)
 		      HRTIMER_MODE_REL_PINNED);
 }
 
-static void perf_swevent_cancel_hrtimer(struct perf_event *event)
+static void perf_swevent_cancel_hrtimer(struct perf_event *event, bool sync)
 {
 	struct hw_perf_event *hwc = &event->hw;
 
@@ -9149,7 +9157,10 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
 		ktime_t remaining = hrtimer_get_remaining(&hwc->hrtimer);
 		local64_set(&hwc->period_left, ktime_to_ns(remaining));
 
-		hrtimer_cancel(&hwc->hrtimer);
+		if (sync)
+			hrtimer_cancel(&hwc->hrtimer);
+		else
+			hrtimer_try_to_cancel(&hwc->hrtimer);
 	}
 }
 
@@ -9200,7 +9211,7 @@ static void cpu_clock_event_start(struct perf_event *event, int flags)
 
 static void cpu_clock_event_stop(struct perf_event *event, int flags)
 {
-	perf_swevent_cancel_hrtimer(event);
+	perf_swevent_cancel_hrtimer(event, flags & PERF_EF_NO_WAIT);
 	cpu_clock_event_update(event);
 }
 
@@ -9277,7 +9288,7 @@ static void task_clock_event_start(struct perf_event *event, int flags)
 
 static void task_clock_event_stop(struct perf_event *event, int flags)
 {
-	perf_swevent_cancel_hrtimer(event);
+	perf_swevent_cancel_hrtimer(event, flags & PERF_EF_NO_WAIT);
 	task_clock_event_update(event, event->ctx->time);
 }
 
-- 
2.14.4



* Re: [PATCH v2] perf/core: fix a possible deadlock scenario
  2018-07-19 19:12       ` [PATCH v2] " Cong Wang
@ 2018-07-20 11:52         ` Peter Zijlstra
  2018-07-23 23:16           ` Cong Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2018-07-20 11:52 UTC (permalink / raw)
  To: Cong Wang
  Cc: linux-kernel, Ingo Molnar, Linus Torvalds,
	Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, stable

On Thu, Jul 19, 2018 at 12:12:53PM -0700, Cong Wang wrote:
> hrtimer_cancel() busy-waits for the hrtimer callback to stop,
> pretty much like del_timer_sync(). This creates a possible deadlock
> scenario where we hold a spinlock before calling hrtimer_cancel()
> while trying to acquire the same spinlock in the callback.

Has this actually been observed?

> cpu_clock_event_init():
>   perf_swevent_init_hrtimer():
>     hwc->hrtimer.function = perf_swevent_hrtimer;
> 
> perf_swevent_hrtimer():
>   __perf_event_overflow():
>     __perf_event_account_interrupt():
>       perf_adjust_period():
>         pmu->stop():
>         cpu_clock_event_stop():
>           perf_swevent_cancel():
>             hrtimer_cancel()

Please explain how a hrtimer event ever gets to perf_adjust_period().
Last I checked perf_swevent_init_hrtimer() results in attr.freq=0.

> Getting stuck in an hrtimer is a disaster:

You'll get NMI watchdog splats. Getting stuck in NMI context is far more
'interesting' :-)

> +#define PERF_EF_NO_WAIT	0x08		/* do not wait when stopping, for
> +					 * example, waiting for a timer
> +					 */

That's a broken comment style.


* Re: [PATCH v2] perf/core: fix a possible deadlock scenario
  2018-07-20 11:52         ` Peter Zijlstra
@ 2018-07-23 23:16           ` Cong Wang
  2018-07-24  1:35             ` Cong Wang
  2018-07-24  9:12             ` Peter Zijlstra
  0 siblings, 2 replies; 12+ messages in thread
From: Cong Wang @ 2018-07-23 23:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Linus Torvalds, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, stable

On Fri, Jul 20, 2018 at 4:52 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Jul 19, 2018 at 12:12:53PM -0700, Cong Wang wrote:
> > hrtimer_cancel() busy-waits for the hrtimer callback to stop,
> > pretty much like del_timer_sync(). This creates a possible deadlock
> > scenario where we hold a spinlock before calling hrtimer_cancel()
> > while trying to acquire the same spinlock in the callback.
>
> Has this actually been observed?


Without lockdep annotation, it is not easy to observe.


>
> > cpu_clock_event_init():
> >   perf_swevent_init_hrtimer():
> >     hwc->hrtimer.function = perf_swevent_hrtimer;
> >
> > perf_swevent_hrtimer():
> >   __perf_event_overflow():
> >     __perf_event_account_interrupt():
> >       perf_adjust_period():
> >         pmu->stop():
> >         cpu_clock_event_stop():
> >           perf_swevent_cancel():
> >             hrtimer_cancel()
>
> Please explain how a hrtimer event ever gets to perf_adjust_period().
> Last I checked perf_swevent_init_hrtimer() results in attr.freq=0.

Good point.

I thought attr.freq was specified by user-space, but it seems
perf_swevent_init_hrtimer() clears it on purpose and it will not
change after initialization. Interesting...
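
For reference, the relevant logic in perf_swevent_init_hrtimer() is
roughly the following (paraphrased from memory, not a verbatim quote;
details may differ):

  /* hrtimers have a fixed rate, so a static freq->period mapping is
   * done once and the period-adjust feedback machinery is skipped */
  if (event->attr.freq) {
          long freq = event->attr.sample_freq;

          event->attr.sample_period = NSEC_PER_SEC / freq;
          hwc->sample_period = event->attr.sample_period;
          ...
          event->attr.freq = 0;
  }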


>
> > Getting stuck in an hrtimer is a disaster:
>
> You'll get NMI watchdog splats. Getting stuck in NMI context is far more
> 'interesting :-)

Yes, I did see a stack trace in perf_swevent_hrtimer() which led me
here. But I have to admit that among those hundreds of soft lockups,
I only saw one showing the swevent hrtimer backtrace.

Previously I thought this was because of an NMI handler race, but
Jiri pointed out that the race doesn't exist.


>
> > +#define PERF_EF_NO_WAIT      0x08            /* do not wait when stopping, for
> > +                                      * example, waiting for a timer
> > +                                      */
>
> That's a broken comment style.

It is picked by checkpatch.pl, not me; I chose a different style and
got a complaint. :)

Thanks!


* Re: [PATCH v2] perf/core: fix a possible deadlock scenario
  2018-07-23 23:16           ` Cong Wang
@ 2018-07-24  1:35             ` Cong Wang
  2018-07-24  1:44               ` Cong Wang
  2018-07-24  9:12             ` Peter Zijlstra
  1 sibling, 1 reply; 12+ messages in thread
From: Cong Wang @ 2018-07-24  1:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Linus Torvalds, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andi Kleen

Hi, Peter, Andi

While reviewing the deadlock, I found that it looks like we could
hit the following infinite recursion too:

perf_event_account_interrupt()
__perf_event_account_interrupt()
perf_adjust_period()
event->pmu->stop
x86_pmu_stop()
x86_pmu.disable()
intel_pmu_disable_event()
intel_pmu_pebs_disable()
intel_pmu_drain_pebs_buffer()
intel_pmu_drain_pebs_nhm()
<repeat....>

This time it involves pure hardware events; attr.freq must be non-zero.

And we could enter this infinite recursion from the NMI handler too:

intel_pmu_handle_irq()
perf_event_overflow()
__perf_event_overflow()
__perf_event_account_interrupt()
....

Or is this impossible too?

Thanks!


* Re: [PATCH v2] perf/core: fix a possible deadlock scenario
  2018-07-24  1:35             ` Cong Wang
@ 2018-07-24  1:44               ` Cong Wang
  2018-07-24  9:18                 ` Peter Zijlstra
  0 siblings, 1 reply; 12+ messages in thread
From: Cong Wang @ 2018-07-24  1:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Linus Torvalds, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andi Kleen

On Mon, Jul 23, 2018 at 6:35 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> Hi, Peter, Andi
>
> While reviewing the deadlock, I find out it looks like we could have the
> following infinite recursion too:
>
> perf_event_account_interrupt()
> __perf_event_account_interrupt()
> perf_adjust_period()
> event->pmu->stop
> x86_pmu_stop()
> x86_pmu.disable()

Hmm, x86_pmu_stop() calls __test_and_clear_bit(), so
we should not call x86_pmu.disable() twice here.
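
(Roughly, paraphrased from memory rather than quoted, with cpuc being
the per-CPU cpu_hw_events and some lines elided, the guard looks like
this, so the disable path cannot run twice for the same event:)

  if (__test_and_clear_bit(hwc->idx, cpuc->active_mask)) {
          x86_pmu.disable(event);
          ...
          hwc->state |= PERF_HES_STOPPED;
  }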



> intel_pmu_disable_event()
> intel_pmu_pebs_disable()
> intel_pmu_drain_pebs_buffer()
> intel_pmu_drain_pebs_nhm()
> <repeat....>
>
> This time is pure hardware events, attr.freq must be non-zero.
>
> And, we could enter this infinite recursion in NMI handler too:
>
> intel_pmu_handle_irq()
> perf_event_overflow()
> __perf_event_overflow()
> __perf_event_account_interrupt()
> ....
>
> Or this is impossible too?
>
> Thanks!


* Re: [PATCH v2] perf/core: fix a possible deadlock scenario
  2018-07-23 23:16           ` Cong Wang
  2018-07-24  1:35             ` Cong Wang
@ 2018-07-24  9:12             ` Peter Zijlstra
  1 sibling, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2018-07-24  9:12 UTC (permalink / raw)
  To: Cong Wang
  Cc: LKML, Ingo Molnar, Linus Torvalds, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, stable

On Mon, Jul 23, 2018 at 04:16:23PM -0700, Cong Wang wrote:

> > > +#define PERF_EF_NO_WAIT      0x08            /* do not wait when stopping, for
> > > +                                      * example, waiting for a timer
> > > +                                      */
> >
> > That's a broken comment style.
> 
> It is picked by checkpatch.pl, not me, I chose a different one and got
> a complain. :)

Ignore checkpatch when it's wrong ;-)


* Re: [PATCH v2] perf/core: fix a possible deadlock scenario
  2018-07-24  1:44               ` Cong Wang
@ 2018-07-24  9:18                 ` Peter Zijlstra
  2018-07-25 18:44                   ` Cong Wang
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2018-07-24  9:18 UTC (permalink / raw)
  To: Cong Wang
  Cc: LKML, Ingo Molnar, Linus Torvalds, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andi Kleen

On Mon, Jul 23, 2018 at 06:44:43PM -0700, Cong Wang wrote:
> On Mon, Jul 23, 2018 at 6:35 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > Hi, Peter, Andi
> >
> > While reviewing the deadlock, I find out it looks like we could have the
> > following infinite recursion too:
> >
> > perf_event_account_interrupt()
> > __perf_event_account_interrupt()
> > perf_adjust_period()
> > event->pmu->stop
> > x86_pmu_stop()
> > x86_pmu.disable()
> 
> Hmm, x86_pmu_stop() calls __test_and_clear_bit(), so
> we should not call x86_pmu.disable() twice here.

Right, but since we set HES_UPTODATE after calling
x86_perf_event_update(), that can in fact recurse.

Now, I don't think that'll be fatal, but it might be good to test that.

If you pick these patches:

  https://lkml.kernel.org/r/20170928121823.430053219@infradead.org

use force_early_printk (and actually configure a serial early_printk)
and put a printk() in x86_pmu_stop() and then run the perf_fuzzer or
something to try and reproduce.

But paranoia suggests moving that HES_UPTODATE thing one line up.
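
(A rough sketch of what this refers to, paraphrased from memory
rather than quoted from arch/x86/events/core.c; the ordering, not the
exact code, is the point:)

  if ((flags & PERF_EF_UPDATE) && !(hwc->state & PERF_HES_UPTODATE)) {
          /* if this re-enters x86_pmu_stop(), the !UPTODATE check
           * above still passes and the update runs again; setting
           * the flag before the update would stop that */
          x86_perf_event_update(event);
          hwc->state |= PERF_HES_UPTODATE;
  }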

> > intel_pmu_disable_event()
> > intel_pmu_pebs_disable()
> > intel_pmu_drain_pebs_buffer()
> > intel_pmu_drain_pebs_nhm()
> > <repeat....>
> >
> > This time is pure hardware events, attr.freq must be non-zero.
> >
> > And, we could enter this infinite recursion in NMI handler too:
> >
> > intel_pmu_handle_irq()
> > perf_event_overflow()
> > __perf_event_overflow()
> > __perf_event_account_interrupt()
> > ....
> >
> > Or this is impossible too?

I'm not sure I see this second one.. can you be a little more specific?


* Re: [PATCH v2] perf/core: fix a possible deadlock scenario
  2018-07-24  9:18                 ` Peter Zijlstra
@ 2018-07-25 18:44                   ` Cong Wang
  0 siblings, 0 replies; 12+ messages in thread
From: Cong Wang @ 2018-07-25 18:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Linus Torvalds, Arnaldo Carvalho de Melo,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Andi Kleen

On Tue, Jul 24, 2018 at 2:18 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jul 23, 2018 at 06:44:43PM -0700, Cong Wang wrote:
> > On Mon, Jul 23, 2018 at 6:35 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > Hi, Peter, Andi
> > >
> > > While reviewing the deadlock, I find out it looks like we could have the
> > > following infinite recursion too:
> > >
> > > perf_event_account_interrupt()
> > > __perf_event_account_interrupt()
> > > perf_adjust_period()
> > > event->pmu->stop
> > > x86_pmu_stop()
> > > x86_pmu.disable()
> >
> > Hmm, x86_pmu_stop() calls __test_and_clear_bit(), so
> > we should not call x86_pmu.disable() twice here.
>
> Right, but since we set HES_UPTODATE after calling
> x86_perf_event_update() that can in fact recurse.

I don't see how the HES_UPTODATE flag or x86_perf_event_update()
could affect the path in this call chain.


>
> Now, I don't think that'll be fatal, but it might be good to test that.
>
> If you pick these patches:
>
>   https://lkml.kernel.org/r/20170928121823.430053219@infradead.org
>
> use force_early_printk (and actually configure a serial early_printk)
> and put a printk() in x86_pmu_stop() and then run the perf_fuzzer or
> something to try and reproduce.

Is this patchset meant to make printk() work in NMI context?
But printk() is already used in NMI context; see perf_event_print_debug(),
which is called from intel_pmu_handle_irq().


>
> But paranoia suggets moving that HES_UPTODATE thing one line up.
>
> > > intel_pmu_disable_event()
> > > intel_pmu_pebs_disable()
> > > intel_pmu_drain_pebs_buffer()
> > > intel_pmu_drain_pebs_nhm()
> > > <repeat....>
> > >
> > > This time is pure hardware events, attr.freq must be non-zero.
> > >
> > > And, we could enter this infinite recursion in NMI handler too:
> > >
> > > intel_pmu_handle_irq()
> > > perf_event_overflow()
> > > __perf_event_overflow()
> > > __perf_event_account_interrupt()
> > > ....
> > >
> > > Or this is impossible too?
>
> I'm not sure I see this second one.. can you be a little more specific?

In fact, it is this:

intel_pmu_handle_irq()
x86_pmu.drain_pebs()
intel_pmu_drain_pebs_nhm()
perf_event_account_interrupt()
__perf_event_account_interrupt()
perf_adjust_period()
event->pmu->stop()
x86_pmu_stop()
x86_pmu.disable()
intel_pmu_disable_event()
intel_pmu_pebs_disable()
intel_pmu_drain_pebs_buffer()
<repeat....>


Thanks!
