* [PATCH 0/3] hrtimer (related) fixes
@ 2015-04-15  9:41 Peter Zijlstra
  2015-04-15  9:41 ` [PATCH 1/3] hrtimer: Fix race between hrtimer_start() and __run_hrtimer() Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-15  9:41 UTC (permalink / raw)
  To: tglx; +Cc: mingo, peterz, eranian, linux-kernel

Hi Thomas,

I spoke to you about these patches/bugs a fair few months ago but never got
around to properly fixing them. With your recent hrtimer work it got a lot
simpler, thanks!

So here are 3 patches, the first of which fixes a hrtimer bug that should be
much easier to expose after the next two patches.

The second one attempts to clean up some brainmelt in the CFS bandwidth timer;
the 'problem' there is that we do not want to lose either the timer or a period.

 - one could lose the timer by observing it active and not (re)starting it while
   the callback had already decided to NORESTART.
 - one could lose a period by having the restart forward the expiry time such
   that the handler would not observe the overrun and properly account for it.

This is now all solved by not forwarding the timer from (re)start when active
and then unconditionally starting it. Yay for being able to start expired
timers -- note that it should still be very rare, since when the timer is
active it will most likely have just forwarded the expiry time from the
handler.
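
In code, the resulting (re)start pattern is the below -- a condensed view of
what patch 2 turns start_bandwidth_timer() into (see the diff there for the
real thing):

	void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
	{
		/*
		 * Only forward an inactive timer; forwarding an active one
		 * would eat an overrun the handler still has to account for.
		 */
		if (!hrtimer_active(period_timer))
			hrtimer_forward_now(period_timer, period);

		/* Unconditionally (re)start, even if the expiry is in the past. */
		hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED);
	}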

But note that unconditionally starting the timer would have easily triggered
the bug fixed by the first patch.

The third patch is an alternative to the perf patch you sent out a few days
ago; it got stuck in this queue because it too suffers from these problems,
albeit at a much reduced risk because the code in question is 'never' triggered
(as demonstrated by this fail being there forever).



* [PATCH 1/3] hrtimer: Fix race between hrtimer_start() and __run_hrtimer()
  2015-04-15  9:41 [PATCH 0/3] hrtimer (related) fixes Peter Zijlstra
@ 2015-04-15  9:41 ` Peter Zijlstra
  2015-04-15 10:26   ` Thomas Gleixner
  2015-04-15  9:41 ` [PATCH 2/3] sched: Cleanup bandwidth timers Peter Zijlstra
  2015-04-15  9:41 ` [PATCH 3/3] perf: Fix mux_interval hrtimer wreckage Peter Zijlstra
  2 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-15  9:41 UTC (permalink / raw)
  To: tglx
  Cc: mingo, linux-kernel, Ben Segall, Roman Gushchin, Paul Turner,
	Peter Zijlstra

[-- Attachment #1: peterz-hrtimer-hrtimer_start-vs-function.patch --]
[-- Type: text/plain, Size: 2172 bytes --]

Because we drop cpu_base->lock around calling hrtimer::function, it is
possible for hrtimer_start() to come in between and enqueue the timer.

If hrtimer::function then returns HRTIMER_RESTART we'll hit the BUG_ON
because HRTIMER_STATE_ENQUEUED will be set.

Since the above is a perfectly valid scenario, remove the BUG_ON and
make the enqueue_hrtimer() call conditional on the timer not being
enqueued already.

NOTE: in that concurrent scenario it is entirely common for both sites
to want to modify the hrtimer. Since hrtimers don't provide
serialization themselves, be sure to provide some such that the
hrtimer::function and the hrtimer_start() caller don't both try and
fudge the expiration state at the same time.

To that effect, add a WARN when someone tries to forward an already
enqueued timer.
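
For illustration, such serialization could look like the sketch below
(hypothetical my_lock/my_timer/my_period names, not part of this patch):
both the handler and the external start path take the same lock before
touching the expiry.

	static DEFINE_RAW_SPINLOCK(my_lock);
	static struct hrtimer my_timer;
	static ktime_t my_period;
	static bool my_keep_running;

	static enum hrtimer_restart my_fn(struct hrtimer *timer)
	{
		enum hrtimer_restart ret = HRTIMER_NORESTART;

		raw_spin_lock(&my_lock);
		/*
		 * If a concurrent my_start() already re-enqueued the timer,
		 * leave it alone; forwarding it now would trip the new WARN.
		 */
		if (my_keep_running && !hrtimer_is_queued(timer)) {
			hrtimer_forward_now(timer, my_period);
			ret = HRTIMER_RESTART;
		}
		raw_spin_unlock(&my_lock);

		return ret;
	}

	static void my_start(void)
	{
		raw_spin_lock(&my_lock);
		/* May run while my_fn() executes; my_lock serializes us. */
		if (!hrtimer_is_queued(&my_timer))
			hrtimer_start(&my_timer, my_period, HRTIMER_MODE_REL);
		raw_spin_unlock(&my_lock);
	}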

Fixes: 2d44ae4d7135 ("hrtimer: clean up cpu->base locking tricks")
Cc: Ben Segall <bsegall@google.com>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/time/hrtimer.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -791,6 +791,9 @@ u64 hrtimer_forward(struct hrtimer *time
 	if (delta.tv64 < 0)
 		return 0;
 
+	if (WARN_ON(timer->state & HRTIMER_STATE_ENQUEUED))
+		return 0;
+
 	if (interval.tv64 < hrtimer_resolution)
 		interval.tv64 = hrtimer_resolution;
 
@@ -1131,11 +1134,14 @@ static void __run_hrtimer(struct hrtimer
 	 * Note: We clear the CALLBACK bit after enqueue_hrtimer and
 	 * we do not reprogramm the event hardware. Happens either in
 	 * hrtimer_start_range_ns() or in hrtimer_interrupt()
+	 *
+	 * Note: Because we dropped the cpu_base->lock above,
+	 * hrtimer_start_range_ns() can have popped in and enqueued the timer
+	 * for us already.
 	 */
-	if (restart != HRTIMER_NORESTART) {
-		BUG_ON(timer->state != HRTIMER_STATE_CALLBACK);
+	if (restart != HRTIMER_NORESTART &&
+	    !(timer->state & HRTIMER_STATE_ENQUEUED))
 		enqueue_hrtimer(timer, base);
-	}
 
 	WARN_ON_ONCE(!(timer->state & HRTIMER_STATE_CALLBACK));
 




* [PATCH 2/3] sched: Cleanup bandwidth timers
  2015-04-15  9:41 [PATCH 0/3] hrtimer (related) fixes Peter Zijlstra
  2015-04-15  9:41 ` [PATCH 1/3] hrtimer: Fix race between hrtimer_start() and __run_hrtimer() Peter Zijlstra
@ 2015-04-15  9:41 ` Peter Zijlstra
  2015-04-16 20:03   ` bsegall
  2015-04-22 19:15   ` [tip:timers/core] " tip-bot for Peter Zijlstra
  2015-04-15  9:41 ` [PATCH 3/3] perf: Fix mux_interval hrtimer wreckage Peter Zijlstra
  2 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-15  9:41 UTC (permalink / raw)
  To: tglx
  Cc: mingo, linux-kernel, Paul Turner, Ben Segall, Roman Gushchin,
	Peter Zijlstra

[-- Attachment #1: peterz-sched-fair-bw-mess.patch --]
[-- Type: text/plain, Size: 8227 bytes --]

Roman reported a 3 cpu lockup scenario involving __start_cfs_bandwidth().

The more I look at that code the more I'm convinced it's crack, that
entire __start_cfs_bandwidth() thing is brain melting, we don't need to
cancel a timer before starting it, *hrtimer_start*() will happily remove
the timer for you if it's still enqueued.

Removing that removes a big part of the problem; no more ugly cancel
loop to get stuck in.

So now, if I understand things right, the entire reason you have this
cfs_b->lock guarded ->timer_active nonsense is to make sure we don't
accidentally lose the timer.

It appears to me that it should be possible to guarantee that same by
unconditionally (re)starting the timer when !queued. Because regardless
what hrtimer::function will return, if we beat it to (re)enqueue the
timer, it doesn't matter.

Now, because hrtimers don't come with any serialization guarantees we
must ensure both handler and (re)start loop serialize their access to
the hrtimer to avoid both trying to forward the timer at the same
time.

Update the rt bandwidth timer to match.

This effectively reverts: 09dc4ab03936 ("sched/fair: Fix
tg_set_cfs_bandwidth() deadlock on rq->lock").
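
Condensed from the rt.c hunks below, the handler side of the pattern both
bandwidth timers now follow -- forward and account under the lock, dropping
it only around the actual period work:

	static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
	{
		struct rt_bandwidth *rt_b =
			container_of(timer, struct rt_bandwidth, rt_period_timer);
		int idle = 0;
		int overrun;

		raw_spin_lock(&rt_b->rt_runtime_lock);
		for (;;) {
			/* Forward under the lock so nobody else fudges the expiry. */
			overrun = hrtimer_forward_now(timer, rt_b->rt_period);
			if (!overrun)
				break;

			raw_spin_unlock(&rt_b->rt_runtime_lock);
			idle = do_sched_rt_period_timer(rt_b, overrun);
			raw_spin_lock(&rt_b->rt_runtime_lock);
		}
		raw_spin_unlock(&rt_b->rt_runtime_lock);

		return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
	}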

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Cc: Ben Segall <bsegall@google.com>
Reported-by: Roman Gushchin <klamm@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/core.c  |   15 ++++++------
 kernel/sched/fair.c  |   59 ++++++++++++---------------------------------------
 kernel/sched/rt.c    |   14 +++++-------
 kernel/sched/sched.h |    4 +--
 4 files changed, 31 insertions(+), 61 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -92,10 +92,13 @@
 
 void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
 {
-	if (hrtimer_active(period_timer))
-		return;
+	/*
+	 * Do not forward the expiration time of active timers;
+	 * we do not want to loose an overrun.
+	 */
+	if (!hrtimer_active(period_timer))
+		hrtimer_forward_now(period_timer, period);
 
-	hrtimer_forward_now(period_timer, period);
 	hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
@@ -8098,10 +8101,8 @@ static int tg_set_cfs_bandwidth(struct t
 
 	__refill_cfs_bandwidth_runtime(cfs_b);
 	/* restart the period timer (if active) to handle new period expiry */
-	if (runtime_enabled && cfs_b->timer_active) {
-		/* force a reprogram */
-		__start_cfs_bandwidth(cfs_b, true);
-	}
+	if (runtime_enabled)
+		start_cfs_bandwidth(cfs_b);
 	raw_spin_unlock_irq(&cfs_b->lock);
 
 	for_each_online_cpu(i) {
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3476,16 +3476,7 @@ static int assign_cfs_rq_runtime(struct
 	if (cfs_b->quota == RUNTIME_INF)
 		amount = min_amount;
 	else {
-		/*
-		 * If the bandwidth pool has become inactive, then at least one
-		 * period must have elapsed since the last consumption.
-		 * Refresh the global state and ensure bandwidth timer becomes
-		 * active.
-		 */
-		if (!cfs_b->timer_active) {
-			__refill_cfs_bandwidth_runtime(cfs_b);
-			__start_cfs_bandwidth(cfs_b, false);
-		}
+		start_cfs_bandwidth(cfs_b);
 
 		if (cfs_b->runtime > 0) {
 			amount = min(cfs_b->runtime, min_amount);
@@ -3634,6 +3625,7 @@ static void throttle_cfs_rq(struct cfs_r
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
 	long task_delta, dequeue = 1;
+	bool empty;
 
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
 
@@ -3663,13 +3655,21 @@ static void throttle_cfs_rq(struct cfs_r
 	cfs_rq->throttled = 1;
 	cfs_rq->throttled_clock = rq_clock(rq);
 	raw_spin_lock(&cfs_b->lock);
+	empty = list_empty(&cfs_rq->throttled_list);
+
 	/*
 	 * Add to the _head_ of the list, so that an already-started
 	 * distribute_cfs_runtime will not see us
 	 */
 	list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
-	if (!cfs_b->timer_active)
-		__start_cfs_bandwidth(cfs_b, false);
+
+	/*
+	 * If we're the first throttled task, make sure the bandwidth
+	 * timer is running.
+	 */
+	if (empty)
+		start_cfs_bandwidth(cfs_b);
+
 	raw_spin_unlock(&cfs_b->lock);
 }
 
@@ -3784,13 +3784,6 @@ static int do_sched_cfs_period_timer(str
 	if (cfs_b->idle && !throttled)
 		goto out_deactivate;
 
-	/*
-	 * if we have relooped after returning idle once, we need to update our
-	 * status as actually running, so that other cpus doing
-	 * __start_cfs_bandwidth will stop trying to cancel us.
-	 */
-	cfs_b->timer_active = 1;
-
 	__refill_cfs_bandwidth_runtime(cfs_b);
 
 	if (!throttled) {
@@ -3835,7 +3828,6 @@ static int do_sched_cfs_period_timer(str
 	return 0;
 
 out_deactivate:
-	cfs_b->timer_active = 0;
 	return 1;
 }
 
@@ -3999,6 +3991,7 @@ static enum hrtimer_restart sched_cfs_sl
 {
 	struct cfs_bandwidth *cfs_b =
 		container_of(timer, struct cfs_bandwidth, slack_timer);
+
 	do_sched_cfs_slack_timer(cfs_b);
 
 	return HRTIMER_NORESTART;
@@ -4008,15 +4001,12 @@ static enum hrtimer_restart sched_cfs_pe
 {
 	struct cfs_bandwidth *cfs_b =
 		container_of(timer, struct cfs_bandwidth, period_timer);
-	ktime_t now;
 	int overrun;
 	int idle = 0;
 
 	raw_spin_lock(&cfs_b->lock);
 	for (;;) {
-		now = hrtimer_cb_get_time(timer);
-		overrun = hrtimer_forward(timer, now, cfs_b->period);
-
+		overrun = hrtimer_forward_now(timer, cfs_b->period);
 		if (!overrun)
 			break;
 
@@ -4047,27 +4037,8 @@ static void init_cfs_rq_runtime(struct c
 	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 }
 
-/* requires cfs_b->lock, may release to reprogram timer */
-void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force)
+void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
-	/*
-	 * The timer may be active because we're trying to set a new bandwidth
-	 * period or because we're racing with the tear-down path
-	 * (timer_active==0 becomes visible before the hrtimer call-back
-	 * terminates).  In either case we ensure that it's re-programmed
-	 */
-	while (unlikely(hrtimer_active(&cfs_b->period_timer)) &&
-	       hrtimer_try_to_cancel(&cfs_b->period_timer) < 0) {
-		/* bounce the lock to allow do_sched_cfs_period_timer to run */
-		raw_spin_unlock(&cfs_b->lock);
-		cpu_relax();
-		raw_spin_lock(&cfs_b->lock);
-		/* if someone else restarted the timer then we're done */
-		if (!force && cfs_b->timer_active)
-			return;
-	}
-
-	cfs_b->timer_active = 1;
 	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
 }
 
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -18,19 +18,20 @@ static enum hrtimer_restart sched_rt_per
 {
 	struct rt_bandwidth *rt_b =
 		container_of(timer, struct rt_bandwidth, rt_period_timer);
-	ktime_t now;
-	int overrun;
 	int idle = 0;
+	int overrun;
 
+	raw_spin_lock(&rt_b->rt_runtime_lock);
 	for (;;) {
-		now = hrtimer_cb_get_time(timer);
-		overrun = hrtimer_forward(timer, now, rt_b->rt_period);
-
+		overrun = hrtimer_forward_now(timer, rt_b->rt_period);
 		if (!overrun)
 			break;
 
+		raw_spin_unlock(&rt_b->rt_runtime_lock);
 		idle = do_sched_rt_period_timer(rt_b, overrun);
+		raw_spin_lock(&rt_b->rt_runtime_lock);
 	}
+	raw_spin_unlock(&rt_b->rt_runtime_lock);
 
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
 }
@@ -52,9 +53,6 @@ static void start_rt_bandwidth(struct rt
 	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
 		return;
 
-	if (hrtimer_active(&rt_b->rt_period_timer))
-		return;
-
 	raw_spin_lock(&rt_b->rt_runtime_lock);
 	start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -215,7 +215,7 @@ struct cfs_bandwidth {
 	s64 hierarchical_quota;
 	u64 runtime_expires;
 
-	int idle, timer_active;
+	int idle;
 	struct hrtimer period_timer, slack_timer;
 	struct list_head throttled_cfs_rq;
 
@@ -306,7 +306,7 @@ extern void init_cfs_bandwidth(struct cf
 extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
 
 extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
-extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force);
+extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
 extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
 
 extern void free_rt_sched_group(struct task_group *tg);




* [PATCH 3/3] perf: Fix mux_interval hrtimer wreckage
  2015-04-15  9:41 [PATCH 0/3] hrtimer (related) fixes Peter Zijlstra
  2015-04-15  9:41 ` [PATCH 1/3] hrtimer: Fix race between hrtimer_start() and __run_hrtimer() Peter Zijlstra
  2015-04-15  9:41 ` [PATCH 2/3] sched: Cleanup bandwidth timers Peter Zijlstra
@ 2015-04-15  9:41 ` Peter Zijlstra
  2015-04-15 13:48   ` David Ahern
                     ` (2 more replies)
  2 siblings, 3 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-15  9:41 UTC (permalink / raw)
  To: tglx; +Cc: mingo, linux-kernel, Stephane Eranian, Peter Zijlstra

[-- Attachment #1: peterz-perf-fix-mux-hrtimer.patch --]
[-- Type: text/plain, Size: 7087 bytes --]

Thomas stumbled over the hrtimer_forward_now() in
perf_event_mux_interval_ms_store() and noticed its broken-ness.

You cannot just change the expiry time of an active timer, it will
destroy the red-black tree order and cause havoc.

Change it to (re)start the timer instead, (re)starting a timer will
dequeue and enqueue a timer and therefore preserve rb-tree order.
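
Lifted out of the diff below, the broken and the fixed variant side by side:

	/* Broken: rewrites the sort key of a node still linked in the rb-tree. */
	if (hrtimer_active(&cpuctx->hrtimer))
		hrtimer_forward_now(&cpuctx->hrtimer, cpuctx->hrtimer_interval);

	/* Fixed: hrtimer_start() dequeues, updates the expiry, re-enqueues. */
	hrtimer_start(&cpuctx->hrtimer, cpuctx->hrtimer_interval,
		      HRTIMER_MODE_REL_PINNED);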

Since we cannot enqueue remotely, wrap the thing in
cpu_function_call(), this however mandates that we restrict ourselves
to online cpus. Also serialize the entire setting so we don't get
multiple concurrent threads trying to update to different values.

Also fix a problem in perf_mux_hrtimer_restart(): checking against
hrtimer_active() can actually lose us the timer when timer->state ==
HRTIMER_STATE_CALLBACK and the callback has already decided NORESTART.

Furthermore it doesn't make any sense to test
hrtimer_callback_running() when we already tested hrtimer_active(),
but with the above change, we explicitly must call it when
callback_running.

Lastly, rename a few functions:

  s/perf_cpu_hrtimer_/perf_mux_hrtimer_/ -- because I could not find
                                            the mux timer function

  s/\<hr\>/timer/ -- because that's the normal way of calling things.

Fixes: 62b856397927 ("perf: Add sysfs entry to adjust multiplexing interval per PMU")
Cc: Stephane Eranian <eranian@google.com>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |   62 ++++++++++++++++++++++++++++-----------------------
 1 file changed, 35 insertions(+), 27 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -51,9 +51,11 @@
 
 static struct workqueue_struct *perf_wq;
 
+typedef int (*remote_function_f)(void *);
+
 struct remote_function_call {
 	struct task_struct	*p;
-	int			(*func)(void *info);
+	remote_function_f	func;
 	void			*info;
 	int			ret;
 };
@@ -86,7 +88,7 @@ static void remote_function(void *data)
  *	    -EAGAIN - when the process moved away
  */
 static int
-task_function_call(struct task_struct *p, int (*func) (void *info), void *info)
+task_function_call(struct task_struct *p, remote_function_f func, void *info)
 {
 	struct remote_function_call data = {
 		.p	= p,
@@ -110,7 +112,7 @@ task_function_call(struct task_struct *p
  *
  * returns: @func return value or -ENXIO when the cpu is offline
  */
-static int cpu_function_call(int cpu, int (*func) (void *info), void *info)
+static int cpu_function_call(int cpu, remote_function_f func, void *info)
 {
 	struct remote_function_call data = {
 		.p	= NULL,
@@ -747,7 +749,7 @@ perf_cgroup_mark_enabled(struct perf_eve
 /*
  * function must be called with interrupts disbled
  */
-static enum hrtimer_restart perf_cpu_hrtimer_handler(struct hrtimer *hr)
+static enum hrtimer_restart perf_mux_hrtimer_handler(struct hrtimer *hr)
 {
 	struct perf_cpu_context *cpuctx;
 	enum hrtimer_restart ret = HRTIMER_NORESTART;
@@ -771,7 +773,7 @@ static enum hrtimer_restart perf_cpu_hrt
 }
 
 /* CPU is going down */
-void perf_cpu_hrtimer_cancel(int cpu)
+void perf_mux_hrtimer_cancel(int cpu)
 {
 	struct perf_cpu_context *cpuctx;
 	struct pmu *pmu;
@@ -798,11 +800,11 @@ void perf_cpu_hrtimer_cancel(int cpu)
 	local_irq_restore(flags);
 }
 
-static void __perf_cpu_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
+static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
 {
-	struct hrtimer *hr = &cpuctx->hrtimer;
+	struct hrtimer *timer = &cpuctx->hrtimer;
 	struct pmu *pmu = cpuctx->ctx.pmu;
-	int timer;
+	u64 interval;
 
 	/* no multiplexing needed for SW PMU */
 	if (pmu->task_ctx_nr == perf_sw_context)
@@ -812,29 +814,29 @@ static void __perf_cpu_hrtimer_init(stru
 	 * check default is sane, if not set then force to
 	 * default interval (1/tick)
 	 */
-	timer = pmu->hrtimer_interval_ms;
-	if (timer < 1)
-		timer = pmu->hrtimer_interval_ms = PERF_CPU_HRTIMER;
+	interval = pmu->hrtimer_interval_ms;
+	if (interval < 1)
+		interval = pmu->hrtimer_interval_ms = PERF_CPU_HRTIMER;
 
-	cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * timer);
+	cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * interval);
 
-	hrtimer_init(hr, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
-	hr->function = perf_cpu_hrtimer_handler;
+	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
+	timer->function = perf_mux_hrtimer_handler;
 }
 
-static void perf_cpu_hrtimer_restart(struct perf_cpu_context *cpuctx)
+static int perf_mux_hrtimer_restart(struct perf_cpu_context *cpuctx)
 {
-	struct hrtimer *hr = &cpuctx->hrtimer;
+	struct hrtimer *timer = &cpuctx->hrtimer;
 	struct pmu *pmu = cpuctx->ctx.pmu;
 
 	/* not for SW PMU */
 	if (pmu->task_ctx_nr == perf_sw_context)
-		return;
+		return 0;
 
-	if (hrtimer_active(hr))
-		return;
+	if (hrtimer_is_queued(timer))
+		return 0;
 
-	hrtimer_start(hr, cpuctx->hrtimer_interval, HRTIMER_MODE_REL_PINNED);
+	hrtimer_start(timer, cpuctx->hrtimer_interval, HRTIMER_MODE_REL_PINNED);
 }
 
 void perf_pmu_disable(struct pmu *pmu)
@@ -1913,7 +1915,7 @@ group_sched_in(struct perf_event *group_
 
 	if (event_sched_in(group_event, cpuctx, ctx)) {
 		pmu->cancel_txn(pmu);
-		perf_cpu_hrtimer_restart(cpuctx);
+		perf_mux_hrtimer_restart(cpuctx);
 		return -EAGAIN;
 	}
 
@@ -1960,7 +1962,7 @@ group_sched_in(struct perf_event *group_
 
 	pmu->cancel_txn(pmu);
 
-	perf_cpu_hrtimer_restart(cpuctx);
+	perf_mux_hrtimer_restart(cpuctx);
 
 	return -EAGAIN;
 }
@@ -2233,7 +2235,7 @@ static int __perf_event_enable(void *inf
 		 */
 		if (leader != event) {
 			group_sched_out(leader, cpuctx, ctx);
-			perf_cpu_hrtimer_restart(cpuctx);
+			perf_mux_hrtimer_restart(cpuctx);
 		}
 		if (leader->attr.pinned) {
 			update_group_times(leader);
@@ -7143,6 +7145,8 @@ perf_event_mux_interval_ms_show(struct d
 	return snprintf(page, PAGE_SIZE-1, "%d\n", pmu->hrtimer_interval_ms);
 }
 
+static DEFINE_MUTEX(mux_interval_mutex);
+
 static ssize_t
 perf_event_mux_interval_ms_store(struct device *dev,
 				 struct device_attribute *attr,
@@ -7162,17 +7166,21 @@ perf_event_mux_interval_ms_store(struct
 	if (timer == pmu->hrtimer_interval_ms)
 		return count;
 
+	mutex_lock(&mux_interval_mutex);
 	pmu->hrtimer_interval_ms = timer;
 
 	/* update all cpuctx for this PMU */
-	for_each_possible_cpu(cpu) {
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
 		struct perf_cpu_context *cpuctx;
 		cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
 		cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * timer);
 
-		if (hrtimer_active(&cpuctx->hrtimer))
-			hrtimer_forward_now(&cpuctx->hrtimer, cpuctx->hrtimer_interval);
+		cpu_function_call(cpu,
+			(remote_function_f)perf_mux_hrtimer_restart, cpuctx);
 	}
+	put_online_cpus();
+	mutex_unlock(&mux_interval_mutex);
 
 	return count;
 }
@@ -7277,7 +7285,7 @@ int perf_pmu_register(struct pmu *pmu, c
 		lockdep_set_class(&cpuctx->ctx.lock, &cpuctx_lock);
 		cpuctx->ctx.pmu = pmu;
 
-		__perf_cpu_hrtimer_init(cpuctx, cpu);
+		__perf_mux_hrtimer_init(cpuctx, cpu);
 
 		cpuctx->unique_pmu = pmu;
 	}




* Re: [PATCH 1/3] hrtimer: Fix race between hrtimer_start() and __run_hrtimer()
  2015-04-15  9:41 ` [PATCH 1/3] hrtimer: Fix race between hrtimer_start() and __run_hrtimer() Peter Zijlstra
@ 2015-04-15 10:26   ` Thomas Gleixner
  2015-04-15 11:31     ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2015-04-15 10:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, Ben Segall, Roman Gushchin, Paul Turner

On Wed, 15 Apr 2015, Peter Zijlstra wrote:

> hrtimer: Fix race between hrtimer_start() and __run_hrtimer()

I don't think that subject line is correct.

Back in the early hrtimer days we deliberately made the design
decision that this kind of usage is forbidden. The reason for this is
that the hrtimer infrastructure cannot provide proper
serialization. So we thought it would be a sane restriction that
restarting a timer from the callback should not be mixed with
concurrent restarts from a different call site.

So I rather prefer a subject line like this

hrtimer: Allow concurrent hrtimer_start() for self restarting timers

or such.

> Because we drop cpu_base->lock around calling hrtimer::function, it is
> possible for hrtimer_start() to come in between and enqueue the timer.
> 
> If hrtimer::function then returns HRTIMER_RESTART we'll hit the BUG_ON
> because HRTIMER_STATE_ENQUEUED will be set.
> 
> Since the above is a perfectly valid scenario, remove the BUG_ON and
> make the enqueue_hrtimer() call conditional on the timer not being
> enqueued already.
> 
> NOTE: in that concurrent scenario it is entirely common for both sites
> to want to modify the hrtimer. Since hrtimers don't provide
> serialization themselves, be sure to provide some such that the
> hrtimer::function and the hrtimer_start() caller don't both try and
> fudge the expiration state at the same time.

Right.
 
> To that effect, add a WARN when someone tries to forward an already
> enqueued timer.

The warnon itself is nice, but what about sites which use
hrtimer_set_expires() and hrtimer_start_expires()?

Other than that I can see why you want that ...

Thanks,

	tglx


* Re: [PATCH 1/3] hrtimer: Fix race between hrtimer_start() and __run_hrtimer()
  2015-04-15 10:26   ` Thomas Gleixner
@ 2015-04-15 11:31     ` Peter Zijlstra
  2015-04-15 11:35       ` Thomas Gleixner
  2015-04-22 19:15       ` [tip:timers/core] hrtimer: Allow concurrent hrtimer_start() for self restarting timers tip-bot for Peter Zijlstra
  0 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-15 11:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, linux-kernel, Ben Segall, Roman Gushchin, Paul Turner

On Wed, Apr 15, 2015 at 12:26:58PM +0200, Thomas Gleixner wrote:
> On Wed, 15 Apr 2015, Peter Zijlstra wrote:
> > hrtimer: Fix race between hrtimer_start() and __run_hrtimer()
> 
> I don't think that subject line is correct.
> 
> Back in the early hrtimer days we deliberately made the design
> decision that this kind of usage is forbidden. The reason for this is
> that the hrtimer infrastructure cannot provide proper
> serialization. So we thought it would be a sane restriction that
> restarting a timer from the callback should not be mixed with
> concurrent restarts from a different call site.

Ah I was not aware. Until I changed the locking it was possible simply
because everything was serialized by the base lock. So the concurrent
start would either land before the callback or after it but not in the
middle like it can now.

> So I rather prefer a subject line like this
> 
> hrtimer: Allow concurrent hrtimer_start() for self restarting timers
> 

/me copy/paste, done! :-)

> > To that effect, add a WARN when someone tries to forward an already
> > enqueued timer.
> 
> The warnon itself is nice, but what about sites which use
> hrtimer_set_expires() and hrtimer_start_expires()?

They are all inlines; furthermore, forward is the most common way to
change the expiry of periodic / self restarting timers, so it would
gain us the most.

How about this then?

---
Subject: hrtimer: Allow concurrent hrtimer_start() for self restarting timers
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue May 20 15:49:48 CEST 2014

Because we drop cpu_base->lock around calling hrtimer::function, it is
possible for hrtimer_start() to come in between and enqueue the timer.

If hrtimer::function then returns HRTIMER_RESTART we'll hit the BUG_ON
because HRTIMER_STATE_ENQUEUED will be set.

Since the above is a perfectly valid scenario, remove the BUG_ON and
make the enqueue_hrtimer() call conditional on the timer not being
enqueued already.

NOTE: in that concurrent scenario it is entirely common for both sites
to want to modify the hrtimer. Since hrtimers don't provide
serialization themselves, be sure to provide some such that the
hrtimer::function and the hrtimer_start() caller don't both try and
fudge the expiration state at the same time.

To that effect, add a WARN when someone tries to forward an already
enqueued timer, the most common way to change the expiry of self
restarting timers. Ideally we'd put the WARN in everything modifying
the expiry but most of that is inlines and we don't need the bloat.

Fixes: 2d44ae4d7135 ("hrtimer: clean up cpu->base locking tricks")
Cc: Ben Segall <bsegall@google.com>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/time/hrtimer.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -791,6 +791,9 @@ u64 hrtimer_forward(struct hrtimer *time
 	if (delta.tv64 < 0)
 		return 0;
 
+	if (WARN_ON(timer->state & HRTIMER_STATE_ENQUEUED))
+		return 0;
+
 	if (interval.tv64 < hrtimer_resolution)
 		interval.tv64 = hrtimer_resolution;
 
@@ -1131,11 +1134,14 @@ static void __run_hrtimer(struct hrtimer
 	 * Note: We clear the CALLBACK bit after enqueue_hrtimer and
 	 * we do not reprogramm the event hardware. Happens either in
 	 * hrtimer_start_range_ns() or in hrtimer_interrupt()
+	 *
+	 * Note: Because we dropped the cpu_base->lock above,
+	 * hrtimer_start_range_ns() can have popped in and enqueued the timer
+	 * for us already.
 	 */
-	if (restart != HRTIMER_NORESTART) {
-		BUG_ON(timer->state != HRTIMER_STATE_CALLBACK);
+	if (restart != HRTIMER_NORESTART &&
+	    !(timer->state & HRTIMER_STATE_ENQUEUED))
 		enqueue_hrtimer(timer, base);
-	}
 
 	WARN_ON_ONCE(!(timer->state & HRTIMER_STATE_CALLBACK));
 


* Re: [PATCH 1/3] hrtimer: Fix race between hrtimer_start() and __run_hrtimer()
  2015-04-15 11:31     ` Peter Zijlstra
@ 2015-04-15 11:35       ` Thomas Gleixner
  2015-04-15 11:43         ` Peter Zijlstra
  2015-04-22 19:15       ` [tip:timers/core] hrtimer: Allow concurrent hrtimer_start() for self restarting timers tip-bot for Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2015-04-15 11:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, linux-kernel, Ben Segall, Roman Gushchin, Paul Turner

On Wed, 15 Apr 2015, Peter Zijlstra wrote:
> On Wed, Apr 15, 2015 at 12:26:58PM +0200, Thomas Gleixner wrote:
> > On Wed, 15 Apr 2015, Peter Zijlstra wrote:
> > > hrtimer: Fix race between hrtimer_start() and __run_hrtimer()
> > 
> > I don't think that subject line is correct.
> > 
> > Back in the early hrtimer days we deliberately made the design
> > decision that this kind of usage is forbidden. The reason for this is
> > that the hrtimer infrastructure cannot provide proper
> > serialization. So we thought it would be a sane restriction that
> > restarting a timer from the callback should not be mixed with
> > concurrent restarts from a different call site.
> 
> Ah I was not aware. Until I changed the locking it was possible simply
> because everything was serialized by the base lock. So the concurrent
> start would either land before the callback or after it but not in the
> middle like it can now.
> 
> > So I rather prefer a subject line like this
> > 
> > hrtimer: Allow concurrent hrtimer_start() for self restarting timers
> > 
> 
> /me copy/paste, done! :-)
> 
> > > To that effect, add a WARN when someone tries to forward an already
> > > enqueued timer.
> > 
> > The warnon itself is nice, but what about sites which use
> > hrtimer_set_expires() and hrtimer_start_expires()?
> 
> They are all inlines; furthermore, forward is the most common way to
> change the expiry of periodic / self restarting timers, so it would
> gain us the most.

Right. I just wanted to mention it.

> How about this then?

Looks good. Should I add those 3 patches to the other pile of hrtimer
stuff?
 
Thanks,

	tglx


* Re: [PATCH 1/3] hrtimer: Fix race between hrtimer_start() and __run_hrtimer()
  2015-04-15 11:35       ` Thomas Gleixner
@ 2015-04-15 11:43         ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-15 11:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: mingo, linux-kernel, Ben Segall, Roman Gushchin, Paul Turner

On Wed, Apr 15, 2015 at 01:35:15PM +0200, Thomas Gleixner wrote:
> > How about this then?
> 
> Looks good. Should I add those 3 patches to the other pile of hrtimer
> stuff?

Let's wait a little while for pjt/ben to look over the cfs bandwidth
change, but basically yes please.


* Re: [PATCH 3/3] perf: Fix mux_interval hrtimer wreckage
  2015-04-15  9:41 ` [PATCH 3/3] perf: Fix mux_interval hrtimer wreckage Peter Zijlstra
@ 2015-04-15 13:48   ` David Ahern
  2015-04-15 14:20     ` Peter Zijlstra
  2015-04-22 15:12   ` Thomas Gleixner
  2015-04-22 19:15   ` [tip:timers/core] " tip-bot for Peter Zijlstra
  2 siblings, 1 reply; 22+ messages in thread
From: David Ahern @ 2015-04-15 13:48 UTC (permalink / raw)
  To: Peter Zijlstra, tglx; +Cc: mingo, linux-kernel, Stephane Eranian

Hi Peter:

On 4/15/15 3:41 AM, Peter Zijlstra wrote:
> Thomas stumbled over the hrtimer_forward_now() in
> perf_event_mux_interval_ms_store() and noticed its broken-ness.
>
> You cannot just change the expiry time of an active timer, it will
> destroy the red-black tree order and cause havoc.
>
> Change it to (re)start the timer instead, (re)starting a timer will
> dequeue and enqueue a timer and therefore preserve rb-tree order.

Should this be fixed in stable trees as well?

David


* Re: [PATCH 3/3] perf: Fix mux_interval hrtimer wreckage
  2015-04-15 13:48   ` David Ahern
@ 2015-04-15 14:20     ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-04-15 14:20 UTC (permalink / raw)
  To: David Ahern; +Cc: tglx, mingo, linux-kernel, Stephane Eranian

On Wed, Apr 15, 2015 at 07:48:02AM -0600, David Ahern wrote:
> Hi Peter:
> 
> On 4/15/15 3:41 AM, Peter Zijlstra wrote:
> >Thomas stumbled over the hrtimer_forward_now() in
> >perf_event_mux_interval_ms_store() and noticed its broken-ness.
> >
> >You cannot just change the expiry time of an active timer, it will
> >destroy the red-black tree order and cause havoc.
> >
> >Change it to (re)start the timer instead, (re)starting a timer will
> >dequeue and enqueue a timer and therefore preserve rb-tree order.
> 
> Should this be fixed in stable trees as well?

Hmm, I was going to write how this all relies on the hrtimer rework, but
I think we might just be fine because all this is strictly cpu local and
serialized through irq-disable.

But yes, it can't be worse than it is now.


* Re: [PATCH 2/3] sched: Cleanup bandwidth timers
  2015-04-15  9:41 ` [PATCH 2/3] sched: Cleanup bandwidth timers Peter Zijlstra
@ 2015-04-16 20:03   ` bsegall
  2015-04-22 19:15   ` [tip:timers/core] " tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 22+ messages in thread
From: bsegall @ 2015-04-16 20:03 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, mingo, linux-kernel, Paul Turner, Roman Gushchin

Peter Zijlstra <peterz@infradead.org> writes:

> Roman reported a 3 cpu lockup scenario involving __start_cfs_bandwidth().
>
> The more I look at that code the more I'm convinced it's crack, that
> entire __start_cfs_bandwidth() thing is brain melting, we don't need to
> cancel a timer before starting it, *hrtimer_start*() will happily remove
> the timer for you if it's still enqueued.
>
> Removing that removes a big part of the problem; no more ugly cancel
> loop to get stuck in.
>
> So now, if I understand things right, the entire reason you have this
> cfs_b->lock guarded ->timer_active nonsense is to make sure we don't
> accidentally lose the timer.
>
> It appears to me that it should be possible to guarantee that same by
> unconditionally (re)starting the timer when !queued. Because regardless
> what hrtimer::function will return, if we beat it to (re)enqueue the
> timer, it doesn't matter.
>
> Now, because hrtimers don't come with any serialization guarantees we
> must ensure both handler and (re)start loop serialize their access to
> the hrtimer to avoid both trying to forward the timer at the same
> time.
>
> Update the rt bandwidth timer to match.
>
> This effectively reverts: 09dc4ab03936 ("sched/fair: Fix
> tg_set_cfs_bandwidth() deadlock on rq->lock").
>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Paul Turner <pjt@google.com>
> Cc: Ben Segall <bsegall@google.com>
> Reported-by: Roman Gushchin <klamm@yandex-team.ru>

Reviewed-by: Ben Segall <bsegall@google.com>

So much nicer. Docs/comment issue only: s/loose/lose/


> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/core.c  |   15 ++++++------
>  kernel/sched/fair.c  |   59 ++++++++++++---------------------------------------
>  kernel/sched/rt.c    |   14 +++++-------
>  kernel/sched/sched.h |    4 +--
>  4 files changed, 31 insertions(+), 61 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -92,10 +92,13 @@
>  
>  void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
>  {
> -	if (hrtimer_active(period_timer))
> -		return;
> +	/*
> +	 * Do not forward the expiration time of active timers;
> +	 * we do not want to loose an overrun.
> +	 */
> +	if (!hrtimer_active(period_timer))
> +		hrtimer_forward_now(period_timer, period);
>  
> -	hrtimer_forward_now(period_timer, period);
>  	hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED);
>  }
>  
> @@ -8098,10 +8101,8 @@ static int tg_set_cfs_bandwidth(struct t
>  
>  	__refill_cfs_bandwidth_runtime(cfs_b);
>  	/* restart the period timer (if active) to handle new period expiry */
> -	if (runtime_enabled && cfs_b->timer_active) {
> -		/* force a reprogram */
> -		__start_cfs_bandwidth(cfs_b, true);
> -	}
> +	if (runtime_enabled)
> +		start_cfs_bandwidth(cfs_b);
>  	raw_spin_unlock_irq(&cfs_b->lock);
>  
>  	for_each_online_cpu(i) {
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3476,16 +3476,7 @@ static int assign_cfs_rq_runtime(struct
>  	if (cfs_b->quota == RUNTIME_INF)
>  		amount = min_amount;
>  	else {
> -		/*
> -		 * If the bandwidth pool has become inactive, then at least one
> -		 * period must have elapsed since the last consumption.
> -		 * Refresh the global state and ensure bandwidth timer becomes
> -		 * active.
> -		 */
> -		if (!cfs_b->timer_active) {
> -			__refill_cfs_bandwidth_runtime(cfs_b);
> -			__start_cfs_bandwidth(cfs_b, false);
> -		}
> +		start_cfs_bandwidth(cfs_b);
>  
>  		if (cfs_b->runtime > 0) {
>  			amount = min(cfs_b->runtime, min_amount);
> @@ -3634,6 +3625,7 @@ static void throttle_cfs_rq(struct cfs_r
>  	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
>  	struct sched_entity *se;
>  	long task_delta, dequeue = 1;
> +	bool empty;
>  
>  	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
>  
> @@ -3663,13 +3655,21 @@ static void throttle_cfs_rq(struct cfs_r
>  	cfs_rq->throttled = 1;
>  	cfs_rq->throttled_clock = rq_clock(rq);
>  	raw_spin_lock(&cfs_b->lock);
> +	empty = list_empty(&cfs_rq->throttled_list);
> +
>  	/*
>  	 * Add to the _head_ of the list, so that an already-started
>  	 * distribute_cfs_runtime will not see us
>  	 */
>  	list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
> -	if (!cfs_b->timer_active)
> -		__start_cfs_bandwidth(cfs_b, false);
> +
> +	/*
> +	 * If we're the first throttled task, make sure the bandwidth
> +	 * timer is running.
> +	 */
> +	if (empty)
> +		start_cfs_bandwidth(cfs_b);
> +
>  	raw_spin_unlock(&cfs_b->lock);
>  }
>  
> @@ -3784,13 +3784,6 @@ static int do_sched_cfs_period_timer(str
>  	if (cfs_b->idle && !throttled)
>  		goto out_deactivate;
>  
> -	/*
> -	 * if we have relooped after returning idle once, we need to update our
> -	 * status as actually running, so that other cpus doing
> -	 * __start_cfs_bandwidth will stop trying to cancel us.
> -	 */
> -	cfs_b->timer_active = 1;
> -
>  	__refill_cfs_bandwidth_runtime(cfs_b);
>  
>  	if (!throttled) {
> @@ -3835,7 +3828,6 @@ static int do_sched_cfs_period_timer(str
>  	return 0;
>  
>  out_deactivate:
> -	cfs_b->timer_active = 0;
>  	return 1;
>  }
>  
> @@ -3999,6 +3991,7 @@ static enum hrtimer_restart sched_cfs_sl
>  {
>  	struct cfs_bandwidth *cfs_b =
>  		container_of(timer, struct cfs_bandwidth, slack_timer);
> +
>  	do_sched_cfs_slack_timer(cfs_b);
>  
>  	return HRTIMER_NORESTART;
> @@ -4008,15 +4001,12 @@ static enum hrtimer_restart sched_cfs_pe
>  {
>  	struct cfs_bandwidth *cfs_b =
>  		container_of(timer, struct cfs_bandwidth, period_timer);
> -	ktime_t now;
>  	int overrun;
>  	int idle = 0;
>  
>  	raw_spin_lock(&cfs_b->lock);
>  	for (;;) {
> -		now = hrtimer_cb_get_time(timer);
> -		overrun = hrtimer_forward(timer, now, cfs_b->period);
> -
> +		overrun = hrtimer_forward_now(timer, cfs_b->period);
>  		if (!overrun)
>  			break;
>  
> @@ -4047,27 +4037,8 @@ static void init_cfs_rq_runtime(struct c
>  	INIT_LIST_HEAD(&cfs_rq->throttled_list);
>  }
>  
> -/* requires cfs_b->lock, may release to reprogram timer */
> -void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force)
> +void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>  {
> -	/*
> -	 * The timer may be active because we're trying to set a new bandwidth
> -	 * period or because we're racing with the tear-down path
> -	 * (timer_active==0 becomes visible before the hrtimer call-back
> -	 * terminates).  In either case we ensure that it's re-programmed
> -	 */
> -	while (unlikely(hrtimer_active(&cfs_b->period_timer)) &&
> -	       hrtimer_try_to_cancel(&cfs_b->period_timer) < 0) {
> -		/* bounce the lock to allow do_sched_cfs_period_timer to run */
> -		raw_spin_unlock(&cfs_b->lock);
> -		cpu_relax();
> -		raw_spin_lock(&cfs_b->lock);
> -		/* if someone else restarted the timer then we're done */
> -		if (!force && cfs_b->timer_active)
> -			return;
> -	}
> -
> -	cfs_b->timer_active = 1;
>  	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
>  }
>  
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -18,19 +18,20 @@ static enum hrtimer_restart sched_rt_per
>  {
>  	struct rt_bandwidth *rt_b =
>  		container_of(timer, struct rt_bandwidth, rt_period_timer);
> -	ktime_t now;
> -	int overrun;
>  	int idle = 0;
> +	int overrun;
>  
> +	raw_spin_lock(&rt_b->rt_runtime_lock);
>  	for (;;) {
> -		now = hrtimer_cb_get_time(timer);
> -		overrun = hrtimer_forward(timer, now, rt_b->rt_period);
> -
> +		overrun = hrtimer_forward_now(timer, rt_b->rt_period);
>  		if (!overrun)
>  			break;
>  
> +		raw_spin_unlock(&rt_b->rt_runtime_lock);
>  		idle = do_sched_rt_period_timer(rt_b, overrun);
> +		raw_spin_lock(&rt_b->rt_runtime_lock);
>  	}
> +	raw_spin_unlock(&rt_b->rt_runtime_lock);
>  
>  	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
>  }
> @@ -52,9 +53,6 @@ static void start_rt_bandwidth(struct rt
>  	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
>  		return;
>  
> -	if (hrtimer_active(&rt_b->rt_period_timer))
> -		return;
> -
>  	raw_spin_lock(&rt_b->rt_runtime_lock);
>  	start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
>  	raw_spin_unlock(&rt_b->rt_runtime_lock);
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -215,7 +215,7 @@ struct cfs_bandwidth {
>  	s64 hierarchical_quota;
>  	u64 runtime_expires;
>  
> -	int idle, timer_active;
> +	int idle;
>  	struct hrtimer period_timer, slack_timer;
>  	struct list_head throttled_cfs_rq;
>  
> @@ -306,7 +306,7 @@ extern void init_cfs_bandwidth(struct cf
>  extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
>  
>  extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
> -extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force);
> +extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
>  extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
>  
>  extern void free_rt_sched_group(struct task_group *tg);


* Re: [PATCH 3/3] perf: Fix mux_interval hrtimer wreckage
  2015-04-15  9:41 ` [PATCH 3/3] perf: Fix mux_interval hrtimer wreckage Peter Zijlstra
  2015-04-15 13:48   ` David Ahern
@ 2015-04-22 15:12   ` Thomas Gleixner
  2015-04-22 19:15   ` [tip:timers/core] " tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 22+ messages in thread
From: Thomas Gleixner @ 2015-04-22 15:12 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, Stephane Eranian

On Wed, 15 Apr 2015, Peter Zijlstra wrote:
> -static void perf_cpu_hrtimer_restart(struct perf_cpu_context *cpuctx)
> +static int perf_mux_hrtimer_restart(struct perf_cpu_context *cpuctx)
>  {
> -	struct hrtimer *hr = &cpuctx->hrtimer;
> +	struct hrtimer *timer = &cpuctx->hrtimer;
>  	struct pmu *pmu = cpuctx->ctx.pmu;
>  
>  	/* not for SW PMU */
>  	if (pmu->task_ctx_nr == perf_sw_context)
> -		return;
> +		return 0;
>  
> -	if (hrtimer_active(hr))
> -		return;
> +	if (hrtimer_is_queued(timer))
> +		return 0;
>  
> -	hrtimer_start(hr, cpuctx->hrtimer_interval, HRTIMER_MODE_REL_PINNED);
> +	hrtimer_start(timer, cpuctx->hrtimer_interval, HRTIMER_MODE_REL_PINNED);

  	return 0;

>  }

I fixed it up when applying it.

Thanks,

	tglx



* [tip:timers/core] hrtimer: Allow concurrent hrtimer_start() for self restarting timers
  2015-04-15 11:31     ` Peter Zijlstra
  2015-04-15 11:35       ` Thomas Gleixner
@ 2015-04-22 19:15       ` tip-bot for Peter Zijlstra
  2015-05-12 13:52         ` Sasha Levin
  1 sibling, 1 reply; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2015-04-22 19:15 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: pjt, tglx, linux-kernel, peterz, bsegall, hpa, mingo, klamm

Commit-ID:  5de2755c8c8b3a6b8414870e2c284914a2b42e4d
Gitweb:     http://git.kernel.org/tip/5de2755c8c8b3a6b8414870e2c284914a2b42e4d
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 20 May 2014 15:49:48 +0200
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 22 Apr 2015 17:06:52 +0200

hrtimer: Allow concurrent hrtimer_start() for self restarting timers

Because we drop cpu_base->lock around calling hrtimer::function, it is
possible for hrtimer_start() to come in between and enqueue the timer.

If hrtimer::function then returns HRTIMER_RESTART we'll hit the BUG_ON
because HRTIMER_STATE_ENQUEUED will be set.

Since the above is a perfectly valid scenario, remove the BUG_ON and
make the enqueue_hrtimer() call conditional on the timer not being
enqueued already.

NOTE: in that concurrent scenario it is entirely common for both sites
to want to modify the hrtimer. Since hrtimers don't provide
serialization themselves, be sure to provide some such that the
hrtimer::function and the hrtimer_start() caller don't both try and
fudge the expiration state at the same time.

To that effect, add a WARN when someone tries to forward an already
enqueued timer, the most common way to change the expiry of self
restarting timers. Ideally we'd put the WARN in everything modifying
the expiry but most of that is inlines and we don't need the bloat.

Fixes: 2d44ae4d7135 ("hrtimer: clean up cpu->base locking tricks")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20150415113105.GT5029@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/hrtimer.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 3bac942..4adf320 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -799,6 +799,9 @@ u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
 	if (delta.tv64 < 0)
 		return 0;
 
+	if (WARN_ON(timer->state & HRTIMER_STATE_ENQUEUED))
+		return 0;
+
 	if (interval.tv64 < hrtimer_resolution)
 		interval.tv64 = hrtimer_resolution;
 
@@ -1139,11 +1142,14 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
 	 * Note: We clear the CALLBACK bit after enqueue_hrtimer and
 	 * we do not reprogramm the event hardware. Happens either in
 	 * hrtimer_start_range_ns() or in hrtimer_interrupt()
+	 *
+	 * Note: Because we dropped the cpu_base->lock above,
+	 * hrtimer_start_range_ns() can have popped in and enqueued the timer
+	 * for us already.
 	 */
-	if (restart != HRTIMER_NORESTART) {
-		BUG_ON(timer->state != HRTIMER_STATE_CALLBACK);
+	if (restart != HRTIMER_NORESTART &&
+	    !(timer->state & HRTIMER_STATE_ENQUEUED))
 		enqueue_hrtimer(timer, base);
-	}
 
 	WARN_ON_ONCE(!(timer->state & HRTIMER_STATE_CALLBACK));
 


* [tip:timers/core] sched: Cleanup bandwidth timers
  2015-04-15  9:41 ` [PATCH 2/3] sched: Cleanup bandwidth timers Peter Zijlstra
  2015-04-16 20:03   ` bsegall
@ 2015-04-22 19:15   ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2015-04-22 19:15 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, bsegall, mingo, klamm, pjt, tglx, linux-kernel, hpa

Commit-ID:  77a4d1a1b9a122ca1fa3507bd30aec1520d7a8a4
Gitweb:     http://git.kernel.org/tip/77a4d1a1b9a122ca1fa3507bd30aec1520d7a8a4
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 15 Apr 2015 11:41:57 +0200
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 22 Apr 2015 17:06:53 +0200

sched: Cleanup bandwidth timers

Roman reported a 3 cpu lockup scenario involving __start_cfs_bandwidth().

The more I look at that code the more I'm convinced it's crack, that
entire __start_cfs_bandwidth() thing is brain melting, we don't need to
cancel a timer before starting it, *hrtimer_start*() will happily remove
the timer for you if it's still enqueued.

Removing that removes a big part of the problem; no more ugly cancel
loop to get stuck in.

So now, if I understand things right, the entire reason you have this
cfs_b->lock guarded ->timer_active nonsense is to make sure we don't
accidentally lose the timer.

It appears to me that it should be possible to guarantee that same by
unconditionally (re)starting the timer when !queued. Because regardless
what hrtimer::function will return, if we beat it to (re)enqueue the
timer, it doesn't matter.

Now, because hrtimers don't come with any serialization guarantees we
must ensure both handler and (re)start loop serialize their access to
the hrtimer to avoid both trying to forward the timer at the same
time.

Update the rt bandwidth timer to match.

This effectively reverts: 09dc4ab03936 ("sched/fair: Fix
tg_set_cfs_bandwidth() deadlock on rq->lock").

Reported-by: Roman Gushchin <klamm@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20150415095011.804589208@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/core.c  | 15 ++++++-------
 kernel/sched/fair.c  | 59 +++++++++++++---------------------------------------
 kernel/sched/rt.c    | 14 ++++++-------
 kernel/sched/sched.h |  4 ++--
 4 files changed, 31 insertions(+), 61 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3026678..d8a6196 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -92,10 +92,13 @@
 
 void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
 {
-	if (hrtimer_active(period_timer))
-		return;
+	/*
+	 * Do not forward the expiration time of active timers;
+	 * we do not want to loose an overrun.
+	 */
+	if (!hrtimer_active(period_timer))
+		hrtimer_forward_now(period_timer, period);
 
-	hrtimer_forward_now(period_timer, period);
 	hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED);
 }
 
@@ -8113,10 +8116,8 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 
 	__refill_cfs_bandwidth_runtime(cfs_b);
 	/* restart the period timer (if active) to handle new period expiry */
-	if (runtime_enabled && cfs_b->timer_active) {
-		/* force a reprogram */
-		__start_cfs_bandwidth(cfs_b, true);
-	}
+	if (runtime_enabled)
+		start_cfs_bandwidth(cfs_b);
 	raw_spin_unlock_irq(&cfs_b->lock);
 
 	for_each_online_cpu(i) {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 854881b..e3b32eb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3476,16 +3476,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	if (cfs_b->quota == RUNTIME_INF)
 		amount = min_amount;
 	else {
-		/*
-		 * If the bandwidth pool has become inactive, then at least one
-		 * period must have elapsed since the last consumption.
-		 * Refresh the global state and ensure bandwidth timer becomes
-		 * active.
-		 */
-		if (!cfs_b->timer_active) {
-			__refill_cfs_bandwidth_runtime(cfs_b);
-			__start_cfs_bandwidth(cfs_b, false);
-		}
+		start_cfs_bandwidth(cfs_b);
 
 		if (cfs_b->runtime > 0) {
 			amount = min(cfs_b->runtime, min_amount);
@@ -3634,6 +3625,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
 	long task_delta, dequeue = 1;
+	bool empty;
 
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
 
@@ -3663,13 +3655,21 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	cfs_rq->throttled = 1;
 	cfs_rq->throttled_clock = rq_clock(rq);
 	raw_spin_lock(&cfs_b->lock);
+	empty = list_empty(&cfs_rq->throttled_list);
+
 	/*
 	 * Add to the _head_ of the list, so that an already-started
 	 * distribute_cfs_runtime will not see us
 	 */
 	list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
-	if (!cfs_b->timer_active)
-		__start_cfs_bandwidth(cfs_b, false);
+
+	/*
+	 * If we're the first throttled task, make sure the bandwidth
+	 * timer is running.
+	 */
+	if (empty)
+		start_cfs_bandwidth(cfs_b);
+
 	raw_spin_unlock(&cfs_b->lock);
 }
 
@@ -3784,13 +3784,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 	if (cfs_b->idle && !throttled)
 		goto out_deactivate;
 
-	/*
-	 * if we have relooped after returning idle once, we need to update our
-	 * status as actually running, so that other cpus doing
-	 * __start_cfs_bandwidth will stop trying to cancel us.
-	 */
-	cfs_b->timer_active = 1;
-
 	__refill_cfs_bandwidth_runtime(cfs_b);
 
 	if (!throttled) {
@@ -3835,7 +3828,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 	return 0;
 
 out_deactivate:
-	cfs_b->timer_active = 0;
 	return 1;
 }
 
@@ -3999,6 +3991,7 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
 {
 	struct cfs_bandwidth *cfs_b =
 		container_of(timer, struct cfs_bandwidth, slack_timer);
+
 	do_sched_cfs_slack_timer(cfs_b);
 
 	return HRTIMER_NORESTART;
@@ -4008,15 +4001,12 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
 	struct cfs_bandwidth *cfs_b =
 		container_of(timer, struct cfs_bandwidth, period_timer);
-	ktime_t now;
 	int overrun;
 	int idle = 0;
 
 	raw_spin_lock(&cfs_b->lock);
 	for (;;) {
-		now = hrtimer_cb_get_time(timer);
-		overrun = hrtimer_forward(timer, now, cfs_b->period);
-
+		overrun = hrtimer_forward_now(timer, cfs_b->period);
 		if (!overrun)
 			break;
 
@@ -4047,27 +4037,8 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 }
 
-/* requires cfs_b->lock, may release to reprogram timer */
-void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force)
+void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
-	/*
-	 * The timer may be active because we're trying to set a new bandwidth
-	 * period or because we're racing with the tear-down path
-	 * (timer_active==0 becomes visible before the hrtimer call-back
-	 * terminates).  In either case we ensure that it's re-programmed
-	 */
-	while (unlikely(hrtimer_active(&cfs_b->period_timer)) &&
-	       hrtimer_try_to_cancel(&cfs_b->period_timer) < 0) {
-		/* bounce the lock to allow do_sched_cfs_period_timer to run */
-		raw_spin_unlock(&cfs_b->lock);
-		cpu_relax();
-		raw_spin_lock(&cfs_b->lock);
-		/* if someone else restarted the timer then we're done */
-		if (!force && cfs_b->timer_active)
-			return;
-	}
-
-	cfs_b->timer_active = 1;
 	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
 }
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 575da76..b0febf2 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -18,19 +18,20 @@ static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
 {
 	struct rt_bandwidth *rt_b =
 		container_of(timer, struct rt_bandwidth, rt_period_timer);
-	ktime_t now;
-	int overrun;
 	int idle = 0;
+	int overrun;
 
+	raw_spin_lock(&rt_b->rt_runtime_lock);
 	for (;;) {
-		now = hrtimer_cb_get_time(timer);
-		overrun = hrtimer_forward(timer, now, rt_b->rt_period);
-
+		overrun = hrtimer_forward_now(timer, rt_b->rt_period);
 		if (!overrun)
 			break;
 
+		raw_spin_unlock(&rt_b->rt_runtime_lock);
 		idle = do_sched_rt_period_timer(rt_b, overrun);
+		raw_spin_lock(&rt_b->rt_runtime_lock);
 	}
+	raw_spin_unlock(&rt_b->rt_runtime_lock);
 
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
 }
@@ -52,9 +53,6 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
 	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
 		return;
 
-	if (hrtimer_active(&rt_b->rt_period_timer))
-		return;
-
 	raw_spin_lock(&rt_b->rt_runtime_lock);
 	start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e0e1299..08606a1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -215,7 +215,7 @@ struct cfs_bandwidth {
 	s64 hierarchical_quota;
 	u64 runtime_expires;
 
-	int idle, timer_active;
+	int idle;
 	struct hrtimer period_timer, slack_timer;
 	struct list_head throttled_cfs_rq;
 
@@ -306,7 +306,7 @@ extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
 extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
 
 extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
-extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force);
+extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
 extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
 
 extern void free_rt_sched_group(struct task_group *tg);

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tip:timers/core] perf: Fix mux_interval hrtimer wreckage
  2015-04-15  9:41 ` [PATCH 3/3] perf: Fix mux_interval hrtimer wreckage Peter Zijlstra
  2015-04-15 13:48   ` David Ahern
  2015-04-22 15:12   ` Thomas Gleixner
@ 2015-04-22 19:15   ` tip-bot for Peter Zijlstra
  2 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2015-04-22 19:15 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: hpa, tglx, linux-kernel, peterz, mingo

Commit-ID:  272325c4821f052092c41feac21f4a1a46f0ad48
Gitweb:     http://git.kernel.org/tip/272325c4821f052092c41feac21f4a1a46f0ad48
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 15 Apr 2015 11:41:58 +0200
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 22 Apr 2015 17:12:22 +0200

perf: Fix mux_interval hrtimer wreckage

Thomas stumbled over the hrtimer_forward_now() in
perf_event_mux_interval_ms_store() and noticed its broken-ness.

You cannot just change the expiry time of an active timer, it will
destroy the red-black tree order and cause havoc.

Change it to (re)start the timer instead; (re)starting a timer will
dequeue and enqueue it and therefore preserve rb-tree order.

Since we cannot enqueue remotely, wrap the thing in
cpu_function_call(); this, however, mandates that we restrict
ourselves to online cpus. Also serialize the entire setting so we
don't get multiple concurrent threads trying to update to different
values.
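
In sketch form (condensed from the diff below, not new code), the
store path changes like this:

	/* broken: rewriting the expiry of a possibly queued timer
	 * corrupts the rb-tree order of its timer base */
	hrtimer_forward_now(&cpuctx->hrtimer, cpuctx->hrtimer_interval);

	/* fixed: (re)start the timer on its own CPU instead; the
	 * (re)start dequeues and re-enqueues, preserving order */
	cpu_function_call(cpu, (remote_function_f)perf_mux_hrtimer_restart,
			  cpuctx);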

Also fix a problem in perf_mux_hrtimer_restart(): checking against
hrtimer_active() can actually lose us the timer when timer->state ==
HRTIMER_STATE_CALLBACK and the callback has already decided NORESTART.

Furthermore, it doesn't make sense to test
hrtimer_callback_running() when we have already tested
hrtimer_active(); but with the above change, we must call it
explicitly when callback_running.

Lastly, rename a few functions:

  s/perf_cpu_hrtimer_/perf_mux_hrtimer_/ -- because I could not find
                                            the mux timer function

  s/\<hr\>/timer/ -- because that's the normal way of calling things.

Fixes: 62b856397927 ("perf: Add sysfs entry to adjust multiplexing interval per PMU")
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150415095011.863052571@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/events/core.c | 63 ++++++++++++++++++++++++++++++----------------------
 1 file changed, 36 insertions(+), 27 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 05309fd..e7ed00b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -51,9 +51,11 @@
 
 static struct workqueue_struct *perf_wq;
 
+typedef int (*remote_function_f)(void *);
+
 struct remote_function_call {
 	struct task_struct	*p;
-	int			(*func)(void *info);
+	remote_function_f	func;
 	void			*info;
 	int			ret;
 };
@@ -86,7 +88,7 @@ static void remote_function(void *data)
  *	    -EAGAIN - when the process moved away
  */
 static int
-task_function_call(struct task_struct *p, int (*func) (void *info), void *info)
+task_function_call(struct task_struct *p, remote_function_f func, void *info)
 {
 	struct remote_function_call data = {
 		.p	= p,
@@ -110,7 +112,7 @@ task_function_call(struct task_struct *p, int (*func) (void *info), void *info)
  *
  * returns: @func return value or -ENXIO when the cpu is offline
  */
-static int cpu_function_call(int cpu, int (*func) (void *info), void *info)
+static int cpu_function_call(int cpu, remote_function_f func, void *info)
 {
 	struct remote_function_call data = {
 		.p	= NULL,
@@ -747,7 +749,7 @@ perf_cgroup_mark_enabled(struct perf_event *event,
 /*
  * function must be called with interrupts disbled
  */
-static enum hrtimer_restart perf_cpu_hrtimer_handler(struct hrtimer *hr)
+static enum hrtimer_restart perf_mux_hrtimer_handler(struct hrtimer *hr)
 {
 	struct perf_cpu_context *cpuctx;
 	enum hrtimer_restart ret = HRTIMER_NORESTART;
@@ -771,7 +773,7 @@ static enum hrtimer_restart perf_cpu_hrtimer_handler(struct hrtimer *hr)
 }
 
 /* CPU is going down */
-void perf_cpu_hrtimer_cancel(int cpu)
+void perf_mux_hrtimer_cancel(int cpu)
 {
 	struct perf_cpu_context *cpuctx;
 	struct pmu *pmu;
@@ -798,11 +800,11 @@ void perf_cpu_hrtimer_cancel(int cpu)
 	local_irq_restore(flags);
 }
 
-static void __perf_cpu_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
+static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
 {
-	struct hrtimer *hr = &cpuctx->hrtimer;
+	struct hrtimer *timer = &cpuctx->hrtimer;
 	struct pmu *pmu = cpuctx->ctx.pmu;
-	int timer;
+	u64 interval;
 
 	/* no multiplexing needed for SW PMU */
 	if (pmu->task_ctx_nr == perf_sw_context)
@@ -812,29 +814,30 @@ static void __perf_cpu_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
 	 * check default is sane, if not set then force to
 	 * default interval (1/tick)
 	 */
-	timer = pmu->hrtimer_interval_ms;
-	if (timer < 1)
-		timer = pmu->hrtimer_interval_ms = PERF_CPU_HRTIMER;
+	interval = pmu->hrtimer_interval_ms;
+	if (interval < 1)
+		interval = pmu->hrtimer_interval_ms = PERF_CPU_HRTIMER;
 
-	cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * timer);
+	cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * interval);
 
-	hrtimer_init(hr, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
-	hr->function = perf_cpu_hrtimer_handler;
+	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
+	timer->function = perf_mux_hrtimer_handler;
 }
 
-static void perf_cpu_hrtimer_restart(struct perf_cpu_context *cpuctx)
+static int perf_mux_hrtimer_restart(struct perf_cpu_context *cpuctx)
 {
-	struct hrtimer *hr = &cpuctx->hrtimer;
+	struct hrtimer *timer = &cpuctx->hrtimer;
 	struct pmu *pmu = cpuctx->ctx.pmu;
 
 	/* not for SW PMU */
 	if (pmu->task_ctx_nr == perf_sw_context)
-		return;
+		return 0;
 
-	if (hrtimer_active(hr))
-		return;
+	if (hrtimer_is_queued(timer))
+		return 0;
 
-	hrtimer_start(hr, cpuctx->hrtimer_interval, HRTIMER_MODE_REL_PINNED);
+	hrtimer_start(timer, cpuctx->hrtimer_interval, HRTIMER_MODE_REL_PINNED);
+	return 0;
 }
 
 void perf_pmu_disable(struct pmu *pmu)
@@ -1913,7 +1916,7 @@ group_sched_in(struct perf_event *group_event,
 
 	if (event_sched_in(group_event, cpuctx, ctx)) {
 		pmu->cancel_txn(pmu);
-		perf_cpu_hrtimer_restart(cpuctx);
+		perf_mux_hrtimer_restart(cpuctx);
 		return -EAGAIN;
 	}
 
@@ -1960,7 +1963,7 @@ group_error:
 
 	pmu->cancel_txn(pmu);
 
-	perf_cpu_hrtimer_restart(cpuctx);
+	perf_mux_hrtimer_restart(cpuctx);
 
 	return -EAGAIN;
 }
@@ -2233,7 +2236,7 @@ static int __perf_event_enable(void *info)
 		 */
 		if (leader != event) {
 			group_sched_out(leader, cpuctx, ctx);
-			perf_cpu_hrtimer_restart(cpuctx);
+			perf_mux_hrtimer_restart(cpuctx);
 		}
 		if (leader->attr.pinned) {
 			update_group_times(leader);
@@ -7143,6 +7146,8 @@ perf_event_mux_interval_ms_show(struct device *dev,
 	return snprintf(page, PAGE_SIZE-1, "%d\n", pmu->hrtimer_interval_ms);
 }
 
+static DEFINE_MUTEX(mux_interval_mutex);
+
 static ssize_t
 perf_event_mux_interval_ms_store(struct device *dev,
 				 struct device_attribute *attr,
@@ -7162,17 +7167,21 @@ perf_event_mux_interval_ms_store(struct device *dev,
 	if (timer == pmu->hrtimer_interval_ms)
 		return count;
 
+	mutex_lock(&mux_interval_mutex);
 	pmu->hrtimer_interval_ms = timer;
 
 	/* update all cpuctx for this PMU */
-	for_each_possible_cpu(cpu) {
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
 		struct perf_cpu_context *cpuctx;
 		cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
 		cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * timer);
 
-		if (hrtimer_active(&cpuctx->hrtimer))
-			hrtimer_forward_now(&cpuctx->hrtimer, cpuctx->hrtimer_interval);
+		cpu_function_call(cpu,
+			(remote_function_f)perf_mux_hrtimer_restart, cpuctx);
 	}
+	put_online_cpus();
+	mutex_unlock(&mux_interval_mutex);
 
 	return count;
 }
@@ -7277,7 +7286,7 @@ skip_type:
 		lockdep_set_class(&cpuctx->ctx.lock, &cpuctx_lock);
 		cpuctx->ctx.pmu = pmu;
 
-		__perf_cpu_hrtimer_init(cpuctx, cpu);
+		__perf_mux_hrtimer_init(cpuctx, cpu);
 
 		cpuctx->unique_pmu = pmu;
 	}

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:timers/core] hrtimer: Allow concurrent hrtimer_start() for self restarting timers
  2015-04-22 19:15       ` [tip:timers/core] hrtimer: Allow concurrent hrtimer_start() for self restarting timers tip-bot for Peter Zijlstra
@ 2015-05-12 13:52         ` Sasha Levin
  2015-05-13 13:43           ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Sasha Levin @ 2015-05-12 13:52 UTC (permalink / raw)
  To: linux-kernel, pjt, tglx, klamm, mingo, bsegall, peterz, hpa

On 04/22/2015 03:15 PM, tip-bot for Peter Zijlstra wrote:
> Commit-ID:  5de2755c8c8b3a6b8414870e2c284914a2b42e4d
> Gitweb:     http://git.kernel.org/tip/5de2755c8c8b3a6b8414870e2c284914a2b42e4d
> Author:     Peter Zijlstra <peterz@infradead.org>
> AuthorDate: Tue, 20 May 2014 15:49:48 +0200
> Committer:  Thomas Gleixner <tglx@linutronix.de>
> CommitDate: Wed, 22 Apr 2015 17:06:52 +0200
> 
> hrtimer: Allow concurrent hrtimer_start() for self restarting timers
> 
> Because we drop cpu_base->lock around calling hrtimer::function, it is
> possible for hrtimer_start() to come in between and enqueue the timer.
> 
> If hrtimer::function then returns HRTIMER_RESTART we'll hit the BUG_ON
> because HRTIMER_STATE_ENQUEUED will be set.
> 
> Since the above is a perfectly valid scenario, remove the BUG_ON and
> make the enqueue_hrtimer() call conditional on the timer not being
> enqueued already.
> 
> NOTE: in that concurrent scenario it's entirely common for both sites
> to want to modify the hrtimer. Since hrtimers don't provide
> serialization themselves, be sure to provide some such that the
> hrtimer::function and the hrtimer_start() caller don't both try and
> fudge the expiration state at the same time.
> 
> To that effect, add a WARN when someone tries to forward an already
> enqueued timer, the most common way to change the expiry of self
> restarting timers. Ideally we'd put the WARN in everything modifying
> the expiry but most of that is inlines and we don't need the bloat.
> 
> Fixes: 2d44ae4d7135 ("hrtimer: clean up cpu->base locking tricks")
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Roman Gushchin <klamm@yandex-team.ru>
> Cc: Paul Turner <pjt@google.com>
> Link: http://lkml.kernel.org/r/20150415113105.GT5029@twins.programming.kicks-ass.net
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  kernel/time/hrtimer.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 3bac942..4adf320 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -799,6 +799,9 @@ u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
>  	if (delta.tv64 < 0)
>  		return 0;
>  
> +	if (WARN_ON(timer->state & HRTIMER_STATE_ENQUEUED))
> +		return 0;
> +
>  	if (interval.tv64 < hrtimer_resolution)
>  		interval.tv64 = hrtimer_resolution;

Hey Peter,

I seem to be hitting this warning with the latest -next (2015-05-11):

[3344291.055800] WARNING: CPU: 0 PID: 9422 at kernel/time/hrtimer.c:802 hrtimer_forward+0x1f9/0x330()
[3344291.057883] Modules linked in:
[3344291.058734] CPU: 0 PID: 9422 Comm: trinity-main Tainted: G        W       4.1.0-rc3-next-20150511-sasha-00037-g87c65d4-dirty #2199
[3344291.061763]  ffff880003a88000 00000000d4a58de9 ffff880021007c88 ffffffffabc833ac
[3344291.063726]  0000000000000000 0000000000000000 ffff880021007cd8 ffffffffa21f0a86
[3344291.065656]  ffffffffae6936c8 ffffffffa2386319 ffff880021007cb8 ffff8800211f5f90
[3344291.067590] Call Trace:
[3344291.068085]  <IRQ>  [<ffffffffabc833ac>] dump_stack+0x4f/0x7b
[3344291.068949]  [<ffffffffa21f0a86>] warn_slowpath_common+0xc6/0x120
[3344291.070382]  [<ffffffffa21f0cca>] warn_slowpath_null+0x1a/0x20
[3344291.071078]  [<ffffffffa2386319>] hrtimer_forward+0x1f9/0x330
[3344291.071834]  [<ffffffffa24fc7a0>] perf_mux_hrtimer_handler+0x340/0x750
[3344291.072616]  [<ffffffffa238830c>] __hrtimer_run_queues+0x30c/0x10d0
[3344291.075592]  [<ffffffffa238a7ac>] hrtimer_interrupt+0x19c/0x480
[3344291.076340]  [<ffffffffa20c1bef>] local_apic_timer_interrupt+0x6f/0xc0
[3344291.077156]  [<ffffffffabcf3af3>] smp_trace_apic_timer_interrupt+0xe3/0x859
[3344291.077968]  [<ffffffffabcf1df0>] trace_apic_timer_interrupt+0x70/0x80
[3344291.078729]  <EOI>  [<ffffffffa2620890>] ? arch_local_irq_restore+0x10/0x30
[3344291.079582]  [<ffffffffa2626ff4>] __slab_alloc+0x704/0x790
[3344291.082482]  [<ffffffffa2627324>] kmem_cache_alloc+0x2a4/0x3a0
[3344291.083944]  [<ffffffffa27cdf6d>] proc_alloc_inode+0x1d/0x270
[3344291.084631]  [<ffffffffa26d1a7b>] alloc_inode+0x5b/0x170
[3344291.085278]  [<ffffffffa26d7501>] new_inode_pseudo+0x11/0xd0
[3344291.085950]  [<ffffffffa27ce698>] proc_get_inode+0x18/0x580
[3344291.086615]  [<ffffffffa27dd946>] proc_lookup_de+0xc6/0x160
[3344291.088717]  [<ffffffffa27f168c>] proc_tgid_net_lookup+0x5c/0xa0
[3344291.089435]  [<ffffffffa269f470>] lookup_real+0x90/0x120
[3344291.090071]  [<ffffffffa26a0183>] __lookup_hash+0xe3/0x100
[3344291.092954]  [<ffffffffa26a96d7>] walk_component+0x697/0xc10
[3344291.096353]  [<ffffffffa26ae0cf>] path_lookupat+0x13f/0x380
[3344291.097727]  [<ffffffffa26ae441>] filename_lookup+0x131/0x1f0
[3344291.098411]  [<ffffffffa26b4d11>] user_path_at_empty+0xc1/0x160
[3344291.101545]  [<ffffffffa26b4dc1>] user_path_at+0x11/0x20
[3344291.102274]  [<ffffffffa268f131>] vfs_fstatat+0xb1/0x140
[3344291.105842]  [<ffffffffa2690299>] SYSC_newfstatat+0x79/0xd0
[3344291.108737]  [<ffffffffa269041e>] SyS_newfstatat+0xe/0x10
[3344291.109377]  [<ffffffffabcf0ee2>] tracesys_phase2+0x88/0x8d


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:timers/core] hrtimer: Allow concurrent hrtimer_start() for self restarting timers
  2015-05-12 13:52         ` Sasha Levin
@ 2015-05-13 13:43           ` Peter Zijlstra
  2015-05-13 13:54             ` Ingo Molnar
                               ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-05-13 13:43 UTC (permalink / raw)
  To: Sasha Levin; +Cc: linux-kernel, pjt, tglx, klamm, mingo, bsegall, hpa

On Tue, May 12, 2015 at 09:52:09AM -0400, Sasha Levin wrote:
> Hey Peter,
> 
> I seem to be hitting this warning with the latest -next (2015-05-11):
> 
> [3344291.055800] WARNING: CPU: 0 PID: 9422 at kernel/time/hrtimer.c:802 hrtimer_forward+0x1f9/0x330()

Indeed.

So I can't seem to come up with a pretty solution :-(

The 'problem' is creating a periodic timer that can stop when there's no
work left and ensuring (re)start doesn't get lost or lose overruns.

The problem with the current code is that the re-start can come before
the callback does fwd, at which point the fwd will spew the above
warning (rightly so).

Now, conceptually it's easy to detect if you're before or after the fwd
by comparing the expiration time against the current time. Of course,
that's expensive (and racy) because we don't have the current time.
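
A hypothetical sketch of that check (not in the patch below):

	/*
	 * If the expiry still lies in the future, the handler has
	 * already done the fwd. The clock read is the expense, and
	 * the handler can do the fwd between the read and the
	 * compare -- hence racy.
	 */
	if (hrtimer_get_expires_tv64(timer) > ktime_get().tv64)
		; /* already forwarded, just enqueue */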

Alternatively one could cache this state inside the timer, but then
everybody pays the overhead of maintaining this extra state, and that is
undesired.

The only other option that I could see is the external timer_active
variable, which I tried to kill before.
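
Condensed from the patch below (the names here are generic
placeholders), both sites then serialize on one lock and the active
flag decides who does the fwd:

	/* timer handler, with the lock held */
	if (work_left)
		hrtimer_forward_now(timer, interval);
	else
		active = 0;
	return work_left ? HRTIMER_RESTART : HRTIMER_NORESTART;

	/* (re)start path, with the same lock held */
	if (!active) {
		active = 1;
		hrtimer_forward_now(timer, interval);
		hrtimer_start_expires(timer, HRTIMER_MODE_ABS_PINNED);
	}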

So I suppose the below (compile tested) patch should fix things, but
seeing how I've been up since 4am I might just have missed something
obvious :-)

Almost-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/perf_event.h |  4 ++++
 kernel/events/core.c       | 28 +++++++++++++++-------------
 kernel/sched/core.c        | 12 ------------
 kernel/sched/fair.c        | 17 +++++++++++++----
 kernel/sched/rt.c          |  8 +++++++-
 kernel/sched/sched.h       |  5 ++---
 6 files changed, 41 insertions(+), 33 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e86f85abeda7..1f5607b220e4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -566,8 +566,12 @@ struct perf_cpu_context {
 	struct perf_event_context	*task_ctx;
 	int				active_oncpu;
 	int				exclusive;
+
+	raw_spinlock_t			hrtimer_lock;
 	struct hrtimer			hrtimer;
 	ktime_t				hrtimer_interval;
+	unsigned int			hrtimer_active;
+
 	struct pmu			*unique_pmu;
 	struct perf_cgroup		*cgrp;
 };
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 84231a146dd5..3dac1d0c6072 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -752,24 +752,21 @@ perf_cgroup_mark_enabled(struct perf_event *event,
 static enum hrtimer_restart perf_mux_hrtimer_handler(struct hrtimer *hr)
 {
 	struct perf_cpu_context *cpuctx;
-	enum hrtimer_restart ret = HRTIMER_NORESTART;
 	int rotations = 0;
 
 	WARN_ON(!irqs_disabled());
 
 	cpuctx = container_of(hr, struct perf_cpu_context, hrtimer);
-
 	rotations = perf_rotate_context(cpuctx);
 
-	/*
-	 * arm timer if needed
-	 */
-	if (rotations) {
+	raw_spin_lock(&cpuctx->hrtimer_lock);
+	if (rotations)
 		hrtimer_forward_now(hr, cpuctx->hrtimer_interval);
-		ret = HRTIMER_RESTART;
-	}
+	else
+		cpuctx->hrtimer_active = 0;
+	raw_spin_unlock(&cpuctx->hrtimer_lock);
 
-	return ret;
+	return rotations ? HRTIMER_RESTART : HRTIMER_NORESTART;
 }
 
 static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
@@ -792,7 +789,7 @@ static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
 
 	cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * interval);
 
-	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
+	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
 	timer->function = perf_mux_hrtimer_handler;
 }
 
@@ -800,15 +797,20 @@ static int perf_mux_hrtimer_restart(struct perf_cpu_context *cpuctx)
 {
 	struct hrtimer *timer = &cpuctx->hrtimer;
 	struct pmu *pmu = cpuctx->ctx.pmu;
+	unsigned long flags;
 
 	/* not for SW PMU */
 	if (pmu->task_ctx_nr == perf_sw_context)
 		return 0;
 
-	if (hrtimer_is_queued(timer))
-		return 0;
+	raw_spin_lock_irqsave(&cpuctx->hrtimer_lock, flags);
+	if (!cpuctx->hrtimer_active) {
+		cpuctx->hrtimer_active = 1;
+		hrtimer_forward_now(timer, cpuctx->hrtimer_interval);
+		hrtimer_start_expires(timer, HRTIMER_MODE_ABS_PINNED);
+	}
+	raw_spin_unlock_irqrestore(&cpuctx->hrtimer_lock, flags);
 
-	hrtimer_start(timer, cpuctx->hrtimer_interval, HRTIMER_MODE_REL_PINNED);
 	return 0;
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 613b61e381ca..bd9182aa74c0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -90,18 +90,6 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/sched.h>
 
-void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
-{
-	/*
-	 * Do not forward the expiration time of active timers;
-	 * we do not want to loose an overrun.
-	 */
-	if (!hrtimer_active(period_timer))
-		hrtimer_forward_now(period_timer, period);
-
-	hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED);
-}
-
 DEFINE_MUTEX(sched_domains_mutex);
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8c1510abeefa..2c08783ee95c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3883,8 +3883,9 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
 	if (runtime_refresh_within(cfs_b, min_left))
 		return;
 
-	start_bandwidth_timer(&cfs_b->slack_timer,
-				ns_to_ktime(cfs_bandwidth_slack_period));
+	hrtimer_start(&cfs_b->slack_timer,
+			ns_to_ktime(cfs_bandwidth_slack_period),
+			HRTIMER_MODE_REL);
 }
 
 /* we know any runtime found here is valid as update_curr() precedes return */
@@ -4025,6 +4026,8 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 
 		idle = do_sched_cfs_period_timer(cfs_b, overrun);
 	}
+	if (idle)
+		cfs_b->period_active = 0;
 	raw_spin_unlock(&cfs_b->lock);
 
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
@@ -4038,7 +4041,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
-	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
 	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->slack_timer.function = sched_cfs_slack_timer;
@@ -4052,7 +4055,13 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 
 void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
-	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
+	lockdep_assert_held(&cfs_b->lock);
+
+	if (!cfs_b->period_active) {
+		cfs_b->period_active = 1;
+		hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
+		hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
+	}
 }
 
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 8781a38a636e..7d7093c51f8d 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -31,6 +31,8 @@ static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
 		idle = do_sched_rt_period_timer(rt_b, overrun);
 		raw_spin_lock(&rt_b->rt_runtime_lock);
 	}
+	if (idle)
+		rt_b->rt_period_active = 0;
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
 
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
@@ -54,7 +56,11 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
 		return;
 
 	raw_spin_lock(&rt_b->rt_runtime_lock);
-	start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
+	if (!rt_b->rt_period_active) {
+		rt_b->rt_period_active = 1;
+		hrtimer_forward_now(&rt_b->rt_period_timer, rt_b->rt_period);
+		hrtimer_start_expires(&rt_b->rt_period_timer, HRTIMER_MODE_ABS_PINNED);
+	}
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 460f98648afd..f10a445910c9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -137,6 +137,7 @@ struct rt_bandwidth {
 	ktime_t			rt_period;
 	u64			rt_runtime;
 	struct hrtimer		rt_period_timer;
+	unsigned int		rt_period_active;
 };
 
 void __dl_clear_params(struct task_struct *p);
@@ -221,7 +222,7 @@ struct cfs_bandwidth {
 	s64 hierarchical_quota;
 	u64 runtime_expires;
 
-	int idle;
+	int idle, period_active;
 	struct hrtimer period_timer, slack_timer;
 	struct list_head throttled_cfs_rq;
 
@@ -1410,8 +1411,6 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
 static inline void sched_avg_update(struct rq *rq) { }
 #endif
 
-extern void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period);
-
 /*
  * __task_rq_lock - lock the rq @p resides on.
  */

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:timers/core] hrtimer: Allow concurrent hrtimer_start() for self restarting timers
  2015-05-13 13:43           ` Peter Zijlstra
@ 2015-05-13 13:54             ` Ingo Molnar
  2015-05-13 17:25             ` bsegall
  2015-05-13 23:09             ` Sasha Levin
  2 siblings, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2015-05-13 13:54 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Sasha Levin, linux-kernel, pjt, tglx, klamm, bsegall, hpa


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, May 12, 2015 at 09:52:09AM -0400, Sasha Levin wrote:
> > Hey Peter,
> > 
> > I seem to be hitting this warning with the latest -next (2015-05-11):
> > 
> > [3344291.055800] WARNING: CPU: 0 PID: 9422 at kernel/time/hrtimer.c:802 hrtimer_forward+0x1f9/0x330()
> 
> Indeed.
> 
> So I can't seem to come up with a pretty solution :-(

I'm running into this rather frequently as well, in -tip integration 
testing, on real hardware, so we need a solution :-/

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:timers/core] hrtimer: Allow concurrent hrtimer_start() for self restarting timers
  2015-05-13 13:43           ` Peter Zijlstra
  2015-05-13 13:54             ` Ingo Molnar
@ 2015-05-13 17:25             ` bsegall
  2015-05-13 23:09             ` Sasha Levin
  2 siblings, 0 replies; 22+ messages in thread
From: bsegall @ 2015-05-13 17:25 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Sasha Levin, linux-kernel, pjt, tglx, klamm, mingo, hpa

Peter Zijlstra <peterz@infradead.org> writes:

> @@ -4038,7 +4041,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>  	cfs_b->period = ns_to_ktime(default_cfs_period());
>  
>  	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
> -	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
>  	cfs_b->period_timer.function = sched_cfs_period_timer;
>  	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>  	cfs_b->slack_timer.function = sched_cfs_slack_timer;
> @@ -4052,7 +4055,13 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  
>  void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>  {
> -	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
> +	lockdep_assert_held(&cfs_b->lock);
> +
> +	if (!cfs_b->period_active) {
> +		cfs_b->period_active = 1;
> +		hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> +		hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> +	}
>  }
>  

I don't think the cfsb period timer actually needs to be pinned, even
though it has generally run as such due to start_bandwidth_timer.

The period_active code looks fine.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:timers/core] hrtimer: Allow concurrent hrtimer_start() for self restarting timers
  2015-05-13 13:43           ` Peter Zijlstra
  2015-05-13 13:54             ` Ingo Molnar
  2015-05-13 17:25             ` bsegall
@ 2015-05-13 23:09             ` Sasha Levin
  2015-05-14 10:23               ` Peter Zijlstra
  2 siblings, 1 reply; 22+ messages in thread
From: Sasha Levin @ 2015-05-13 23:09 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, pjt, tglx, klamm, mingo, bsegall, hpa

On 05/13/2015 09:43 AM, Peter Zijlstra wrote:
> On Tue, May 12, 2015 at 09:52:09AM -0400, Sasha Levin wrote:
>> Hey Peter,
>>
>> I seem to be hitting this warning with the latest -next (2015-05-11):
>>
>> [3344291.055800] WARNING: CPU: 0 PID: 9422 at kernel/time/hrtimer.c:802 hrtimer_forward+0x1f9/0x330()
> 
> Indeed.
> 
> So I can't seem to come up with a pretty solution :-(
> 
> The 'problem' is creating a periodic timer that can stop when there's no
> work left and ensuring (re)start doesn't get lost or lose overruns.
> 
> The problem with the current code is that the re-start can come before
> the callback does fwd, at which point the fwd will spew the above
> warning (rightly so).
> 
> Now, conceptually it's easy to detect if you're before or after the fwd
> by comparing the expiration time against the current time. Of course,
> that's expensive (and racy) because we don't have the current time.
> 
> Alternatively one could cache this state inside the timer, but then
> everybody pays the overhead of maintaining this extra state, and that is
> undesired.
> 
> The only other option that I could see is the external timer_active
> variable, which I tried to kill before.
> 
> So I suppose the below (compile tested) patch should fix things, but
> seeing how I've been up since 4am I might just have missed something
> obvious :-)

I know exactly how you feel...

> Almost-Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Original warning is gone, but now I get:

[  281.213520] INFO: trying to register non-static key.
[  281.214251] the code is fine but needs lockdep annotation.
[  281.214251] turning off the locking correctness validator.
[  281.214251] CPU: 0 PID: 10071 Comm: trinity-main Tainted: G        W       4.1.0-rc3-next-20150512-sasha-00051-gd40f47b-dirty #2213
[  281.214251]  ffffffffae4a5b20 00000000cbbe31dc ffff88000267f698 ffffffffa7c8e8ab
[  281.214251]  0000000000000000 0000000000000000 ffff88000267f848 ffffffff9e303567
[  281.214251]  ffff88001d015260 0000000000000004 ffff88001d014d20 0000000000000005
[  281.214251] Call Trace:
[  281.214251] dump_stack (lib/dump_stack.c:52)
[  281.214251] __lock_acquire (kernel/locking/lockdep.c:786 kernel/locking/lockdep.c:3134)
[  281.214251] ? x86_schedule_events (arch/x86/kernel/cpu/perf_event.c:849 (discriminator 2))
[  281.214251] ? lockdep_reset_lock (kernel/locking/lockdep.c:3105)
[  281.214251] ? x86_pmu_commit_txn (arch/x86/kernel/cpu/perf_event.c:1098)
[  281.214251] ? x86_pmu_commit_txn (arch/x86/kernel/cpu/perf_event.c:1731)
[  281.214251] ? x86_pmu_cancel_txn (arch/x86/kernel/cpu/perf_event.c:1720)
[  281.214251] ? perf_event_update_userpage (include/linux/rcupdate.h:917 kernel/events/core.c:4256)
[  281.214251] lock_acquire (kernel/locking/lockdep.c:3658)
[  281.214251] ? perf_mux_hrtimer_restart (kernel/events/core.c:807)
[  281.214251] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:119 kernel/locking/spinlock.c:159)
[  281.214251] ? perf_mux_hrtimer_restart (kernel/events/core.c:807)
[  281.214251] perf_mux_hrtimer_restart (kernel/events/core.c:807)
[  281.214251] group_sched_in (kernel/events/core.c:1963)
[  281.214251] ? sched_clock_cpu (kernel/sched/clock.c:311)
[  281.214251] ctx_sched_in (kernel/events/core.c:2725 kernel/events/core.c:2756)
[  281.214251] perf_event_sched_in (kernel/events/core.c:2025)
[  281.214251] __perf_install_in_context (kernel/events/core.c:2082)
[  281.214251] ? perf_mux_hrtimer_handler (kernel/events/core.c:2035)
[  281.214251] remote_function (kernel/events/core.c:74)
[  281.214251] ? pmu_dev_release (kernel/events/core.c:66)
[  281.214251] generic_exec_single (kernel/smp.c:157)
[  281.214251] ? pmu_dev_release (kernel/events/core.c:66)
[  281.214251] smp_call_function_single (kernel/smp.c:300)
[  281.214251] ? generic_exec_single (kernel/smp.c:273)
[  281.214251] ? __lock_is_held (kernel/locking/lockdep.c:3572)
[  281.214251] task_function_call (kernel/events/core.c:101)
[  281.214251] ? remote_function (kernel/events/core.c:92)
[  281.214251] ? perf_mux_hrtimer_handler (kernel/events/core.c:2035)
[  281.214251] perf_install_in_context (kernel/events/core.c:2121)
[  281.214251] ? add_event_to_ctx (kernel/events/core.c:2102)
[  281.214251] ? alloc_perf_context (kernel/events/core.c:3306)
[  281.214251] ? perf_event_alloc (kernel/events/core.c:7569)
[  281.214251] SYSC_perf_event_open (kernel/events/core.c:8133)
[  281.214251] ? perf_event_alloc (kernel/events/core.c:7856)
[  281.214251] ? syscall_trace_enter_phase1 (include/linux/context_tracking.h:27 arch/x86/kernel/ptrace.c:1480)
[  281.214251] SyS_perf_event_open (kernel/events/core.c:7853)
[  281.214251] tracesys_phase2 (arch/x86/kernel/entry_64.S:346)


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tip:timers/core] hrtimer: Allow concurrent hrtimer_start() for self restarting timers
  2015-05-13 23:09             ` Sasha Levin
@ 2015-05-14 10:23               ` Peter Zijlstra
  2015-05-18 15:21                 ` [tip:timers/core] sched,perf: Fix periodic timers tip-bot for Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-05-14 10:23 UTC (permalink / raw)
  To: Sasha Levin; +Cc: linux-kernel, pjt, tglx, klamm, mingo, bsegall, hpa

On Wed, May 13, 2015 at 07:09:40PM -0400, Sasha Levin wrote:
> On 05/13/2015 09:43 AM, Peter Zijlstra wrote:

> > So I suppose the below (compile tested) patch should fix things, but
> > seeing how I've been up since 4am I might just have missed something
> > obvious :-)
> 
> I know exactly how you feel...

How about I add a raw_spin_lock_init() for the new lock I added ;-)

This one seems to boot and do simple perf things with lockdep enabled.

---
 include/linux/perf_event.h |  4 ++++
 kernel/events/core.c       | 29 ++++++++++++++++-------------
 kernel/sched/core.c        | 12 ------------
 kernel/sched/fair.c        | 17 +++++++++++++----
 kernel/sched/rt.c          |  8 +++++++-
 kernel/sched/sched.h       |  5 ++---
 6 files changed, 42 insertions(+), 33 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e86f85abeda7..1f5607b220e4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -566,8 +566,12 @@ struct perf_cpu_context {
 	struct perf_event_context	*task_ctx;
 	int				active_oncpu;
 	int				exclusive;
+
+	raw_spinlock_t			hrtimer_lock;
 	struct hrtimer			hrtimer;
 	ktime_t				hrtimer_interval;
+	unsigned int			hrtimer_active;
+
 	struct pmu			*unique_pmu;
 	struct perf_cgroup		*cgrp;
 };
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 84231a146dd5..1c6c2826af1e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -752,24 +752,21 @@ perf_cgroup_mark_enabled(struct perf_event *event,
 static enum hrtimer_restart perf_mux_hrtimer_handler(struct hrtimer *hr)
 {
 	struct perf_cpu_context *cpuctx;
-	enum hrtimer_restart ret = HRTIMER_NORESTART;
 	int rotations = 0;
 
 	WARN_ON(!irqs_disabled());
 
 	cpuctx = container_of(hr, struct perf_cpu_context, hrtimer);
-
 	rotations = perf_rotate_context(cpuctx);
 
-	/*
-	 * arm timer if needed
-	 */
-	if (rotations) {
+	raw_spin_lock(&cpuctx->hrtimer_lock);
+	if (rotations)
 		hrtimer_forward_now(hr, cpuctx->hrtimer_interval);
-		ret = HRTIMER_RESTART;
-	}
+	else
+		cpuctx->hrtimer_active = 0;
+	raw_spin_unlock(&cpuctx->hrtimer_lock);
 
-	return ret;
+	return rotations ? HRTIMER_RESTART : HRTIMER_NORESTART;
 }
 
 static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
@@ -792,7 +789,8 @@ static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
 
 	cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * interval);
 
-	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
+	raw_spin_lock_init(&cpuctx->hrtimer_lock);
+	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
 	timer->function = perf_mux_hrtimer_handler;
 }
 
@@ -800,15 +798,20 @@ static int perf_mux_hrtimer_restart(struct perf_cpu_context *cpuctx)
 {
 	struct hrtimer *timer = &cpuctx->hrtimer;
 	struct pmu *pmu = cpuctx->ctx.pmu;
+	unsigned long flags;
 
 	/* not for SW PMU */
 	if (pmu->task_ctx_nr == perf_sw_context)
 		return 0;
 
-	if (hrtimer_is_queued(timer))
-		return 0;
+	raw_spin_lock_irqsave(&cpuctx->hrtimer_lock, flags);
+	if (!cpuctx->hrtimer_active) {
+		cpuctx->hrtimer_active = 1;
+		hrtimer_forward_now(timer, cpuctx->hrtimer_interval);
+		hrtimer_start_expires(timer, HRTIMER_MODE_ABS_PINNED);
+	}
+	raw_spin_unlock_irqrestore(&cpuctx->hrtimer_lock, flags);
 
-	hrtimer_start(timer, cpuctx->hrtimer_interval, HRTIMER_MODE_REL_PINNED);
 	return 0;
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 613b61e381ca..bd9182aa74c0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -90,18 +90,6 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/sched.h>
 
-void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
-{
-	/*
-	 * Do not forward the expiration time of active timers;
-	 * we do not want to loose an overrun.
-	 */
-	if (!hrtimer_active(period_timer))
-		hrtimer_forward_now(period_timer, period);
-
-	hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED);
-}
-
 DEFINE_MUTEX(sched_domains_mutex);
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8c1510abeefa..2c08783ee95c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3883,8 +3883,9 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
 	if (runtime_refresh_within(cfs_b, min_left))
 		return;
 
-	start_bandwidth_timer(&cfs_b->slack_timer,
-				ns_to_ktime(cfs_bandwidth_slack_period));
+	hrtimer_start(&cfs_b->slack_timer,
+			ns_to_ktime(cfs_bandwidth_slack_period),
+			HRTIMER_MODE_REL);
 }
 
 /* we know any runtime found here is valid as update_curr() precedes return */
@@ -4025,6 +4026,8 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 
 		idle = do_sched_cfs_period_timer(cfs_b, overrun);
 	}
+	if (idle)
+		cfs_b->period_active = 0;
 	raw_spin_unlock(&cfs_b->lock);
 
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
@@ -4038,7 +4041,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
-	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
 	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->slack_timer.function = sched_cfs_slack_timer;
@@ -4052,7 +4055,13 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 
 void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
-	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
+	lockdep_assert_held(&cfs_b->lock);
+
+	if (!cfs_b->period_active) {
+		cfs_b->period_active = 1;
+		hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
+		hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
+	}
 }
 
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 8781a38a636e..7d7093c51f8d 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -31,6 +31,8 @@ static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
 		idle = do_sched_rt_period_timer(rt_b, overrun);
 		raw_spin_lock(&rt_b->rt_runtime_lock);
 	}
+	if (idle)
+		rt_b->rt_period_active = 0;
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
 
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
@@ -54,7 +56,11 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
 		return;
 
 	raw_spin_lock(&rt_b->rt_runtime_lock);
-	start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
+	if (!rt_b->rt_period_active) {
+		rt_b->rt_period_active = 1;
+		hrtimer_forward_now(&rt_b->rt_period_timer, rt_b->rt_period);
+		hrtimer_start_expires(&rt_b->rt_period_timer, HRTIMER_MODE_ABS_PINNED);
+	}
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 460f98648afd..f10a445910c9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -137,6 +137,7 @@ struct rt_bandwidth {
 	ktime_t			rt_period;
 	u64			rt_runtime;
 	struct hrtimer		rt_period_timer;
+	unsigned int		rt_period_active;
 };
 
 void __dl_clear_params(struct task_struct *p);
@@ -221,7 +222,7 @@ struct cfs_bandwidth {
 	s64 hierarchical_quota;
 	u64 runtime_expires;
 
-	int idle;
+	int idle, period_active;
 	struct hrtimer period_timer, slack_timer;
 	struct list_head throttled_cfs_rq;
 
@@ -1410,8 +1411,6 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
 static inline void sched_avg_update(struct rq *rq) { }
 #endif
 
-extern void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period);
-
 /*
  * __task_rq_lock - lock the rq @p resides on.
  */

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tip:timers/core] sched,perf: Fix periodic timers
  2015-05-14 10:23               ` Peter Zijlstra
@ 2015-05-18 15:21                 ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2015-05-18 15:21 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: hpa, peterz, linux-kernel, sasha.levin, tglx, mingo

Commit-ID:  4cfafd3082afc707653aeb82e9f8e7b596fbbfd6
Gitweb:     http://git.kernel.org/tip/4cfafd3082afc707653aeb82e9f8e7b596fbbfd6
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Thu, 14 May 2015 12:23:11 +0200
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Mon, 18 May 2015 17:17:42 +0200

sched,perf: Fix periodic timers

In the below two commits (see Fixes) we have periodic timers that can
stop themselves when they're no longer required, but need to be
(re)-started when their idle condition changes.

A further complication is that we want the timer handler to always do
the forward such that it correctly deals with the overruns,
and we do not want to race such that the handler has already decided
to stop, but the (external) restart sees the timer still active and we
end up with a 'lost' timer.
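
A sketch of that lost-timer interleaving:

	timer callback                     (external) restart
	--------------                     ------------------
	decides to stop (NORESTART)
	                                   hrtimer_active() -> true
	                                   (state == CALLBACK)
	                                   skips starting the timer
	returns HRTIMER_NORESTART

	=> nobody restarts the timer; the timer is lost.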

The problem with the current code is that the re-start can come before
the callback does the forward, at which point the forward from the
callback will WARN about forwarding an enqueued timer.

Now, conceptually it's easy to detect if you're before or after the fwd
by comparing the expiration time against the current time. Of course,
that's expensive (and racy) because we don't have the current time.

Alternatively one could cache this state inside the timer, but then
everybody pays the overhead of maintaining this extra state, and that
is undesired.

The only other option that I could see is the external timer_active
variable, which I tried to kill before. I would love a nicer interface
for this seemingly simple 'problem' but alas.

Fixes: 272325c4821f ("perf: Fix mux_interval hrtimer wreckage")
Fixes: 77a4d1a1b9a1 ("sched: Cleanup bandwidth timers")
Cc: pjt@google.com
Cc: tglx@linutronix.de
Cc: klamm@yandex-team.ru
Cc: mingo@kernel.org
Cc: bsegall@google.com
Cc: hpa@zytor.com
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150514102311.GX21418@twins.programming.kicks-ass.net
---
 include/linux/perf_event.h |  4 ++++
 kernel/events/core.c       | 29 ++++++++++++++++-------------
 kernel/sched/core.c        | 12 ------------
 kernel/sched/fair.c        | 17 +++++++++++++----
 kernel/sched/rt.c          |  8 +++++++-
 kernel/sched/sched.h       |  5 ++---
 6 files changed, 42 insertions(+), 33 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 61992cf..cf3342a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -566,8 +566,12 @@ struct perf_cpu_context {
 	struct perf_event_context	*task_ctx;
 	int				active_oncpu;
 	int				exclusive;
+
+	raw_spinlock_t			hrtimer_lock;
 	struct hrtimer			hrtimer;
 	ktime_t				hrtimer_interval;
+	unsigned int			hrtimer_active;
+
 	struct pmu			*unique_pmu;
 	struct perf_cgroup		*cgrp;
 };
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f528829..d9c93f3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -752,24 +752,21 @@ perf_cgroup_mark_enabled(struct perf_event *event,
 static enum hrtimer_restart perf_mux_hrtimer_handler(struct hrtimer *hr)
 {
 	struct perf_cpu_context *cpuctx;
-	enum hrtimer_restart ret = HRTIMER_NORESTART;
 	int rotations = 0;
 
 	WARN_ON(!irqs_disabled());
 
 	cpuctx = container_of(hr, struct perf_cpu_context, hrtimer);
-
 	rotations = perf_rotate_context(cpuctx);
 
-	/*
-	 * arm timer if needed
-	 */
-	if (rotations) {
+	raw_spin_lock(&cpuctx->hrtimer_lock);
+	if (rotations)
 		hrtimer_forward_now(hr, cpuctx->hrtimer_interval);
-		ret = HRTIMER_RESTART;
-	}
+	else
+		cpuctx->hrtimer_active = 0;
+	raw_spin_unlock(&cpuctx->hrtimer_lock);
 
-	return ret;
+	return rotations ? HRTIMER_RESTART : HRTIMER_NORESTART;
 }
 
 static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
@@ -792,7 +789,8 @@ static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
 
 	cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * interval);
 
-	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
+	raw_spin_lock_init(&cpuctx->hrtimer_lock);
+	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
 	timer->function = perf_mux_hrtimer_handler;
 }
 
@@ -800,15 +798,20 @@ static int perf_mux_hrtimer_restart(struct perf_cpu_context *cpuctx)
 {
 	struct hrtimer *timer = &cpuctx->hrtimer;
 	struct pmu *pmu = cpuctx->ctx.pmu;
+	unsigned long flags;
 
 	/* not for SW PMU */
 	if (pmu->task_ctx_nr == perf_sw_context)
 		return 0;
 
-	if (hrtimer_is_queued(timer))
-		return 0;
+	raw_spin_lock_irqsave(&cpuctx->hrtimer_lock, flags);
+	if (!cpuctx->hrtimer_active) {
+		cpuctx->hrtimer_active = 1;
+		hrtimer_forward_now(timer, cpuctx->hrtimer_interval);
+		hrtimer_start_expires(timer, HRTIMER_MODE_ABS_PINNED);
+	}
+	raw_spin_unlock_irqrestore(&cpuctx->hrtimer_lock, flags);
 
-	hrtimer_start(timer, cpuctx->hrtimer_interval, HRTIMER_MODE_REL_PINNED);
 	return 0;
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d8a6196..e84aeb2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -90,18 +90,6 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/sched.h>
 
-void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
-{
-	/*
-	 * Do not forward the expiration time of active timers;
-	 * we do not want to loose an overrun.
-	 */
-	if (!hrtimer_active(period_timer))
-		hrtimer_forward_now(period_timer, period);
-
-	hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED);
-}
-
 DEFINE_MUTEX(sched_domains_mutex);
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e3b32eb..69be282 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3870,8 +3870,9 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
 	if (runtime_refresh_within(cfs_b, min_left))
 		return;
 
-	start_bandwidth_timer(&cfs_b->slack_timer,
-				ns_to_ktime(cfs_bandwidth_slack_period));
+	hrtimer_start(&cfs_b->slack_timer,
+			ns_to_ktime(cfs_bandwidth_slack_period),
+			HRTIMER_MODE_REL);
 }
 
 /* we know any runtime found here is valid as update_curr() precedes return */
@@ -4012,6 +4013,8 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 
 		idle = do_sched_cfs_period_timer(cfs_b, overrun);
 	}
+	if (idle)
+		cfs_b->period_active = 0;
 	raw_spin_unlock(&cfs_b->lock);
 
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
@@ -4025,7 +4028,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
-	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
 	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->slack_timer.function = sched_cfs_slack_timer;
@@ -4039,7 +4042,13 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 
 void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 {
-	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
+	lockdep_assert_held(&cfs_b->lock);
+
+	if (!cfs_b->period_active) {
+		cfs_b->period_active = 1;
+		hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
+		hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
+	}
 }
 
 static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index b0febf2..e43da53 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -31,6 +31,8 @@ static enum hrtimer_restart sched_rt_period_timer(struct hrtimer *timer)
 		idle = do_sched_rt_period_timer(rt_b, overrun);
 		raw_spin_lock(&rt_b->rt_runtime_lock);
 	}
+	if (idle)
+		rt_b->rt_period_active = 0;
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
 
 	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
@@ -54,7 +56,11 @@ static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
 		return;
 
 	raw_spin_lock(&rt_b->rt_runtime_lock);
-	start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
+	if (!rt_b->rt_period_active) {
+		rt_b->rt_period_active = 1;
+		hrtimer_forward_now(&rt_b->rt_period_timer, rt_b->rt_period);
+		hrtimer_start_expires(&rt_b->rt_period_timer, HRTIMER_MODE_ABS_PINNED);
+	}
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 08606a1..f9a58ef 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -131,6 +131,7 @@ struct rt_bandwidth {
 	ktime_t			rt_period;
 	u64			rt_runtime;
 	struct hrtimer		rt_period_timer;
+	unsigned int		rt_period_active;
 };
 
 void __dl_clear_params(struct task_struct *p);
@@ -215,7 +216,7 @@ struct cfs_bandwidth {
 	s64 hierarchical_quota;
 	u64 runtime_expires;
 
-	int idle;
+	int idle, period_active;
 	struct hrtimer period_timer, slack_timer;
 	struct list_head throttled_cfs_rq;
 
@@ -1406,8 +1407,6 @@ static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
 static inline void sched_avg_update(struct rq *rq) { }
 #endif
 
-extern void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period);
-
 /*
  * __task_rq_lock - lock the rq @p resides on.
  */

^ permalink raw reply	[flat|nested] 22+ messages in thread

