* [RFC][PATCH 0/5] Timekeeping fixes v2
@ 2013-12-11 19:11 John Stultz
  2013-12-11 19:11 ` [RFC][PATCH 1/5] timekeeping: Fix lost updates to tai adjustment John Stultz
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: John Stultz @ 2013-12-11 19:11 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Thomas Gleixner, Prarit Bhargava, Richard Cochran,
	Ingo Molnar

While looking into the lockdep splat reported by Sasha this week,
I came across a number of issues in the timekeeping code
(some related and some not).

Since I sent out v1, I came across another problem, as well as
a fix for the lockdep issue, so I wanted to resend the patches
for comment.

My current thought is that the first three patches are probably
urgent 3.13 items, whereas the last two can probably be deferred
to 3.14.

The TAI timer delay thinko is particularly embarrassing, but since
CLOCK_TAI was introduced in 3.10, it's not exactly a regression, and
probably doesn't yet have any users outside of my tests.

Ingo/Thomas: Let me know if you agree with the first three patches
as urgent. I'd probably skip the second, but it provides the
infrastructure for the lockdep solution.

Sigh. Back-porting that lockdep fix to 3.10 will be a bit of a
pain, and I'll have to look to see if we need it even further back
or not.

thanks
-john

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>

John Stultz (5):
  timekeeping: Fix lost updates to tai adjustment
  timekeeping: Fix potential lost pv notification of time change
  timekeeping: Avoid possible deadlock from clock_was_set_delayed
  timekeeping: Fix CLOCK_TAI timer/nanosleep delays
  timekeeping: Fix missing timekeeping_update in suspend path

 kernel/time/timekeeping.c | 37 +++++++++++++++++++++++++++----------
 1 file changed, 27 insertions(+), 10 deletions(-)

-- 
1.8.3.2


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC][PATCH 1/5] timekeeping: Fix lost updates to tai adjustment
  2013-12-11 19:11 [RFC][PATCH 0/5] Timekeeping fixes v2 John Stultz
@ 2013-12-11 19:11 ` John Stultz
  2013-12-11 19:11   ` John Stultz
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: John Stultz @ 2013-12-11 19:11 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Thomas Gleixner, Prarit Bhargava, Richard Cochran,
	Ingo Molnar, stable

Since 48cdc135d4840 (Implement a shadow timekeeper), we have to
call timekeeping_update() after any adjustment to the timekeeping
structure in order to make sure that any adjustments to the structure
persist.

Unfortunately, the updates to the tai offset via adjtimex do not
trigger this update, causing adjustments to the tai offset to be
made and then over-written by the previous value at the next
update_wall_time() call.

This patch resolves the issue by calling timekeeping_update()
right after setting the tai offset.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable <stable@vger.kernel.org> #3.10+
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 kernel/time/timekeeping.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 87b4f00..6bad3d9 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -610,6 +610,7 @@ void timekeeping_set_tai_offset(s32 tai_offset)
 	raw_spin_lock_irqsave(&timekeeper_lock, flags);
 	write_seqcount_begin(&timekeeper_seq);
 	__timekeeping_set_tai_offset(tk, tai_offset);
+	timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);
 	write_seqcount_end(&timekeeper_seq);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 	clock_was_set();
@@ -1698,7 +1699,7 @@ int do_adjtimex(struct timex *txc)
 
 	if (tai != orig_tai) {
 		__timekeeping_set_tai_offset(tk, tai);
-		update_pvclock_gtod(tk, true);
+		timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);
 		clock_was_set_delayed();
 	}
 	write_seqcount_end(&timekeeper_seq);
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC][PATCH 2/5] timekeeping: Fix potential lost pv notification of time change
  2013-12-11 19:11 [RFC][PATCH 0/5] Timekeeping fixes v2 John Stultz
@ 2013-12-11 19:11   ` John Stultz
  2013-12-11 19:11   ` John Stultz
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: John Stultz @ 2013-12-11 19:11 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Thomas Gleixner, Ingo Molnar, David Vrabel,
	Konrad Rzeszutek Wilk, Prarit Bhargava, Richard Cochran,
	xen-devel, stable

In 780427f0e11 (Indicate that clock was set in the pvclock
gtod notifier), logic was added to pass a CLOCK_WAS_SET
notification to the pvclock notifier chain.

While that patch added an action flag returned from
accumulate_nsecs_to_secs(), it only uses the returned value
in one location, and not in the logarithmic accumulation.

This means if a leap second triggered during the logarithmic
accumulation (which is most likely where it would happen),
the notification that the clock was set would not make it to
the pv notifiers.

This patch extends logarithmic_accumulation() to pass down
that action flag so proper notification will occur.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: <xen-devel@lists.xen.org>
Cc: stable <stable@vger.kernel.org> #3.11+
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 kernel/time/timekeeping.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 6bad3d9..998ec751 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1295,7 +1295,7 @@ static inline unsigned int accumulate_nsecs_to_secs(struct timekeeper *tk)
  * Returns the unconsumed cycles.
  */
 static cycle_t logarithmic_accumulation(struct timekeeper *tk, cycle_t offset,
-						u32 shift)
+						u32 shift, unsigned int *action)
 {
 	cycle_t interval = tk->cycle_interval << shift;
 	u64 raw_nsecs;
@@ -1309,7 +1309,7 @@ static cycle_t logarithmic_accumulation(struct timekeeper *tk, cycle_t offset,
 	tk->cycle_last += interval;
 
 	tk->xtime_nsec += tk->xtime_interval << shift;
-	accumulate_nsecs_to_secs(tk);
+	*action |= accumulate_nsecs_to_secs(tk);
 
 	/* Accumulate raw time */
 	raw_nsecs = (u64)tk->raw_interval << shift;
@@ -1367,7 +1367,7 @@ static void update_wall_time(void)
 	struct timekeeper *tk = &shadow_timekeeper;
 	cycle_t offset;
 	int shift = 0, maxshift;
-	unsigned int action;
+	unsigned int action = 0;
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&timekeeper_lock, flags);
@@ -1402,7 +1402,7 @@ static void update_wall_time(void)
 	maxshift = (64 - (ilog2(ntp_tick_length())+1)) - 1;
 	shift = min(shift, maxshift);
 	while (offset >= tk->cycle_interval) {
-		offset = logarithmic_accumulation(tk, offset, shift);
+		offset = logarithmic_accumulation(tk, offset, shift, &action);
 		if (offset < tk->cycle_interval<<shift)
 			shift--;
 	}
@@ -1420,7 +1420,7 @@ static void update_wall_time(void)
 	 * Finally, make sure that after the rounding
 	 * xtime_nsec isn't larger than NSEC_PER_SEC
 	 */
-	action = accumulate_nsecs_to_secs(tk);
+	action |= accumulate_nsecs_to_secs(tk);
 
 	write_seqcount_begin(&timekeeper_seq);
 	/* Update clock->cycle_last with the new value */
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread


* [RFC][PATCH 3/5] timekeeping: Avoid possible deadlock from clock_was_set_delayed
  2013-12-11 19:11 [RFC][PATCH 0/5] Timekeeping fixes v2 John Stultz
  2013-12-11 19:11 ` [RFC][PATCH 1/5] timekeeping: Fix lost updates to tai adjustment John Stultz
  2013-12-11 19:11   ` John Stultz
@ 2013-12-11 19:11 ` John Stultz
  2013-12-12 13:23   ` Ingo Molnar
  2013-12-12 16:34   ` Sasha Levin
  2013-12-11 19:11 ` [RFC][PATCH 4/5] timekeeping: Fix CLOCK_TAI timer/nanosleep delays John Stultz
  2013-12-11 19:11 ` [RFC][PATCH 5/5] timekeeping: Fix missing timekeeping_update in suspend path John Stultz
  4 siblings, 2 replies; 20+ messages in thread
From: John Stultz @ 2013-12-11 19:11 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Thomas Gleixner, Prarit Bhargava, Richard Cochran,
	Ingo Molnar, Sasha Levin, stable

As part of normal operations, the hrtimer subsystem frequently calls
into the timekeeping code, creating a locking order of
  hrtimer locks -> timekeeping locks

clock_was_set_delayed() was supposed to allow us to avoid deadlocks
between the timekeeping and hrtimer subsystems, so that we could
notify the hrtimer subsystem that the time had changed while holding
the timekeeping locks. This was done by scheduling delayed work
that would run later, once we were out of the timekeeping code.

But unfortunately the lock chains are complex enough that in
scheduling delayed work, we end up eventually trying to grab
an hrtimer lock.

Sasha Levin noticed this in testing when the new seqlock lockdep
enablement triggered the following (somewhat abbreviated) message:

[  251.100221] ======================================================
[  251.100221] [ INFO: possible circular locking dependency detected ]
[  251.100221] 3.13.0-rc2-next-20131206-sasha-00005-g8be2375-dirty #4053 Not tainted
[  251.101967] -------------------------------------------------------
[  251.101967] kworker/10:1/4506 is trying to acquire lock:
[  251.101967]  (timekeeper_seq){----..}, at: [<ffffffff81160e96>] retrigger_next_event+0x56/0x70
[  251.101967]
[  251.101967] but task is already holding lock:
[  251.101967]  (hrtimer_bases.lock#11){-.-...}, at: [<ffffffff81160e7c>] retrigger_next_event+0x3c/0x70
[  251.101967]
[  251.101967] which lock already depends on the new lock.
[  251.101967]
[  251.101967]
[  251.101967] the existing dependency chain (in reverse order) is:
[  251.101967]
-> #5 (hrtimer_bases.lock#11){-.-...}:
[snipped]
-> #4 (&rt_b->rt_runtime_lock){-.-...}:
[snipped]
-> #3 (&rq->lock){-.-.-.}:
[snipped]
-> #2 (&p->pi_lock){-.-.-.}:
[snipped]
-> #1 (&(&pool->lock)->rlock){-.-...}:
[  251.101967]        [<ffffffff81194803>] validate_chain+0x6c3/0x7b0
[  251.101967]        [<ffffffff81194d9d>] __lock_acquire+0x4ad/0x580
[  251.101967]        [<ffffffff81194ff2>] lock_acquire+0x182/0x1d0
[  251.101967]        [<ffffffff84398500>] _raw_spin_lock+0x40/0x80
[  251.101967]        [<ffffffff81153e69>] __queue_work+0x1a9/0x3f0
[  251.101967]        [<ffffffff81154168>] queue_work_on+0x98/0x120
[  251.101967]        [<ffffffff81161351>] clock_was_set_delayed+0x21/0x30
[  251.101967]        [<ffffffff811c4bd1>] do_adjtimex+0x111/0x160
[  251.101967]        [<ffffffff811e2711>] compat_sys_adjtimex+0x41/0x70
[  251.101967]        [<ffffffff843a4b49>] ia32_sysret+0x0/0x5
[  251.101967]
-> #0 (timekeeper_seq){----..}:
[snipped]
[  251.101967] other info that might help us debug this:
[  251.101967]
[  251.101967] Chain exists of:
  timekeeper_seq --> &rt_b->rt_runtime_lock --> hrtimer_bases.lock#11

[  251.101967]  Possible unsafe locking scenario:
[  251.101967]
[  251.101967]        CPU0                    CPU1
[  251.101967]        ----                    ----
[  251.101967]   lock(hrtimer_bases.lock#11);
[  251.101967]                                lock(&rt_b->rt_runtime_lock);
[  251.101967]                                lock(hrtimer_bases.lock#11);
[  251.101967]   lock(timekeeper_seq);
[  251.101967]
[  251.101967]  *** DEADLOCK ***
[  251.101967]
[  251.101967] 3 locks held by kworker/10:1/4506:
[  251.101967]  #0:  (events){.+.+.+}, at: [<ffffffff81154960>] process_one_work+0x200/0x530
[  251.101967]  #1:  (hrtimer_work){+.+...}, at: [<ffffffff81154960>] process_one_work+0x200/0x530
[  251.101967]  #2:  (hrtimer_bases.lock#11){-.-...}, at: [<ffffffff81160e7c>] retrigger_next_event+0x3c/0x70
[  251.101967]
[  251.101967] stack backtrace:
[  251.101967] CPU: 10 PID: 4506 Comm: kworker/10:1 Not tainted 3.13.0-rc2-next-20131206-sasha-00005-g8be2375-dirty #4053
[  251.101967] Workqueue: events clock_was_set_work

So the best solution is to avoid calling clock_was_set_delayed() while
holding the timekeeping lock, and instead use a flag variable to
decide if we should call clock_was_set() once we've released the locks.

This works for the case here, where do_adjtimex() was the deadlock
trigger point. Unfortunately, in update_wall_time() we still hold
the jiffies lock, which would deadlock with the ipi triggered by
clock_was_set(), preventing us from calling it even after we drop the
timekeeping lock. So instead call clock_was_set_delayed() at that point.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: stable <stable@vger.kernel.org> #3.10+
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 kernel/time/timekeeping.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 998ec751..c1d36b6 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1278,7 +1278,6 @@ static inline unsigned int accumulate_nsecs_to_secs(struct timekeeper *tk)
 
 			__timekeeping_set_tai_offset(tk, tk->tai_offset - leap);
 
-			clock_was_set_delayed();
 			action = TK_CLOCK_WAS_SET;
 		}
 	}
@@ -1440,6 +1439,19 @@ static void update_wall_time(void)
 	write_seqcount_end(&timekeeper_seq);
 out:
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
+	if (action & TK_CLOCK_WAS_SET) {
+		/*
+		 * XXX -  I'd rather we just call clock_was_set(), but
+		 * since we're currently holding the jiffies lock, calling
+		 * clock_was_set would trigger an ipi which would then grab
+		 * the jiffies lock and we'd deadlock. :(
+		 * The right solution should probably be droping
+		 * the jiffies lock before calling update_wall_time
+		 * but that requires some rework of the tick sched
+		 * code.
+		 */
+		clock_was_set_delayed();
+	}
 }
 
 /**
@@ -1700,11 +1712,13 @@ int do_adjtimex(struct timex *txc)
 	if (tai != orig_tai) {
 		__timekeeping_set_tai_offset(tk, tai);
 		timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);
-		clock_was_set_delayed();
 	}
 	write_seqcount_end(&timekeeper_seq);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 
+	if (tai != orig_tai)
+		clock_was_set();
+
 	ntp_notify_cmos_timer();
 
 	return ret;
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC][PATCH 4/5] timekeeping: Fix CLOCK_TAI timer/nanosleep delays
  2013-12-11 19:11 [RFC][PATCH 0/5] Timekeeping fixes v2 John Stultz
                   ` (2 preceding siblings ...)
  2013-12-11 19:11 ` [RFC][PATCH 3/5] timekeeping: Avoid possible deadlock from clock_was_set_delayed John Stultz
@ 2013-12-11 19:11 ` John Stultz
  2013-12-12 13:25   ` Ingo Molnar
  2013-12-11 19:11 ` [RFC][PATCH 5/5] timekeeping: Fix missing timekeeping_update in suspend path John Stultz
  4 siblings, 1 reply; 20+ messages in thread
From: John Stultz @ 2013-12-11 19:11 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Thomas Gleixner, Prarit Bhargava, Richard Cochran,
	Ingo Molnar, stable

A think-o in the calculation of the monotonic -> tai time offset
causes CLOCK_TAI timers and nanosleeps to expire late (the
latency is ~2x the tai offset).

Fix this by adding the tai offset to the realtime offset instead
of subtracting it.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable <stable@vger.kernel.org> #3.10+
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 kernel/time/timekeeping.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index c1d36b6..a4c5742 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -77,7 +77,7 @@ static void tk_set_wall_to_mono(struct timekeeper *tk, struct timespec wtm)
 	tk->wall_to_monotonic = wtm;
 	set_normalized_timespec(&tmp, -wtm.tv_sec, -wtm.tv_nsec);
 	tk->offs_real = timespec_to_ktime(tmp);
-	tk->offs_tai = ktime_sub(tk->offs_real, ktime_set(tk->tai_offset, 0));
+	tk->offs_tai = ktime_add(tk->offs_real, ktime_set(tk->tai_offset, 0));
 }
 
 static void tk_set_sleep_time(struct timekeeper *tk, struct timespec t)
@@ -595,7 +595,7 @@ s32 timekeeping_get_tai_offset(void)
 static void __timekeeping_set_tai_offset(struct timekeeper *tk, s32 tai_offset)
 {
 	tk->tai_offset = tai_offset;
-	tk->offs_tai = ktime_sub(tk->offs_real, ktime_set(tai_offset, 0));
+	tk->offs_tai = ktime_add(tk->offs_real, ktime_set(tai_offset, 0));
 }
 
 /**
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC][PATCH 5/5] timekeeping: Fix missing timekeeping_update in suspend path
  2013-12-11 19:11 [RFC][PATCH 0/5] Timekeeping fixes v2 John Stultz
                   ` (3 preceding siblings ...)
  2013-12-11 19:11 ` [RFC][PATCH 4/5] timekeeping: Fix CLOCK_TAI timer/nanosleep delays John Stultz
@ 2013-12-11 19:11 ` John Stultz
  4 siblings, 0 replies; 20+ messages in thread
From: John Stultz @ 2013-12-11 19:11 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Thomas Gleixner, Prarit Bhargava, Richard Cochran,
	Ingo Molnar, stable

Since 48cdc135d4840 (Implement a shadow timekeeper), we have to
call timekeeping_update() after any adjustment to the timekeeping
structure in order to make sure that any adjustments to the structure
persist.

In the timekeeping suspend path, we update the timekeeper
structure, so we should be sure to update the shadow-timekeeper
before releasing the timekeeping locks. Currently this isn't done.

In most cases, the next time-related code to run would be
timekeeping_resume(), which does update the shadow-timekeeper, but
in an abundance of caution, this patch adds the call to
timekeeping_update() in the suspend path.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable <stable@vger.kernel.org> #3.10+
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 kernel/time/timekeeping.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index a4c5742..25ec642 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1024,6 +1024,8 @@ static int timekeeping_suspend(void)
 		timekeeping_suspend_time =
 			timespec_add(timekeeping_suspend_time, delta_delta);
 	}
+
+	timekeeping_update(tk, TK_MIRROR);
 	write_seqcount_end(&timekeeper_seq);
 	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
 
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC][PATCH 3/5] timekeeping: Avoid possible deadlock from clock_was_set_delayed
  2013-12-11 19:11 ` [RFC][PATCH 3/5] timekeeping: Avoid possible deadlock from clock_was_set_delayed John Stultz
@ 2013-12-12 13:23   ` Ingo Molnar
  2013-12-12 18:53     ` John Stultz
  2013-12-12 16:34   ` Sasha Levin
  1 sibling, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2013-12-12 13:23 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Thomas Gleixner, Prarit Bhargava, Richard Cochran,
	Sasha Levin, stable


* John Stultz <john.stultz@linaro.org> wrote:

> As part of normal operations, the hrtimer subsystem frequently calls
> into the timekeeping code, creating a locking order of
>   hrtimer locks -> timekeeping locks
> 
> clock_was_set_delayed() was supposed to allow us to avoid deadlocks
> between the timekeeping and hrtimer subsystems, so that we could
> notify the hrtimer subsystem that the time had changed while holding
> the timekeeping locks. This was done by scheduling delayed work
> that would run later, once we were out of the timekeeping code.
> 
> But unfortunately the lock chains are complex enough that in
> scheduling delayed work, we end up eventually trying to grab
> an hrtimer lock.
> 
> Sasha Levin noticed this in testing when the new seqlock lockdep
> enablement triggered the following (somewhat abbreviated) message:
> 
> [  251.100221] ======================================================
> [  251.100221] [ INFO: possible circular locking dependency detected ]
> [  251.100221] 3.13.0-rc2-next-20131206-sasha-00005-g8be2375-dirty #4053 Not tainted
> [  251.101967] -------------------------------------------------------
> [  251.101967] kworker/10:1/4506 is trying to acquire lock:
> [  251.101967]  (timekeeper_seq){----..}, at: [<ffffffff81160e96>] retrigger_next_event+0x56/0x70
> [  251.101967]
> [  251.101967] but task is already holding lock:
> [  251.101967]  (hrtimer_bases.lock#11){-.-...}, at: [<ffffffff81160e7c>] retrigger_next_event+0x3c/0x70
> [  251.101967]
> [  251.101967] which lock already depends on the new lock.
> [  251.101967]
> [  251.101967]
> [  251.101967] the existing dependency chain (in reverse order) is:
> [  251.101967]
> -> #5 (hrtimer_bases.lock#11){-.-...}:
> [snipped]
> -> #4 (&rt_b->rt_runtime_lock){-.-...}:
> [snipped]
> -> #3 (&rq->lock){-.-.-.}:
> [snipped]
> -> #2 (&p->pi_lock){-.-.-.}:
> [snipped]
> -> #1 (&(&pool->lock)->rlock){-.-...}:
> [  251.101967]        [<ffffffff81194803>] validate_chain+0x6c3/0x7b0
> [  251.101967]        [<ffffffff81194d9d>] __lock_acquire+0x4ad/0x580
> [  251.101967]        [<ffffffff81194ff2>] lock_acquire+0x182/0x1d0
> [  251.101967]        [<ffffffff84398500>] _raw_spin_lock+0x40/0x80
> [  251.101967]        [<ffffffff81153e69>] __queue_work+0x1a9/0x3f0
> [  251.101967]        [<ffffffff81154168>] queue_work_on+0x98/0x120
> [  251.101967]        [<ffffffff81161351>] clock_was_set_delayed+0x21/0x30
> [  251.101967]        [<ffffffff811c4bd1>] do_adjtimex+0x111/0x160
> [  251.101967]        [<ffffffff811e2711>] compat_sys_adjtimex+0x41/0x70
> [  251.101967]        [<ffffffff843a4b49>] ia32_sysret+0x0/0x5
> [  251.101967]
> -> #0 (timekeeper_seq){----..}:
> [snipped]
> [  251.101967] other info that might help us debug this:
> [  251.101967]
> [  251.101967] Chain exists of:
>   timekeeper_seq --> &rt_b->rt_runtime_lock --> hrtimer_bases.lock#11
> 
> [  251.101967]  Possible unsafe locking scenario:
> [  251.101967]
> [  251.101967]        CPU0                    CPU1
> [  251.101967]        ----                    ----
> [  251.101967]   lock(hrtimer_bases.lock#11);
> [  251.101967]                                lock(&rt_b->rt_runtime_lock);
> [  251.101967]                                lock(hrtimer_bases.lock#11);
> [  251.101967]   lock(timekeeper_seq);
> [  251.101967]
> [  251.101967]  *** DEADLOCK ***
> [  251.101967]
> [  251.101967] 3 locks held by kworker/10:1/4506:
> [  251.101967]  #0:  (events){.+.+.+}, at: [<ffffffff81154960>] process_one_work+0x200/0x530
> [  251.101967]  #1:  (hrtimer_work){+.+...}, at: [<ffffffff81154960>] process_one_work+0x200/0x530
> [  251.101967]  #2:  (hrtimer_bases.lock#11){-.-...}, at: [<ffffffff81160e7c>] retrigger_next_event+0x3c/0x70
> [  251.101967]
> [  251.101967] stack backtrace:
> [  251.101967] CPU: 10 PID: 4506 Comm: kworker/10:1 Not tainted 3.13.0-rc2-next-20131206-sasha-00005-g8be2375-dirty #4053
> [  251.101967] Workqueue: events clock_was_set_work
> 
> So the best solution is to avoid calling clock_was_set_delayed() while
> holding the timekeeping lock, and instead use a flag variable to
> decide if we should call clock_was_set() once we've released the locks.
> 
> This works for the case here, where do_adjtimex() was the deadlock
> trigger point. Unfortunately, in update_wall_time() we still hold
> the jiffies lock, which would deadlock with the ipi triggered by
> clock_was_set(), preventing us from calling it even after we drop the
> timekeeping lock. So instead call clock_was_set_delayed() at that point.
> 
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Prarit Bhargava <prarit@redhat.com>
> Cc: Richard Cochran <richardcochran@gmail.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Sasha Levin <sasha.levin@oracle.com>
> Cc: stable <stable@vger.kernel.org> #3.10+
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Signed-off-by: John Stultz <john.stultz@linaro.org>
> ---
>  kernel/time/timekeeping.c | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index 998ec751..c1d36b6 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -1278,7 +1278,6 @@ static inline unsigned int accumulate_nsecs_to_secs(struct timekeeper *tk)
>  
>  			__timekeeping_set_tai_offset(tk, tk->tai_offset - leap);
>  
> -			clock_was_set_delayed();
>  			action = TK_CLOCK_WAS_SET;
>  		}
>  	}
> @@ -1440,6 +1439,19 @@ static void update_wall_time(void)
>  	write_seqcount_end(&timekeeper_seq);
>  out:
>  	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
> +	if (action & TK_CLOCK_WAS_SET) {
> +		/*
> +		 * XXX -  I'd rather we just call clock_was_set(), but
> +		 * since we're currently holding the jiffies lock, calling
> +		 * clock_was_set would trigger an ipi which would then grab
> +		 * the jiffies lock and we'd deadlock. :(
> +		 * The right solution should probably be droping
> +		 * the jiffies lock before calling update_wall_time
> +		 * but that requires some rework of the tick sched
> +		 * code.

s/droping/
  dropping

> +		 */
> +		clock_was_set_delayed();

Hm, this feels like a hack. Is the 'rework of the tick sched code' 
going to happen too?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][PATCH 4/5] timekeeping: Fix CLOCK_TAI timer/nanosleep delays
  2013-12-11 19:11 ` [RFC][PATCH 4/5] timekeeping: Fix CLOCK_TAI timer/nanosleep delays John Stultz
@ 2013-12-12 13:25   ` Ingo Molnar
  2013-12-12 18:31     ` John Stultz
  0 siblings, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2013-12-12 13:25 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Thomas Gleixner, Prarit Bhargava, Richard Cochran, stable


* John Stultz <john.stultz@linaro.org> wrote:

> A think-o in the calculation of the monotonic -> tai time offset
> causes CLOCK_TAI timers and nanosleeps to expire late (the
> latency is ~2x the tai offset).
> 
> Fix this by adding the tai offset to the realtime offset instead
> of subtracting it.

Hm, it looks like the whole CLOCK_TAI feature was rushed in, with not 
enough testing done.

If the bugs extend to more than this two-liner then for -stable it 
might be better to just disable CLOCK_TAI (userspace can deal with it 
just fine), and queue up the right fixes for the next merge window or 
so.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][PATCH 3/5] timekeeping: Avoid possible deadlock from clock_was_set_delayed
  2013-12-11 19:11 ` [RFC][PATCH 3/5] timekeeping: Avoid possible deadlock from clock_was_set_delayed John Stultz
  2013-12-12 13:23   ` Ingo Molnar
@ 2013-12-12 16:34   ` Sasha Levin
  2013-12-12 18:32     ` Sasha Levin
  1 sibling, 1 reply; 20+ messages in thread
From: Sasha Levin @ 2013-12-12 16:34 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Thomas Gleixner, Prarit Bhargava, Richard Cochran, Ingo Molnar, stable

On 12/11/2013 02:11 PM, John Stultz wrote:
> As part of normal operations, the hrtimer subsystem frequently calls
> into the timekeeping code, creating a locking order of
>    hrtimer locks -> timekeeping locks
>
> clock_was_set_delayed() was supposed to allow us to avoid deadlocks
> between the timekeeping and hrtimer subsystems, so that we could
> notify the hrtimer subsystem that the time had changed while holding
> the timekeeping locks. This was done by scheduling delayed work
> that would run later, once we were out of the timekeeping code.
>
> But unfortunately the lock chains are complex enough that in
> scheduling delayed work, we end up eventually trying to grab
> an hrtimer lock.
>
> Sasha Levin noticed this in testing when the new seqlock lockdep
> enablement triggered the following (somewhat abbreviated) message:

[snip]

This seems to work for me, I don't see the lockdep spew anymore.

	Tested-by: Sasha Levin <sasha.levin@oracle.com>


Thanks,
Sasha


* Re: [RFC][PATCH 4/5] timekeeping: Fix CLOCK_TAI timer/nanosleep delays
  2013-12-12 13:25   ` Ingo Molnar
@ 2013-12-12 18:31     ` John Stultz
  2013-12-13 14:10       ` Ingo Molnar
  0 siblings, 1 reply; 20+ messages in thread
From: John Stultz @ 2013-12-12 18:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Thomas Gleixner, Prarit Bhargava, Richard Cochran, stable

On 12/12/2013 05:25 AM, Ingo Molnar wrote:
> * John Stultz <john.stultz@linaro.org> wrote:
>> A thinko in the calculation of the monotonic -> tai time offset
>> causes CLOCK_TAI timers and nanosleeps to expire late (the
>> latency is ~2x the tai offset).
>>
>> Fix this by adding the tai offset to the realtime offset instead
>> of subtracting it.
> Hm, it looks like the whole CLOCK_TAI feature was rushed in, with not 
> enough testing done.

I wouldn't say rushed (I sat on the patches for a while), but there was
a hole in my testing, and the order in which I ran my automated tests
made it seem that all was well.

To avoid this in the future, I've already committed improvements to my
test set, and will be adding additional timer latency checks soon.

> If the bugs extend to more than this two-liner then for -stable it 
> might be better to just disable CLOCK_TAI (userspace can deal with it 
> just fine), and queue up the right fixes for the next merge window or 
> so.

I don't foresee further issues (famous last words, eh), but since I was
planning on keeping patch #4 and #5 for 3.14 anyway, we can wait till
those land upstream to decide if the two-liner is sufficient or if
disabling CLOCK_TAI in older -stable kernels is the right approach. That
sound ok?

thanks
-john




* Re: [RFC][PATCH 3/5] timekeeping: Avoid possible deadlock from clock_was_set_delayed
  2013-12-12 16:34   ` Sasha Levin
@ 2013-12-12 18:32     ` Sasha Levin
  2013-12-12 18:59       ` John Stultz
  0 siblings, 1 reply; 20+ messages in thread
From: Sasha Levin @ 2013-12-12 18:32 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Thomas Gleixner, Prarit Bhargava, Richard Cochran, Ingo Molnar, stable

On 12/12/2013 11:34 AM, Sasha Levin wrote:
> On 12/11/2013 02:11 PM, John Stultz wrote:
>> As part of normal operations, the hrtimer subsystem frequently calls
>> into the timekeeping code, creating a locking order of
>>    hrtimer locks -> timekeeping locks
>>
>> clock_was_set_delayed() was supposed to allow us to avoid deadlocks
>> between the timekeeping and hrtimer subsystems, so that we could
>> notify the hrtimer subsystem that the time had changed while holding
>> the timekeeping locks. This was done by scheduling delayed work
>> that would run later once we were out of the timekeeping code.
>>
>> But unfortunately the lock chains are complex enough that in
>> scheduling delayed work, we end up eventually trying to grab
>> an hrtimer lock.
>>
>> Sasha Levin noticed this in testing when the new seqlock lockdep
>> enablement triggered the following (somewhat abbreviated) message:
>
> [snip]
>
> This seems to work for me, I don't see the lockdep spew anymore.
>
>      Tested-by: Sasha Levin <sasha.levin@oracle.com>

I think I spoke too soon.

It took way more time to reproduce than previously, but I got:


[ 1195.547491] ======================================================
[ 1195.548319] [ INFO: possible circular locking dependency detected ]
[ 1195.549222] 3.13.0-rc3-next-20131212-sasha-00007-gbcfdb32 #4062 Not tainted
[ 1195.550184] -------------------------------------------------------
[ 1195.550184] trinity-child89/28195 is trying to acquire lock:

[ 1195.550184]  (timekeeper_seq){----..}, at: [<ffffffff8116a8e7>] start_bandwidth_timer+0x27/0x60
[ 1195.550184]
[ 1195.550184] but task is already holding lock:
[ 1195.550184]  (&rt_b->rt_runtime_lock){-.-...}, at: [<ffffffff811842f9>] __enqueue_rt_entity+0x229/0x290
[ 1195.550184]
[ 1195.550184] which lock already depends on the new lock.
[ 1195.550184]
[ 1195.550184]
[ 1195.550184] the existing dependency chain (in reverse order) is:
[ 1195.550184]
-> #4 (&rt_b->rt_runtime_lock){-.-...}:
[ 1195.550184]        [<ffffffff81194803>] validate_chain+0x6c3/0x7b0
[ 1195.550184]        [<ffffffff81194d9d>] __lock_acquire+0x4ad/0x580
[ 1195.550184]        [<ffffffff81194ff2>] lock_acquire+0x182/0x1d0
[ 1195.550184]        [<ffffffff843b0760>] _raw_spin_lock+0x40/0x80
[ 1195.550184]        [<ffffffff811842f9>] __enqueue_rt_entity+0x229/0x290
[ 1195.550184]        [<ffffffff811846db>] enqueue_rt_entity+0x6b/0x80
[ 1195.550184]        [<ffffffff81184726>] enqueue_task_rt+0x36/0xb0
[ 1195.550184]        [<ffffffff8116af42>] enqueue_task+0x52/0x60
[ 1195.550184]        [<ffffffff81171ebb>] __sched_setscheduler+0x33b/0x3f0
[ 1195.550184]        [<ffffffff81171f80>] sched_setscheduler_nocheck+0x10/0x20
[ 1195.550184]        [<ffffffff811b5e2b>] rcu_cpu_kthread_setup+0x2b/0x30
[ 1195.550184]        [<ffffffff81166efd>] smpboot_thread_fn+0x1ed/0x2c0
[ 1195.550184]        [<ffffffff8115d9e5>] kthread+0x105/0x110
[ 1195.550184]        [<ffffffff843babfc>] ret_from_fork+0x7c/0xb0
[ 1195.550184]
-> #3 (&rq->lock){-.-.-.}:
[ 1195.550184]        [<ffffffff81194803>] validate_chain+0x6c3/0x7b0
[ 1195.550184]        [<ffffffff81194d9d>] __lock_acquire+0x4ad/0x580
[ 1195.550184]        [<ffffffff81194ff2>] lock_acquire+0x182/0x1d0
[ 1195.550184]        [<ffffffff843b0760>] _raw_spin_lock+0x40/0x80
[ 1195.550184]        [<ffffffff8116f489>] wake_up_new_task+0x149/0x270
[ 1195.550184]        [<ffffffff8113002a>] do_fork+0x1ba/0x270
[ 1195.550184]        [<ffffffff81130166>] kernel_thread+0x26/0x30
[ 1195.550184]        [<ffffffff8439fe76>] rest_init+0x26/0x150
[ 1195.578519]        [<ffffffff8706b377>] start_kernel+0x3b9/0x3c0
[ 1195.578519]        [<ffffffff8706a3d9>] x86_64_start_reservations+0x2a/0x2c
[ 1195.578519]        [<ffffffff8706a5ae>] x86_64_start_kernel+0x186/0x195
[ 1195.578519]
-> #2 (&p->pi_lock){-.-.-.}:
[ 1195.578519]        [<ffffffff81194803>] validate_chain+0x6c3/0x7b0
[ 1195.578519]        [<ffffffff81194d9d>] __lock_acquire+0x4ad/0x580
[ 1195.578519]        [<ffffffff81194ff2>] lock_acquire+0x182/0x1d0
[ 1195.578519]        [<ffffffff843b0911>] _raw_spin_lock_irqsave+0x91/0xd0
[ 1195.578519]        [<ffffffff81172909>] try_to_wake_up+0x39/0x2a0
[ 1195.578519]        [<ffffffff81172bef>] wake_up_process+0x3f/0x50
[ 1195.578519]        [<ffffffff811507da>] start_worker+0x2a/0x40
[ 1195.578519]        [<ffffffff81156a9d>] create_and_start_worker+0x4d/0x90
[ 1195.578519]        [<ffffffff87091ceb>] init_workqueues+0x192/0x3cb
[ 1195.578519]        [<ffffffff810020ca>] do_one_initcall+0xca/0x1e0
[ 1195.578519]        [<ffffffff8706acf1>] kernel_init_freeable+0x2b4/0x354
[ 1195.578519]        [<ffffffff8439ffae>] kernel_init+0xe/0x130
[ 1195.578519]        [<ffffffff843babfc>] ret_from_fork+0x7c/0xb0
[ 1195.578519]
-> #1 (&(&pool->lock)->rlock){-.-...}:
[ 1195.578519]        [<ffffffff81194803>] validate_chain+0x6c3/0x7b0
[ 1195.578519]        [<ffffffff81194d9d>] __lock_acquire+0x4ad/0x580
[ 1195.578519]        [<ffffffff81194ff2>] lock_acquire+0x182/0x1d0
[ 1195.578519]        [<ffffffff843b0760>] _raw_spin_lock+0x40/0x80
[ 1195.578519]        [<ffffffff81153e0e>] __queue_work+0x14e/0x3f0
[ 1195.578519]        [<ffffffff81154168>] queue_work_on+0x98/0x120
[ 1195.578519]        [<ffffffff81161351>] clock_was_set_delayed+0x21/0x30
[ 1195.578519]        [<ffffffff811c4b41>] do_adjtimex+0x111/0x160
[ 1195.578519]        [<ffffffff811360e3>] SYSC_adjtimex+0x43/0x80
[ 1195.578519]        [<ffffffff8113612e>] SyS_adjtimex+0xe/0x10
[ 1195.578519]        [<ffffffff843baed0>] tracesys+0xdd/0xe2
[ 1195.578519]
-> #0 (timekeeper_seq){----..}:
[ 1195.578519]        [<ffffffff81193d2f>] check_prev_add+0x13f/0x550
[ 1195.578519]        [<ffffffff81194803>] validate_chain+0x6c3/0x7b0
[ 1195.578519]        [<ffffffff81194d9d>] __lock_acquire+0x4ad/0x580
[ 1195.578519]        [<ffffffff81194ff2>] lock_acquire+0x182/0x1d0
[ 1195.578519]        [<ffffffff811c541a>] ktime_get+0x9a/0x1b0
[ 1195.578519]        [<ffffffff8116a8e7>] start_bandwidth_timer+0x27/0x60
[ 1195.578519]        [<ffffffff8118430e>] __enqueue_rt_entity+0x23e/0x290
[ 1195.578519]        [<ffffffff811846db>] enqueue_rt_entity+0x6b/0x80
[ 1195.578519]        [<ffffffff81184726>] enqueue_task_rt+0x36/0xb0
[ 1195.578519]        [<ffffffff8116af42>] enqueue_task+0x52/0x60
[ 1195.578519]        [<ffffffff81171ebb>] __sched_setscheduler+0x33b/0x3f0
[ 1195.578519]        [<ffffffff81172043>] sched_setscheduler+0x13/0x20
[ 1195.578519]        [<ffffffff811720c9>] do_sched_setscheduler+0x79/0xa0
[ 1195.578519]        [<ffffffff81172129>] SyS_sched_setscheduler+0x19/0x20
[ 1195.578519]        [<ffffffff843baed0>] tracesys+0xdd/0xe2
[ 1195.578519]
[ 1195.578519] other info that might help us debug this:
[ 1195.578519]
[ 1195.578519] Chain exists of:
   timekeeper_seq --> &rq->lock --> &rt_b->rt_runtime_lock

[ 1195.578519]  Possible unsafe locking scenario:
[ 1195.578519]
[ 1195.578519]        CPU0                    CPU1
[ 1195.578519]        ----                    ----
[ 1195.578519]   lock(&rt_b->rt_runtime_lock);
[ 1195.578519]                                lock(&rq->lock);
[ 1195.578519]                                lock(&rt_b->rt_runtime_lock);
[ 1195.578519]   lock(timekeeper_seq);
[ 1195.578519]
[ 1195.578519]  *** DEADLOCK ***
[ 1195.578519]
[ 1195.578519] 4 locks held by trinity-child89/28195:
[ 1195.578519]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff81167d00>] rcu_read_lock+0x0/0x80
[ 1195.578519]  #1:  (&p->pi_lock){-.-.-.}, at: [<ffffffff811691d0>] task_rq_lock+0x40/0xb0
[ 1195.578519]  #2:  (&rq->lock){-.-.-.}, at: [<ffffffff811691eb>] task_rq_lock+0x5b/0xb0
[ 1195.578519]  #3:  (&rt_b->rt_runtime_lock){-.-...}, at: [<ffffffff811842f9>] __enqueue_rt_entity+0x229/0x290
[ 1195.578519]
[ 1195.578519] stack backtrace:
[main] Random reseed: 2208014823
[ 1195.578519] CPU: 60 PID: 28195 Comm: trinity-child89 Not tainted 3.13.0-rc3-next-20131212-sasha-00007-gbcfdb32 #4062
[ 1195.578519]  0000000000000000 ffff8800b0995b48 ffffffff843a9d97 0000000000000000
[ 1195.578519]  0000000000000000 ffff8800b0995b98 ffffffff811918d9 ffff8800b0995bb8
[ 1195.578519]  ffffffff875fdd00 ffff8800b0995b98 ffff880e48048c80 ffff880e48048cb8
[ 1195.578519] Call Trace:
[ 1195.578519]  [<ffffffff843a9d97>] dump_stack+0x52/0x7f
[ 1195.578519]  [<ffffffff811918d9>] print_circular_bug+0x129/0x160
[ 1195.578519]  [<ffffffff81193d2f>] check_prev_add+0x13f/0x550
[ 1195.578519]  [<ffffffff81194803>] validate_chain+0x6c3/0x7b0
[ 1195.578519]  [<ffffffff81175588>] ? sched_clock_cpu+0x108/0x120
[ 1195.578519]  [<ffffffff81194d9d>] __lock_acquire+0x4ad/0x580
[ 1195.578519]  [<ffffffff81194ff2>] lock_acquire+0x182/0x1d0
[ 1195.578519]  [<ffffffff8116a8e7>] ? start_bandwidth_timer+0x27/0x60
[ 1195.578519]  [<ffffffff81190ea2>] ? __lock_acquired+0x2a2/0x2e0
[ 1195.578519]  [<ffffffff811c541a>] ktime_get+0x9a/0x1b0
[ 1195.578519]  [<ffffffff8116a8e7>] ? start_bandwidth_timer+0x27/0x60
[ 1195.578519]  [<ffffffff8116a8e7>] start_bandwidth_timer+0x27/0x60
[ 1195.578519]  [<ffffffff8118430e>] __enqueue_rt_entity+0x23e/0x290
[ 1195.578519]  [<ffffffff8117a027>] ? update_rq_runnable_avg+0x127/0x1d0
[ 1195.578519]  [<ffffffff811846db>] enqueue_rt_entity+0x6b/0x80
[ 1195.578519]  [<ffffffff81184726>] enqueue_task_rt+0x36/0xb0
[ 1195.578519]  [<ffffffff8116af42>] enqueue_task+0x52/0x60
[ 1195.578519]  [<ffffffff81171ebb>] __sched_setscheduler+0x33b/0x3f0
[ 1195.578519]  [<ffffffff812793d6>] ? might_fault+0x56/0xb0
[ 1195.578519]  [<ffffffff81172043>] sched_setscheduler+0x13/0x20
[ 1195.578519]  [<ffffffff811720c9>] do_sched_setscheduler+0x79/0xa0
[ 1195.578519]  [<ffffffff81172129>] SyS_sched_setscheduler+0x19/0x20
[ 1195.578519]  [<ffffffff843baed0>] tracesys+0xdd/0xe2


Thanks,
Sasha


* Re: [RFC][PATCH 3/5] timekeeping: Avoid possible deadlock from clock_was_set_delayed
  2013-12-12 13:23   ` Ingo Molnar
@ 2013-12-12 18:53     ` John Stultz
  0 siblings, 0 replies; 20+ messages in thread
From: John Stultz @ 2013-12-12 18:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Thomas Gleixner, Prarit Bhargava, Richard Cochran,
	Sasha Levin, stable

On 12/12/2013 05:23 AM, Ingo Molnar wrote:
> * John Stultz <john.stultz@linaro.org> wrote:
>
>> As part of normal operations, the hrtimer subsystem frequently calls
>> into the timekeeping code, creating a locking order of
>>   hrtimer locks -> timekeeping locks
>>
>> clock_was_set_delayed() was supposed to allow us to avoid deadlocks
>> between the timekeeping and hrtimer subsystems, so that we could
>> notify the hrtimer subsystem that the time had changed while holding
>> the timekeeping locks. This was done by scheduling delayed work
>> that would run later once we were out of the timekeeping code.
>>
>> But unfortunately the lock chains are complex enough that in
>> scheduling delayed work, we end up eventually trying to grab
>> an hrtimer lock.
>>
>> Sasha Levin noticed this in testing when the new seqlock lockdep
>> enablement triggered the following (somewhat abbreviated) message:
>>
>> [  251.100221] ======================================================
>> [  251.100221] [ INFO: possible circular locking dependency detected ]
>> [  251.100221] 3.13.0-rc2-next-20131206-sasha-00005-g8be2375-dirty #4053 Not tainted
>> [  251.101967] -------------------------------------------------------
>> [  251.101967] kworker/10:1/4506 is trying to acquire lock:
>> [  251.101967]  (timekeeper_seq){----..}, at: [<ffffffff81160e96>] retrigger_next_event+0x56/0x70
>> [  251.101967]
>> [  251.101967] but task is already holding lock:
>> [  251.101967]  (hrtimer_bases.lock#11){-.-...}, at: [<ffffffff81160e7c>] retrigger_next_event+0x3c/0x70
>> [  251.101967]
>> [  251.101967] which lock already depends on the new lock.
>> [  251.101967]
>> [  251.101967]
>> [  251.101967] the existing dependency chain (in reverse order) is:
>> [  251.101967]
>> -> #5 (hrtimer_bases.lock#11){-.-...}:
>> [snipped]
>> -> #4 (&rt_b->rt_runtime_lock){-.-...}:
>> [snipped]
>> -> #3 (&rq->lock){-.-.-.}:
>> [snipped]
>> -> #2 (&p->pi_lock){-.-.-.}:
>> [snipped]
>> -> #1 (&(&pool->lock)->rlock){-.-...}:
>> [  251.101967]        [<ffffffff81194803>] validate_chain+0x6c3/0x7b0
>> [  251.101967]        [<ffffffff81194d9d>] __lock_acquire+0x4ad/0x580
>> [  251.101967]        [<ffffffff81194ff2>] lock_acquire+0x182/0x1d0
>> [  251.101967]        [<ffffffff84398500>] _raw_spin_lock+0x40/0x80
>> [  251.101967]        [<ffffffff81153e69>] __queue_work+0x1a9/0x3f0
>> [  251.101967]        [<ffffffff81154168>] queue_work_on+0x98/0x120
>> [  251.101967]        [<ffffffff81161351>] clock_was_set_delayed+0x21/0x30
>> [  251.101967]        [<ffffffff811c4bd1>] do_adjtimex+0x111/0x160
>> [  251.101967]        [<ffffffff811e2711>] compat_sys_adjtimex+0x41/0x70
>> [  251.101967]        [<ffffffff843a4b49>] ia32_sysret+0x0/0x5
>> [  251.101967]
>> -> #0 (timekeeper_seq){----..}:
>> [snipped]
>> [  251.101967] other info that might help us debug this:
>> [  251.101967]
>> [  251.101967] Chain exists of:
>>   timekeeper_seq --> &rt_b->rt_runtime_lock --> hrtimer_bases.lock#11
>>
>> [  251.101967]  Possible unsafe locking scenario:
>> [  251.101967]
>> [  251.101967]        CPU0                    CPU1
>> [  251.101967]        ----                    ----
>> [  251.101967]   lock(hrtimer_bases.lock#11);
>> [  251.101967]                                lock(&rt_b->rt_runtime_lock);
>> [  251.101967]                                lock(hrtimer_bases.lock#11);
>> [  251.101967]   lock(timekeeper_seq);
>> [  251.101967]
>> [  251.101967]  *** DEADLOCK ***
>> [  251.101967]
>> [  251.101967] 3 locks held by kworker/10:1/4506:
>> [  251.101967]  #0:  (events){.+.+.+}, at: [<ffffffff81154960>] process_one_work+0x200/0x530
>> [  251.101967]  #1:  (hrtimer_work){+.+...}, at: [<ffffffff81154960>] process_one_work+0x200/0x530
>> [  251.101967]  #2:  (hrtimer_bases.lock#11){-.-...}, at: [<ffffffff81160e7c>] retrigger_next_event+0x3c/0x70
>> [  251.101967]
>> [  251.101967] stack backtrace:
>> [  251.101967] CPU: 10 PID: 4506 Comm: kworker/10:1 Not tainted 3.13.0-rc2-next-20131206-sasha-00005-g8be2375-dirty #4053
>> [  251.101967] Workqueue: events clock_was_set_work
>>
>> So the best solution is to avoid calling clock_was_set_delayed() while
>> holding the timekeeping lock, and to instead use a flag variable to
>> decide if we should call clock_was_set() once we've released the locks.
>>
>> This works for the case here, where do_adjtimex() was the deadlock
>> trigger point. Unfortunately, in update_wall_time() we still hold
>> the jiffies lock, which would deadlock with the ipi triggered by
>> clock_was_set(), preventing us from calling it even after we drop the
>> timekeeping lock. So instead call clock_was_set_delayed() at that point.
>>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Prarit Bhargava <prarit@redhat.com>
>> Cc: Richard Cochran <richardcochran@gmail.com>
>> Cc: Ingo Molnar <mingo@kernel.org>
>> Cc: Sasha Levin <sasha.levin@oracle.com>
>> Cc: stable <stable@vger.kernel.org> #3.10+
>> Reported-by: Sasha Levin <sasha.levin@oracle.com>
>> Signed-off-by: John Stultz <john.stultz@linaro.org>
>> ---
>>  kernel/time/timekeeping.c | 18 ++++++++++++++++--
>>  1 file changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
>> index 998ec751..c1d36b6 100644
>> --- a/kernel/time/timekeeping.c
>> +++ b/kernel/time/timekeeping.c
>> @@ -1278,7 +1278,6 @@ static inline unsigned int accumulate_nsecs_to_secs(struct timekeeper *tk)
>>  
>>  			__timekeeping_set_tai_offset(tk, tk->tai_offset - leap);
>>  
>> -			clock_was_set_delayed();
>>  			action = TK_CLOCK_WAS_SET;
>>  		}
>>  	}
>> @@ -1440,6 +1439,19 @@ static void update_wall_time(void)
>>  	write_seqcount_end(&timekeeper_seq);
>>  out:
>>  	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
>> +	if (action & TK_CLOCK_WAS_SET) {
>> +		/*
>> +		 * XXX -  I'd rather we just call clock_was_set(), but
>> +		 * since we're currently holding the jiffies lock, calling
>> +		 * clock_was_set would trigger an ipi which would then grab
>> +		 * the jiffies lock and we'd deadlock. :(
>> +		 * The right solution should probably be droping
>> +		 * the jiffies lock before calling update_wall_time
>> +		 * but that requires some rework of the tick sched
>> +		 * code.
> s/droping/
>   dropping
>
>> +		 */
>> +		clock_was_set_delayed();
> Hm, this feels like a hack. Is the 'rework of the tick sched code' 
> going to happen too?

Yea. It is sort of a hack, since the whole purpose of
clock_was_set_delayed() was that it was supposed to be safe to call deep
in the timekeeping code.

We also need to backport this i-don't-know-how-far-back to -stable, and
that may include kernels where the jiffies lock and timekeeping lock
aren't yet split apart. So I suspect the hack-ish fix with a follow-up
cleanup is the best route.

I'll try to spin up a first draft of the cleanup patch so we can at
least have that discussed and hopefully ready to be queued.

thanks
-john








* Re: [RFC][PATCH 3/5] timekeeping: Avoid possible deadlock from clock_was_set_delayed
  2013-12-12 18:32     ` Sasha Levin
@ 2013-12-12 18:59       ` John Stultz
  2013-12-12 19:05         ` Sasha Levin
  0 siblings, 1 reply; 20+ messages in thread
From: John Stultz @ 2013-12-12 18:59 UTC (permalink / raw)
  To: Sasha Levin, LKML
  Cc: Thomas Gleixner, Prarit Bhargava, Richard Cochran, Ingo Molnar, stable

On 12/12/2013 10:32 AM, Sasha Levin wrote:
> On 12/12/2013 11:34 AM, Sasha Levin wrote:
>> On 12/11/2013 02:11 PM, John Stultz wrote:
>>> As part of normal operations, the hrtimer subsystem frequently calls
>>> into the timekeeping code, creating a locking order of
>>>    hrtimer locks -> timekeeping locks
>>>
>>> clock_was_set_delayed() was supposed to allow us to avoid deadlocks
>>> between the timekeeping and hrtimer subsystems, so that we could
>>> notify the hrtimer subsystem that the time had changed while holding
>>> the timekeeping locks. This was done by scheduling delayed work
>>> that would run later once we were out of the timekeeping code.
>>>
>>> But unfortunately the lock chains are complex enough that in
>>> scheduling delayed work, we end up eventually trying to grab
>>> an hrtimer lock.
>>>
>>> Sasha Levin noticed this in testing when the new seqlock lockdep
>>> enablement triggered the following (somewhat abbreviated) message:
>>
>> [snip]
>>
>> This seems to work for me, I don't see the lockdep spew anymore.
>>
>>      Tested-by: Sasha Levin <sasha.levin@oracle.com>
>
> I think I spoke too soon.
>
> It took way more time to reproduce than previously, but I got:
>
>
> -> #1 (&(&pool->lock)->rlock){-.-...}:
> [ 1195.578519]        [<ffffffff81194803>] validate_chain+0x6c3/0x7b0
> [ 1195.578519]        [<ffffffff81194d9d>] __lock_acquire+0x4ad/0x580
> [ 1195.578519]        [<ffffffff81194ff2>] lock_acquire+0x182/0x1d0
> [ 1195.578519]        [<ffffffff843b0760>] _raw_spin_lock+0x40/0x80
> [ 1195.578519]        [<ffffffff81153e0e>] __queue_work+0x14e/0x3f0
> [ 1195.578519]        [<ffffffff81154168>] queue_work_on+0x98/0x120
> [ 1195.578519]        [<ffffffff81161351>] clock_was_set_delayed+0x21/0x30
> [ 1195.578519]        [<ffffffff811c4b41>] do_adjtimex+0x111/0x160
> [ 1195.578519]        [<ffffffff811360e3>] SYSC_adjtimex+0x43/0x80
> [ 1195.578519]        [<ffffffff8113612e>] SyS_adjtimex+0xe/0x10
> [ 1195.578519]        [<ffffffff843baed0>] tracesys+0xdd/0xe2
> [ 1195.578519]

Are you sure you have that patch applied?

With it we shouldn't be calling clock_was_set_delayed() from do_adjtimex().

thanks
-john






* Re: [RFC][PATCH 3/5] timekeeping: Avoid possible deadlock from clock_was_set_delayed
  2013-12-12 18:59       ` John Stultz
@ 2013-12-12 19:05         ` Sasha Levin
  2013-12-12 19:13           ` John Stultz
  0 siblings, 1 reply; 20+ messages in thread
From: Sasha Levin @ 2013-12-12 19:05 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Thomas Gleixner, Prarit Bhargava, Richard Cochran, Ingo Molnar, stable

On 12/12/2013 01:59 PM, John Stultz wrote:
> On 12/12/2013 10:32 AM, Sasha Levin wrote:
>> On 12/12/2013 11:34 AM, Sasha Levin wrote:
>>> On 12/11/2013 02:11 PM, John Stultz wrote:
>>>> As part of normal operations, the hrtimer subsystem frequently calls
>>>> into the timekeeping code, creating a locking order of
>>>>     hrtimer locks -> timekeeping locks
>>>>
>>>> clock_was_set_delayed() was supposed to allow us to avoid deadlocks
>>>> between the timekeeping and hrtimer subsystems, so that we could
>>>> notify the hrtimer subsystem that the time had changed while holding
>>>> the timekeeping locks. This was done by scheduling delayed work
>>>> that would run later once we were out of the timekeeping code.
>>>>
>>>> But unfortunately the lock chains are complex enough that in
>>>> scheduling delayed work, we end up eventually trying to grab
>>>> an hrtimer lock.
>>>>
>>>> Sasha Levin noticed this in testing when the new seqlock lockdep
>>>> enablement triggered the following (somewhat abbreviated) message:
>>>
>>> [snip]
>>>
>>> This seems to work for me, I don't see the lockdep spew anymore.
>>>
>>>       Tested-by: Sasha Levin <sasha.levin@oracle.com>
>>
>> I think I spoke too soon.
>>
>> It took way more time to reproduce than previously, but I got:
>>
>>
>> -> #1 (&(&pool->lock)->rlock){-.-...}:
>> [ 1195.578519]        [<ffffffff81194803>] validate_chain+0x6c3/0x7b0
>> [ 1195.578519]        [<ffffffff81194d9d>] __lock_acquire+0x4ad/0x580
>> [ 1195.578519]        [<ffffffff81194ff2>] lock_acquire+0x182/0x1d0
>> [ 1195.578519]        [<ffffffff843b0760>] _raw_spin_lock+0x40/0x80
>> [ 1195.578519]        [<ffffffff81153e0e>] __queue_work+0x14e/0x3f0
>> [ 1195.578519]        [<ffffffff81154168>] queue_work_on+0x98/0x120
>> [ 1195.578519]        [<ffffffff81161351>] clock_was_set_delayed+0x21/0x30
>> [ 1195.578519]        [<ffffffff811c4b41>] do_adjtimex+0x111/0x160
>> [ 1195.578519]        [<ffffffff811360e3>] SYSC_adjtimex+0x43/0x80
>> [ 1195.578519]        [<ffffffff8113612e>] SyS_adjtimex+0xe/0x10
>> [ 1195.578519]        [<ffffffff843baed0>] tracesys+0xdd/0xe2
>> [ 1195.578519]
>
> Are you sure you have that patch applied?
>
> With it we shouldn't be calling clock_was_set_delayed() from do_adjtimex().

Hm, it seems that there's a conflict there that wasn't resolved properly.
Does this patch depend on anything else that's not currently in -next?


Thanks,
Sasha



* Re: [RFC][PATCH 3/5] timekeeping: Avoid possible deadlock from clock_was_set_delayed
  2013-12-12 19:05         ` Sasha Levin
@ 2013-12-12 19:13           ` John Stultz
  2013-12-17  5:15             ` John Stultz
  0 siblings, 1 reply; 20+ messages in thread
From: John Stultz @ 2013-12-12 19:13 UTC (permalink / raw)
  To: Sasha Levin, LKML
  Cc: Thomas Gleixner, Prarit Bhargava, Richard Cochran, Ingo Molnar, stable

On 12/12/2013 11:05 AM, Sasha Levin wrote:
> On 12/12/2013 01:59 PM, John Stultz wrote:
>> On 12/12/2013 10:32 AM, Sasha Levin wrote:
>>> On 12/12/2013 11:34 AM, Sasha Levin wrote:
>>>> On 12/11/2013 02:11 PM, John Stultz wrote:
>>>>> As part of normal operations, the hrtimer subsystem frequently calls
>>>>> into the timekeeping code, creating a locking order of
>>>>>     hrtimer locks -> timekeeping locks
>>>>>
>>>>> clock_was_set_delayed() was supposed to allow us to avoid deadlocks
>>>>> between the timekeeping and hrtimer subsystems, so that we could
>>>>> notify the hrtimer subsystem that the time had changed while holding
>>>>> the timekeeping locks. This was done by scheduling delayed work
>>>>> that would run later once we were out of the timekeeping code.
>>>>>
>>>>> But unfortunately the lock chains are complex enough that in
>>>>> scheduling delayed work, we end up eventually trying to grab
>>>>> an hrtimer lock.
>>>>>
>>>>> Sasha Levin noticed this in testing when the new seqlock lockdep
>>>>> enablement triggered the following (somewhat abbreviated) message:
>>>>
>>>> [snip]
>>>>
>>>> This seems to work for me, I don't see the lockdep spew anymore.
>>>>
>>>>       Tested-by: Sasha Levin <sasha.levin@oracle.com>
>>>
>>> I think I spoke too soon.
>>>
>>> It took way more time to reproduce than previously, but I got:
>>>
>>>
>>> -> #1 (&(&pool->lock)->rlock){-.-...}:
>>> [ 1195.578519]        [<ffffffff81194803>] validate_chain+0x6c3/0x7b0
>>> [ 1195.578519]        [<ffffffff81194d9d>] __lock_acquire+0x4ad/0x580
>>> [ 1195.578519]        [<ffffffff81194ff2>] lock_acquire+0x182/0x1d0
>>> [ 1195.578519]        [<ffffffff843b0760>] _raw_spin_lock+0x40/0x80
>>> [ 1195.578519]        [<ffffffff81153e0e>] __queue_work+0x14e/0x3f0
>>> [ 1195.578519]        [<ffffffff81154168>] queue_work_on+0x98/0x120
>>> [ 1195.578519]        [<ffffffff81161351>] clock_was_set_delayed+0x21/0x30
>>> [ 1195.578519]        [<ffffffff811c4b41>] do_adjtimex+0x111/0x160
>>> [ 1195.578519]        [<ffffffff811360e3>] SYSC_adjtimex+0x43/0x80
>>> [ 1195.578519]        [<ffffffff8113612e>] SyS_adjtimex+0xe/0x10
>>> [ 1195.578519]        [<ffffffff843baed0>] tracesys+0xdd/0xe2
>>> [ 1195.578519]
>>
>> Are you sure you have that patch applied?
>>
>> With it we shouldn't be calling clock_was_set_delayed() from
>> do_adjtimex().
>
> Hm, it seems that there's a conflict there that wasn't resolved
> properly. Does this patch
> depend on anything else that's not currently in -next?

Oh yes, sorry, I didn't cc you on the entire patch set. Apologies!

You'll probably want to grab the two previous patches:
https://lkml.org/lkml/2013/12/11/479
https://lkml.org/lkml/2013/12/11/758

thanks
-john



* Re: [RFC][PATCH 4/5] timekeeping: Fix CLOCK_TAI timer/nanosleep delays
  2013-12-12 18:31     ` John Stultz
@ 2013-12-13 14:10       ` Ingo Molnar
  0 siblings, 0 replies; 20+ messages in thread
From: Ingo Molnar @ 2013-12-13 14:10 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Thomas Gleixner, Prarit Bhargava, Richard Cochran, stable


* John Stultz <john.stultz@linaro.org> wrote:

> > If the bugs extend to more than this two-liner then for -stable it 
> > might be better to just disable CLOCK_TAI (userspace can deal with 
> > it just fine), and queue up the right fixes for the next merge 
> > window or so.
> 
> I don't foresee further issues (famous last words, eh), but since I 
> was planning on keeping patch #4 and #5 for 3.14 anyway, we can wait 
> till those land upstream to decide if the two-liner is sufficient or 
> if disabling CLOCK_TAI in older -stable kernels is the right 
> approach. That sound ok?

Yeah, that certainly sounds good to me.

Thanks,

	Ingo


* Re: [RFC][PATCH 3/5] timekeeping: Avoid possible deadlock from clock_was_set_delayed
  2013-12-12 19:13           ` John Stultz
@ 2013-12-17  5:15             ` John Stultz
  2013-12-17  6:41               ` Sasha Levin
  0 siblings, 1 reply; 20+ messages in thread
From: John Stultz @ 2013-12-17  5:15 UTC (permalink / raw)
  To: Sasha Levin, LKML
  Cc: Thomas Gleixner, Prarit Bhargava, Richard Cochran, Ingo Molnar, stable

On 12/12/2013 11:13 AM, John Stultz wrote:
> On 12/12/2013 11:05 AM, Sasha Levin wrote:
>> On 12/12/2013 01:59 PM, John Stultz wrote:
>>> On 12/12/2013 10:32 AM, Sasha Levin wrote:
>>>> On 12/12/2013 11:34 AM, Sasha Levin wrote:
>>>>> On 12/11/2013 02:11 PM, John Stultz wrote:
>>>>>> As part of normal operations, the hrtimer subsystem frequently calls
>>>>>> into the timekeeping code, creating a locking order of
>>>>>>     hrtimer locks -> timekeeping locks
>>>>>>
>>>>>> clock_was_set_delayed() was supposed to allow us to avoid deadlocks
>>>>>> between the timekeeping and hrtimer subsystems, so that we could
>>>>>> notify the hrtimer subsystem that the time had changed while holding
>>>>>> the timekeeping locks. This was done by scheduling delayed work
>>>>>> that would run later once we were out of the timekeeping code.
>>>>>>
>>>>>> But unfortunately the lock chains are complex enough that in
>>>>>> scheduling delayed work, we end up eventually trying to grab
>>>>>> an hrtimer lock.
>>>>>>
>>>>>> Sasha Levin noticed this in testing when the new seqlock lockdep
>>>>>> enablement triggered the following (somewhat abbreviated) message:
>>>>> [snip]
>>>>>
>>>>> This seems to work for me, I don't see the lockdep spew anymore.
>>>>>
>>>>>       Tested-by: Sasha Levin <sasha.levin@oracle.com>
>>>> I think I spoke too soon.
>>>>
>>>> It took way more time to reproduce than previously, but I got:
>>>>
>>>>
>>>> -> #1 (&(&pool->lock)->rlock){-.-...}:
>>>> [ 1195.578519]        [<ffffffff81194803>] validate_chain+0x6c3/0x7b0
>>>> [ 1195.578519]        [<ffffffff81194d9d>] __lock_acquire+0x4ad/0x580
>>>> [ 1195.578519]        [<ffffffff81194ff2>] lock_acquire+0x182/0x1d0
>>>> [ 1195.578519]        [<ffffffff843b0760>] _raw_spin_lock+0x40/0x80
>>>> [ 1195.578519]        [<ffffffff81153e0e>] __queue_work+0x14e/0x3f0
>>>> [ 1195.578519]        [<ffffffff81154168>] queue_work_on+0x98/0x120
>>>> [ 1195.578519]        [<ffffffff81161351>] clock_was_set_delayed+0x21/0x30
>>>> [ 1195.578519]        [<ffffffff811c4b41>] do_adjtimex+0x111/0x160
>>>> [ 1195.578519]        [<ffffffff811360e3>] SYSC_adjtimex+0x43/0x80
>>>> [ 1195.578519]        [<ffffffff8113612e>] SyS_adjtimex+0xe/0x10
>>>> [ 1195.578519]        [<ffffffff843baed0>] tracesys+0xdd/0xe2
>>>> [ 1195.578519]
>>> Are you sure you have that patch applied?
>>>
>>> With it we shouldn't be calling clock_was_set_delayed() from
>>> do_adjtimex().
>> Hm, it seems that there's a conflict there that wasn't resolved
>> properly. Does this patch
>> depend on anything else that's not currently in -next?
> Oh yes, sorry, I didn't cc you on the entire patch set. Apologies!
>
> You'll probably want to grab the two previous patches:
> https://lkml.org/lkml/2013/12/11/479
> https://lkml.org/lkml/2013/12/11/758

Just wanted to follow up here. Did you happen to get a chance to try to
reproduce w/ the three-patch series?

I'm hoping to submit them to Ingo tomorrow, and want to make sure I've
got your tested-by.

thanks
-john


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][PATCH 3/5] timekeeping: Avoid possible deadlock from clock_was_set_delayed
  2013-12-17  5:15             ` John Stultz
@ 2013-12-17  6:41               ` Sasha Levin
  2013-12-17 16:34                 ` John Stultz
  0 siblings, 1 reply; 20+ messages in thread
From: Sasha Levin @ 2013-12-17  6:41 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Thomas Gleixner, Prarit Bhargava, Richard Cochran, Ingo Molnar, stable

On 12/17/2013 12:15 AM, John Stultz wrote:
> On 12/12/2013 11:13 AM, John Stultz wrote:
>> On 12/12/2013 11:05 AM, Sasha Levin wrote:
>>> On 12/12/2013 01:59 PM, John Stultz wrote:
>>>> On 12/12/2013 10:32 AM, Sasha Levin wrote:
>>>>> On 12/12/2013 11:34 AM, Sasha Levin wrote:
>>>>>> On 12/11/2013 02:11 PM, John Stultz wrote:
>>>>>>> As part of normal operations, the hrtimer subsystem frequently calls
>>>>>>> into the timekeeping code, creating a locking order of
>>>>>>>      hrtimer locks -> timekeeping locks
>>>>>>>
>>>>>>> clock_was_set_delayed() was supposed to allow us to avoid deadlocks
>>>>>>> between the timekeeping and hrtimer subsystems, so that we could
>>>>>>> notify the hrtimer subsystem that the time had changed while holding
>>>>>>> the timekeeping locks. This was done by scheduling delayed work
>>>>>>> that would run later, once we were out of the timekeeping code.
>>>>>>>
>>>>>>> But unfortunately the lock chains are complex enough that in
>>>>>>> scheduling delayed work, we end up eventually trying to grab
>>>>>>> an hrtimer lock.
>>>>>>>
>>>>>>> Sasha Levin noticed this in testing when the new seqlock lockdep
>>>>>>> enablement triggered the following (somewhat abbreviated) message:
>>>>>> [snip]
>>>>>>
>>>>>> This seems to work for me, I don't see the lockdep spew anymore.
>>>>>>
>>>>>>        Tested-by: Sasha Levin <sasha.levin@oracle.com>
>>>>> I think I spoke too soon.
>>>>>
>>>>> It took way more time to reproduce than previously, but I got:
>>>>>
>>>>>
>>>>> -> #1 (&(&pool->lock)->rlock){-.-...}:
>>>>> [ 1195.578519]        [<ffffffff81194803>] validate_chain+0x6c3/0x7b0
>>>>> [ 1195.578519]        [<ffffffff81194d9d>] __lock_acquire+0x4ad/0x580
>>>>> [ 1195.578519]        [<ffffffff81194ff2>] lock_acquire+0x182/0x1d0
>>>>> [ 1195.578519]        [<ffffffff843b0760>] _raw_spin_lock+0x40/0x80
>>>>> [ 1195.578519]        [<ffffffff81153e0e>] __queue_work+0x14e/0x3f0
>>>>> [ 1195.578519]        [<ffffffff81154168>] queue_work_on+0x98/0x120
>>>>> [ 1195.578519]        [<ffffffff81161351>]
>>>>> clock_was_set_delayed+0x21/0x30
>>>>> [ 1195.578519]        [<ffffffff811c4b41>] do_adjtimex+0x111/0x160
>>>>> [ 1195.578519]        [<ffffffff811360e3>] SYSC_adjtimex+0x43/0x80
>>>>> [ 1195.578519]        [<ffffffff8113612e>] SyS_adjtimex+0xe/0x10
>>>>> [ 1195.578519]        [<ffffffff843baed0>] tracesys+0xdd/0xe2
>>>>> [ 1195.578519]
>>>> Are you sure you have that patch applied?
>>>>
>>>> With it we shouldn't be calling clock_was_set_delayed() from
>>>> do_adjtimex().
>>> Hm, it seems that there's a conflict there that wasn't resolved
>>> properly. Does this patch
>>> depend on anything else that's not currently in -next?
>> Oh yes, sorry, I didn't cc you on the entire patch set. Apologies!
>>
>> You'll probably want to grab the two previous patches:
>> https://lkml.org/lkml/2013/12/11/479
>> https://lkml.org/lkml/2013/12/11/758
>
> Just wanted to follow up here. Did you happen to get a chance to try to
> reproduce w/ the three-patch series?
>
> I'm hoping to submit them to Ingo tomorrow, and want to make sure I've
> got your tested-by.

Oh yeah, have been running it ever since, haven't seen the issue reproduce.


Thanks,
Sasha
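[Editorial aside, not part of the thread: the ordering cycle discussed above — hrtimer locks taken before timekeeping locks in normal operation, while clock_was_set_delayed() queues work under the timekeeping locks and the workqueue internals eventually reach back to an hrtimer lock via pool->lock — is exactly the kind of loop lockdep reports. The toy checker below is a sketch of that detection idea only; it is not kernel code, the exact chain in the real report is longer, and the lock names are hypothetical.]

```python
# A toy lockdep-style lock-order checker. NOT kernel code: the class
# and the lock names below are illustrative assumptions.

class LockOrderChecker:
    """Records the order in which locks are taken and flags cycles."""

    def __init__(self):
        # edges[a] = set of locks that were acquired while `a` was held
        self.edges = {}

    def acquire(self, held, new):
        """Record taking `new` while the locks in `held` are held.

        Returns a warning string if a new edge closes a cycle (i.e.
        `new` already transitively orders before a held lock),
        otherwise None.
        """
        warning = None
        for h in held:
            if self._reaches(new, h):
                warning = f"possible deadlock: {h} -> {new} -> ... -> {h}"
            self.edges.setdefault(h, set()).add(new)
        return warning

    def _reaches(self, src, dst, seen=None):
        # Depth-first search over the recorded ordering edges.
        if seen is None:
            seen = set()
        if src == dst:
            return True
        seen.add(src)
        return any(self._reaches(n, dst, seen)
                   for n in self.edges.get(src, ()) if n not in seen)


checker = LockOrderChecker()
# Normal operation: hrtimer code calls into timekeeping.
assert checker.acquire(["hrtimer_lock"], "timekeeper_lock") is None
# clock_was_set_delayed(): queue work while holding timekeeping locks.
assert checker.acquire(["timekeeper_lock"], "pool_lock") is None
# Workqueue internals eventually take an hrtimer lock -> cycle closed.
print(checker.acquire(["pool_lock"], "hrtimer_lock"))
# -> possible deadlock: pool_lock -> hrtimer_lock -> ... -> pool_lock
```

The fix in the thread amounts to breaking one edge of this cycle: not calling clock_was_set_delayed() (and thus not touching pool->lock) while the timekeeping locks are held.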


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC][PATCH 3/5] timekeeping: Avoid possible deadlock from clock_was_set_delayed
  2013-12-17  6:41               ` Sasha Levin
@ 2013-12-17 16:34                 ` John Stultz
  0 siblings, 0 replies; 20+ messages in thread
From: John Stultz @ 2013-12-17 16:34 UTC (permalink / raw)
  To: Sasha Levin, LKML
  Cc: Thomas Gleixner, Prarit Bhargava, Richard Cochran, Ingo Molnar, stable

On 12/16/2013 10:41 PM, Sasha Levin wrote:
> On 12/17/2013 12:15 AM, John Stultz wrote:
>>
>> Just wanted to follow up here. Did you happen to get a chance to try to
>> reproduce w/ the three-patch series?
>>
>> I'm hoping to submit them to Ingo tomorrow, and want to make sure I've
>> got your tested-by.
>
> Oh yeah, have been running it ever since, haven't seen the issue
> reproduce.

Great! I'll add your tested-by line, if that's ok.

thanks
-john


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2013-12-17 16:34 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-11 19:11 [RFC][PATCH 0/5] Timekeeping fixes v2 John Stultz
2013-12-11 19:11 ` [RFC][PATCH 1/5] timekeeping: Fix lost updates to tai adjustment John Stultz
2013-12-11 19:11 ` [RFC][PATCH 2/5] timekeeping: Fix potential lost pv notification of time change John Stultz
2013-12-11 19:11   ` John Stultz
2013-12-11 19:11 ` [RFC][PATCH 3/5] timekeeping: Avoid possible deadlock from clock_was_set_delayed John Stultz
2013-12-12 13:23   ` Ingo Molnar
2013-12-12 18:53     ` John Stultz
2013-12-12 16:34   ` Sasha Levin
2013-12-12 18:32     ` Sasha Levin
2013-12-12 18:59       ` John Stultz
2013-12-12 19:05         ` Sasha Levin
2013-12-12 19:13           ` John Stultz
2013-12-17  5:15             ` John Stultz
2013-12-17  6:41               ` Sasha Levin
2013-12-17 16:34                 ` John Stultz
2013-12-11 19:11 ` [RFC][PATCH 4/5] timekeeping: Fix CLOCK_TAI timer/nanosleep delays John Stultz
2013-12-12 13:25   ` Ingo Molnar
2013-12-12 18:31     ` John Stultz
2013-12-13 14:10       ` Ingo Molnar
2013-12-11 19:11 ` [RFC][PATCH 5/5] timekeeping: Fix missing timekeeping_update in suspend path John Stultz
