linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [REGRESSION] Xorg doesn't like 4e8b14526 "time: Improve sanity checking of timekeeping inputs"
@ 2012-08-31  4:05 Andreas Bombe
  2012-08-31  4:25 ` Linus Torvalds
  2012-08-31 17:43 ` John Stultz
  0 siblings, 2 replies; 6+ messages in thread
From: Andreas Bombe @ 2012-08-31  4:05 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds, John Stultz, Thomas Gleixner

I have recently started to get problems with X simply shutting itself
down and returning to the login screen. In the X logs I find:

> [  1492.936] 
> Fatal server error:
> [  1492.936] WaitForSomething(): select: Invalid argument

No messages whatsoever is found in the kernel logs. This error happens
randomly without any correlation to user input, but with a high
likelihood (within a few minutes at most) when a video is playing. It
doesn't matter if the video is in Flash in a browser window or in a
video player playing a local file.

With that somewhat easy test I bisected it down to 4e8b14526 "time:
Improve sanity checking of timekeeping inputs". The latest Linus git
(155e36d40) with a revert of the bisected commit does not show the
problem.

Video is Radeon HD 6950 with open source drivers. Xorg version is the
one currently in Debian unstable (xserver-xorg-core: 2:1.12.3.902-1,
xserver-xorg-video-radeon: 1:6.14.4-5, libdrm: 2.4.33-3).

-- 
Andreas Bombe

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [REGRESSION] Xorg doesn't like 4e8b14526 "time: Improve sanity checking of timekeeping inputs"
  2012-08-31  4:05 [REGRESSION] Xorg doesn't like 4e8b14526 "time: Improve sanity checking of timekeeping inputs" Andreas Bombe
@ 2012-08-31  4:25 ` Linus Torvalds
  2012-08-31 17:41   ` Andreas Bombe
  2012-08-31 17:43 ` John Stultz
  1 sibling, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2012-08-31  4:25 UTC (permalink / raw)
  To: Andreas Bombe; +Cc: linux-kernel, John Stultz, Thomas Gleixner

On Thu, Aug 30, 2012 at 9:05 PM, Andreas Bombe <aeb@debian.org> wrote:
>
> With that somewhat easy test I bisected it down to 4e8b14526 "time:
> Improve sanity checking of timekeeping inputs". The latest Linus git
> (155e36d40) with a revert of the bisected commit does not show the
> problem.

Ok, I guess we need to revert it. Although it might be interesting to
add a WARN_ON_ONCE() for the case of timespec_valid() returning false,
to just see exactly *where* that thing triggers. Could you do that? In
fact, do it with separate WARN_ON_ONCE's for each of the reasons that
function returns false, so that we also see which check it is that
triggers. Ok?

            Linus

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [REGRESSION] Xorg doesn't like 4e8b14526 "time: Improve sanity checking of timekeeping inputs"
  2012-08-31  4:25 ` Linus Torvalds
@ 2012-08-31 17:41   ` Andreas Bombe
  2012-08-31 17:56     ` Linus Torvalds
  0 siblings, 1 reply; 6+ messages in thread
From: Andreas Bombe @ 2012-08-31 17:41 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, John Stultz, Thomas Gleixner

On Thu, Aug 30, 2012 at 09:25:52PM -0700, Linus Torvalds wrote:
> On Thu, Aug 30, 2012 at 9:05 PM, Andreas Bombe <aeb@debian.org> wrote:
> >
> > With that somewhat easy test I bisected it down to 4e8b14526 "time:
> > Improve sanity checking of timekeeping inputs". The latest Linus git
> > (155e36d40) with a revert of the bisected commit does not show the
> > problem.
> 
> Ok, I guess we need to revert it. Although it might be interesting to
> add a WARN_ON_ONCE() for the case of timespec_valid() returning false,
> to just see exactly *where* that thing triggers. Could you do that? In
> fact, do it with separate WARN_ON_ONCE's for each of the reasons that
> function returns false, so that we also see which check it is that
> triggers. Ok?

It triggers on ((unsigned long long)ts->tv_sec >= KTIME_SEC_MAX).
Looking at some straces (I could have thought of that earlier…) X does
in fact call select with unreasonable timeouts:

| 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 20000}) = 1 (in [24], left {0, 19988})
| 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 19000}) = 1 (in [24], left {0, 18988})
| 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 19000}) = 1 (in [24], left {0, 16804})
| 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 16000}) = 1 (in [24], left {0, 15988})
| 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 16000}) = 1 (in [9], left {0, 3649})
| 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 3000}) = 1 (in [24], left {0, 2988})
| 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 2000}) = 1 (in [24], left {0, 1988})
| 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 2000}) = 0 (Timeout)
| 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {18446744073709551, 615000}) = -1 EINVAL (Invalid argument)

The time values are actually decreasing from 90 seconds to this. That
seconds value is actually (0ULL - 1) / 1000, so something is
decrementing the timeout beyond zero. I don't see how it could happen
directly in WaitForSomething in the X server sources[1], it's probably
in the BlockHandler callbacks somewhere.

Have to dig deeper to see if that is a long standing issue.

[1] http://cgit.freedesktop.org/xorg/xserver/tree/os/WaitFor.c?h=mpx&id=xorg-server-1.12.3.902#n145

-- 
Andreas Bombe

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [REGRESSION] Xorg doesn't like 4e8b14526 "time: Improve sanity checking of timekeeping inputs"
  2012-08-31  4:05 [REGRESSION] Xorg doesn't like 4e8b14526 "time: Improve sanity checking of timekeeping inputs" Andreas Bombe
  2012-08-31  4:25 ` Linus Torvalds
@ 2012-08-31 17:43 ` John Stultz
  2012-09-01  2:02   ` Andreas Bombe
  1 sibling, 1 reply; 6+ messages in thread
From: John Stultz @ 2012-08-31 17:43 UTC (permalink / raw)
  To: Andreas Bombe; +Cc: linux-kernel, Linus Torvalds, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 1864 bytes --]

On 08/30/2012 09:05 PM, Andreas Bombe wrote:
> I have recently started to get problems with X simply shutting itself
> down and returning to the login screen. In the X logs I find:
>
>> [  1492.936]
>> Fatal server error:
>> [  1492.936] WaitForSomething(): select: Invalid argument
> No messages whatsoever is found in the kernel logs. This error happens
> randomly without any correlation to user input, but with a high
> likelihood (within a few minutes at most) when a video is playing. It
> doesn't matter if the video is in Flash in a browser window or in a
> video player playing a local file.
>
> With that somewhat easy test I bisected it down to 4e8b14526 "time:
> Improve sanity checking of timekeeping inputs". The latest Linus git
> (155e36d40) with a revert of the bisected commit does not show the
> problem.
>
> Video is Radeon HD 6950 with open source drivers. Xorg version is the
> one currently in Debian unstable (xserver-xorg-core: 2:1.12.3.902-1,
>   xserver-xorg-video-radeon: 1:6.14.4-5, libdrm: 2.4.33-3).

Thanks so much for bisecting this down!
I'm guessing X is passing crazy large timespecs into select (via 
WaitForSomething()) values that are catching on the ktime_t overflow 
check in timespec_valid(). Previously these would be clamped to 
KTIME_MAX (which basically is infinity) in the timer subsystem before.

So the issue is the patch in question is too strict in its validation. 
We want to be strict on things like timekeeping inputs, but for timers 
wait to infinity is still valid.

The attached (sorry not inline, on the road) patch should fix this, but 
could you verify it? (I'm running my testing concurrently)

Linus: The issue the patch in question addresses has only been reported 
from trinity stress testing and a system with a crazy CMOS clock value, 
so I'm ok with the revert if you'd prefer that.

thanks
-john

[-- Attachment #2: 0001-time-Move-ktime_t-overflow-checking-into-timespec_va.patch --]
[-- Type: text/x-patch, Size: 3469 bytes --]

>From a71cdf72cbd10ae5ef6b061db7ab3019293933a4 Mon Sep 17 00:00:00 2001
From: John Stultz <john.stultz@linaro.org>
Date: Fri, 31 Aug 2012 13:30:06 -0400
Subject: [PATCH] time: Move ktime_t overflow checking into
 timespec_valid_strict

Andreas Bombe reported that the added ktime_t overflow
checking added to timespec_valid in commit 4e8b14526
was causing problems with X.org because it caused
timeouts larger then KTIME_T to be invalid.

Previously, these large timeouts would be clamped
to KTIME_MAX and would never expire, which is valid.

This patch splits the ktime_t overflow checking into
a new timespec_valid_strict function, and converts
the timekeeping code's checking to use this more
strict function.

Cc: Andreas Bombe <aeb@debian.org>
Cc: Zhouping Liu <zliu@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Reported-by: Andreas Bombe <aeb@debian.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/time.h      |    7 +++++++
 kernel/time/timekeeping.c |   10 +++++-----
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/include/linux/time.h b/include/linux/time.h
index b0bbd8f..b51e664 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -125,6 +125,13 @@ static inline bool timespec_valid(const struct timespec *ts)
 	/* Can't have more nanoseconds then a second */
 	if ((unsigned long)ts->tv_nsec >= NSEC_PER_SEC)
 		return false;
+	return true;
+}
+
+static inline bool timespec_valid_strict(const struct timespec *ts)
+{
+	if (!timespec_valid(ts))
+		return false;
 	/* Disallow values that could overflow ktime_t */
 	if ((unsigned long long)ts->tv_sec >= KTIME_SEC_MAX)
 		return false;
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 0c1485e..34e5eac 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -428,7 +428,7 @@ int do_settimeofday(const struct timespec *tv)
 	struct timespec ts_delta, xt;
 	unsigned long flags;
 
-	if (!timespec_valid(tv))
+	if (!timespec_valid_strict(tv))
 		return -EINVAL;
 
 	write_seqlock_irqsave(&tk->lock, flags);
@@ -476,7 +476,7 @@ int timekeeping_inject_offset(struct timespec *ts)
 
 	/* Make sure the proposed value is valid */
 	tmp = timespec_add(tk_xtime(tk),  *ts);
-	if (!timespec_valid(&tmp)) {
+	if (!timespec_valid_strict(&tmp)) {
 		ret = -EINVAL;
 		goto error;
 	}
@@ -659,7 +659,7 @@ void __init timekeeping_init(void)
 	struct timespec now, boot, tmp;
 
 	read_persistent_clock(&now);
-	if (!timespec_valid(&now)) {
+	if (!timespec_valid_strict(&now)) {
 		pr_warn("WARNING: Persistent clock returned invalid value!\n"
 			"         Check your CMOS/BIOS settings.\n");
 		now.tv_sec = 0;
@@ -667,7 +667,7 @@ void __init timekeeping_init(void)
 	}
 
 	read_boot_clock(&boot);
-	if (!timespec_valid(&boot)) {
+	if (!timespec_valid_strict(&boot)) {
 		pr_warn("WARNING: Boot clock returned invalid value!\n"
 			"         Check your CMOS/BIOS settings.\n");
 		boot.tv_sec = 0;
@@ -713,7 +713,7 @@ static struct timespec timekeeping_suspend_time;
 static void __timekeeping_inject_sleeptime(struct timekeeper *tk,
 							struct timespec *delta)
 {
-	if (!timespec_valid(delta)) {
+	if (!timespec_valid_strict(delta)) {
 		printk(KERN_WARNING "__timekeeping_inject_sleeptime: Invalid "
 					"sleep delta value!\n");
 		return;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [REGRESSION] Xorg doesn't like 4e8b14526 "time: Improve sanity checking of timekeeping inputs"
  2012-08-31 17:41   ` Andreas Bombe
@ 2012-08-31 17:56     ` Linus Torvalds
  0 siblings, 0 replies; 6+ messages in thread
From: Linus Torvalds @ 2012-08-31 17:56 UTC (permalink / raw)
  To: Andreas Bombe; +Cc: linux-kernel, John Stultz, Thomas Gleixner

On Fri, Aug 31, 2012 at 10:41 AM, Andreas Bombe <aeb@debian.org> wrote:
>
> It triggers on ((unsigned long long)ts->tv_sec >= KTIME_SEC_MAX).
> Looking at some straces (I could have thought of that earlier…) X does
> in fact call select with unreasonable timeouts:
>
> | 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 20000}) = 1 (in [24], left {0, 19988})
> | 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 19000}) = 1 (in [24], left {0, 18988})
> | 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 19000}) = 1 (in [24], left {0, 16804})
> | 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 16000}) = 1 (in [24], left {0, 15988})
> | 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 16000}) = 1 (in [9], left {0, 3649})
> | 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 3000}) = 1 (in [24], left {0, 2988})
> | 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 2000}) = 1 (in [24], left {0, 1988})
> | 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {0, 2000}) = 0 (Timeout)
> | 17:46:55 select(256, [1 3 6 9 10 11 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39], NULL, NULL, {18446744073709551, 615000}) = -1 EINVAL (Invalid argument)

Ok, X clearly *is* doing something wrong, and as you point out, it
looks like X has taken a negative milliseconds value, treated it as
positive, and then done a (unsigned) divide by 1000 on that to create
a large positive seconds value.

It's a fairly "natural" error to make if you keep timeouts as
milliseconds, and then convert those to a timeval or timespec.

At the same time, that value is not really "invalid". It's a
crazy-long timeout, but it's a valid timeout. The fact that it doesn't
happen to fit into our "jiffies" value is an internal kernel
implementation issue that we shouldn't be exposing to user space as an
error. Turning the crazy-long timeout into "forever" (or
alternatively, into the longest timeout we support) should be the
right fix.

Andreas, can you check whether John Stultz patch works for you? It
looks like it should.

Also, letting the X people know that they are doing something crazy
sounds like a good idea. Looking at that strace, I'm also struct by
the fact that always giving "256" as the number of fds is going to be
a performance thing. It would be much better if X actually knew how
many bits it really needs, as that means the kernel isn't going to
need to look at the bits. So opening a bugzilla for this sounds like a
good idea.

               Linus

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [REGRESSION] Xorg doesn't like 4e8b14526 "time: Improve sanity checking of timekeeping inputs"
  2012-08-31 17:43 ` John Stultz
@ 2012-09-01  2:02   ` Andreas Bombe
  0 siblings, 0 replies; 6+ messages in thread
From: Andreas Bombe @ 2012-09-01  2:02 UTC (permalink / raw)
  To: John Stultz; +Cc: linux-kernel, Linus Torvalds, Thomas Gleixner

On Fri, Aug 31, 2012 at 10:43:42AM -0700, John Stultz wrote:
> On 08/30/2012 09:05 PM, Andreas Bombe wrote:
> >With that somewhat easy test I bisected it down to 4e8b14526 "time:
> >Improve sanity checking of timekeeping inputs". The latest Linus git
> >(155e36d40) with a revert of the bisected commit does not show the
> >problem.
> 
> Thanks so much for bisecting this down!
> I'm guessing X is passing crazy large timespecs into select (via
> WaitForSomething()) values that are catching on the ktime_t overflow
> check in timespec_valid(). Previously these would be clamped to
> KTIME_MAX (which basically is infinity) in the timer subsystem
> before.
> 
> So the issue is the patch in question is too strict in its
> validation. We want to be strict on things like timekeeping inputs,
> but for timers wait to infinity is still valid.
> 
> The attached (sorry not inline, on the road) patch should fix this,
> but could you verify it? (I'm running my testing concurrently)

I'm running it now and it's looking good. I did the video test again and
confirmed with strace that X was doing the giant timeout in select
again, but this time without any errors.

-- 
Andreas Bombe

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2012-09-01  2:02 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-08-31  4:05 [REGRESSION] Xorg doesn't like 4e8b14526 "time: Improve sanity checking of timekeeping inputs" Andreas Bombe
2012-08-31  4:25 ` Linus Torvalds
2012-08-31 17:41   ` Andreas Bombe
2012-08-31 17:56     ` Linus Torvalds
2012-08-31 17:43 ` John Stultz
2012-09-01  2:02   ` Andreas Bombe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).