linux-kernel.vger.kernel.org archive mirror
* [RFC,v3] timekeeping: Limit the sleep time to avoid overflow
@ 2016-07-30 15:27 Chen Yu
  2016-08-17 19:13 ` John Stultz
  2016-08-17 19:18 ` [PATCH] timekeeping: Cap array access in timekeeping_debug to protect against invalid sleep times John Stultz
  0 siblings, 2 replies; 5+ messages in thread
From: Chen Yu @ 2016-07-30 15:27 UTC (permalink / raw)
  To: Thomas Gleixner, John Stultz
  Cc: Chen Yu, Stable # 3 . 17+,
	Rafael J . Wysocki, Xunlei Pang, Zhang Rui, linux-kernel,
	linux-pm

It was reported that hibernation fails on the 2nd attempt: the system
hangs at hibernate() -> syscore_resume() -> i8237A_resume()
-> claim_dma_lock(), because the lock has already been taken.
However, no other process actually tries to grab this lock on
the affected platform.

Further investigation shows that the problem is triggered by
setting /sys/power/pm_trace to 1 before the 1st hibernation.
Once pm_trace is enabled, the rtc becomes meaningless after
suspend, and some BIOSes adjust the 'invalid' tsc (e.g. earlier
than 1970) to the release date of the motherboard during the POST
stage. After resume, the meaningless tsc delta can therefore
produce a very long sleep time. Then, in timekeeping_resume ->
tk_debug_account_sleep_time, if bit 31 of the sleep time happens
to be set, fls() returns 32 and we increment sleep_time_bin[32],
which overwrites memory past the end of the array.
As depicted by System.map:

ffffffff81c9d080 b sleep_time_bin
ffffffff81c9d100 B dma_spin_lock

the dma_spin_lock.val is set to 1, which caused this problem.
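
The arithmetic behind the System.map observation can be checked in user
space. The following is only a sketch: fls32() is a hypothetical stand-in
for the kernel's fls(), and it assumes that fls() sees only the low 32 bits
of tv_sec and that unsigned int is 4 bytes wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BINS 32	/* size of sleep_time_bin[] before any fix */

/*
 * Hypothetical user-space stand-in for the kernel's fls(): returns the
 * 1-based position of the most significant set bit, or 0 for x == 0.
 */
static int fls32(uint32_t x)
{
	int pos = 0;

	while (x) {
		pos++;
		x >>= 1;
	}
	return pos;
}

/* Byte offset from the start of the array that an uncapped index touches. */
static size_t uncapped_offset(int64_t sleep_secs)
{
	/* Assumption: only the low 32 bits of tv_sec reach fls(). */
	int bin = fls32((uint32_t)sleep_secs);

	assert(bin >= 0 && bin <= 32);
	return (size_t)bin * sizeof(unsigned int);
}

/* Distance between the two System.map symbols quoted above. */
static uint64_t symbol_gap(void)
{
	return 0xffffffff81c9d100ULL - 0xffffffff81c9d080ULL;
}
```

With bit 31 set, the uncapped index is 32, one past the last valid slot
(NUM_BINS - 1). The resulting byte offset equals the gap between the two
symbols, which is consistent with the increment landing on dma_spin_lock.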

In theory we can avoid the overflow by skipping the sleep time injection
if pm_trace is enabled, but we might still miss other cases that
can also break the rtc, e.g. a buggy clocksource/rtc driver,
or even a user space tool such as hwclock, so there is no generic
method to determine whether we should trust the tsc.

A simpler way is to set a threshold for the sleep time and
ignore abnormal values. This patch sets the upper limit of the
sleep time to 0x7fffffff seconds, since no one is likely to sleep
that long (68 years).
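
As a quick check of the 68-year figure, here is a small user-space sketch.
SECS_PER_YEAR, max_sleep_years() and sleep_too_long() are hypothetical names
introduced for illustration, and the 365.25-day year is an assumption made
only for this estimate.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SLEEP_TIME 0x7fffffff	/* limit proposed in the patch, in seconds */

/* Assumption for this estimate: an average year of 365.25 days. */
#define SECS_PER_YEAR (365.25 * 24 * 60 * 60)

/* The proposed limit expressed in years. */
static double max_sleep_years(void)
{
	return (double)MAX_SLEEP_TIME / SECS_PER_YEAR;
}

/* Mirrors the comparison the patch adds: nonzero if the delta is rejected. */
static int sleep_too_long(int64_t tv_sec)
{
	assert(MAX_SLEEP_TIME == INT32_MAX);
	return tv_sec > MAX_SLEEP_TIME;
}
```

Any delta that passes this check has bit 31 clear, so a 32-bit fls() on it
returns at most 31, which stays inside a 32-entry array.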

Cc: Stable <stable@vger.kernel.org> # 3.17+
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Xunlei Pang <xpang@redhat.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Reported-and-tested-by: Janek Kozicki <cosurgi@gmail.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/time/timekeeping.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 3b65746..17bc72c 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1509,6 +1509,7 @@ void __init timekeeping_init(void)
 
 /* time in seconds when suspend began for persistent clock */
 static struct timespec64 timekeeping_suspend_time;
+#define MAX_SLEEP_TIME 0x7fffffff
 
 /**
  * __timekeeping_inject_sleeptime - Internal function to add sleep interval
@@ -1520,7 +1521,8 @@ static struct timespec64 timekeeping_suspend_time;
 static void __timekeeping_inject_sleeptime(struct timekeeper *tk,
 					   struct timespec64 *delta)
 {
-	if (!timespec64_valid_strict(delta)) {
+	if (!timespec64_valid_strict(delta) ||
+	     delta->tv_sec > MAX_SLEEP_TIME) {
 		printk_deferred(KERN_WARNING
 				"__timekeeping_inject_sleeptime: Invalid "
 				"sleep delta value!\n");
-- 
2.7.4


* Re: [RFC,v3] timekeeping: Limit the sleep time to avoid overflow
  2016-07-30 15:27 [RFC,v3] timekeeping: Limit the sleep time to avoid overflow Chen Yu
@ 2016-08-17 19:13 ` John Stultz
  2016-08-17 19:18 ` [PATCH] timekeeping: Cap array access in timekeeping_debug to protect against invalid sleep times John Stultz
  1 sibling, 0 replies; 5+ messages in thread
From: John Stultz @ 2016-08-17 19:13 UTC (permalink / raw)
  To: Chen Yu
  Cc: Thomas Gleixner, Stable # 3 . 17+,
	Rafael J . Wysocki, Xunlei Pang, Zhang Rui, lkml, Linux PM list

On Sat, Jul 30, 2016 at 8:27 AM, Chen Yu <yu.c.chen@intel.com> wrote:
> It is reported the hibernation fails at 2nd attempt, which
> hangs at hibernate() -> syscore_resume() -> i8237A_resume()
> -> claim_dma_lock(), because the lock has already been taken.
> However there is actually no other process would like to grab
> this lock on that problematic platform.
>
> Further investigation shows that, the problem is triggered by
> setting /sys/power/pm_trace to 1 before the 1st hibernation.
> Since once pm_trace is enabled, the rtc becomes unmeaningful
> after suspend, and meanwhile some BIOSes would like to adjust
> the 'invalid' tsc(e.g, smaller than 1970) to the release date
> of that motherboard during POST stage, thus after resumed, a
> significant long sleep time might be generated due to meaningless
> tsc delta, thus in timekeeping_resume -> tk_debug_account_sleep_time,
> if the bit31 happened to be set to 1, the fls returns 32 and then we
> add 1 to sleep_time_bin[32], which caused a memory overwritten.

Sorry for taking a while to get to this.

So this bit seems like it's actually the problematic thing. If fls()
returns 32, but the max index in sleep_time_bin[] is 31, then that's
the real issue, and that's where memory is being corrupted.

Adding a check on the fls() value used, or defining
sleep_time_bin[] with 33 entries, seems like the proper fix.

And actually, looking closer, since it's a timespec64, the fls() could
return something as large as 64, so capping it is probably more sane.
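
To illustrate why capping is more robust than growing the array to 33
entries, here is a user-space sketch. fls64_sketch(), capped_bin() and
account_sleep_time() are hypothetical names, and the 64-bit find-last-set
helper models the worst case mentioned above rather than what the kernel
call actually does.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BINS 32

static unsigned int bins[NUM_BINS];

/* Hypothetical 64-bit find-last-set helper: 1-based MSB position, 0 for 0. */
static int fls64_sketch(uint64_t x)
{
	int pos = 0;

	while (x) {
		pos++;
		x >>= 1;
	}
	return pos;
}

/* Clamp the bin index to the last valid slot, whatever the input width. */
static int capped_bin(int64_t sleep_secs)
{
	int bin = fls64_sketch((uint64_t)sleep_secs);

	if (bin > NUM_BINS - 1)
		bin = NUM_BINS - 1;
	return bin;
}

/* Account one sleep interval; the index can no longer leave the array. */
static void account_sleep_time(int64_t sleep_secs)
{
	int bin = capped_bin(sleep_secs);

	assert(bin >= 0 && bin < NUM_BINS);
	bins[bin]++;
}
```

With the cap in place, every oversized or bogus value lands in the last bin,
so the array size no longer has to track the width of tv_sec.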

> As depicted by System.map:
>
> ffffffff81c9d080 b sleep_time_bin
> ffffffff81c9d100 B dma_spin_lock
>
> the dma_spin_lock.val is set to 1, which caused this problem.
>
> In theory we can avoid the overflow by ignoring the idle injection
> if pm_trace is enabled, but we might still miss other cases which
> might also break the rtc, e.g, buggy clocksoure/rtc driver,
> or even user space tool such as hwclock -- so there is no generic
> method to dertermin whether we should trust the tsc.
>
> A simpler way is to set the threshold for the sleep time, and
> ignore those abnormal ones. This patch sets the upper limit of
> sleep seconds to 0x7fffffff, since no one is likely to sleep
> that long(68 years).

Having a validity sanity check is probably a good idea as well, but this
papers over the real problem.

I'll send out a version of the above that I think makes the most sense,
and if there are no objections I'll queue it. I am also happy to add a
patch like this one, but the commit message would need to be simplified
to just say that out of paranoia we want to cap the maximum valid sleep
time to 68 years.

thanks
-john


* [PATCH] timekeeping: Cap array access in timekeeping_debug to protect against invalid sleep times
  2016-07-30 15:27 [RFC,v3] timekeeping: Limit the sleep time to avoid overflow Chen Yu
  2016-08-17 19:13 ` John Stultz
@ 2016-08-17 19:18 ` John Stultz
  2016-08-18  1:09   ` Rafael J. Wysocki
  1 sibling, 1 reply; 5+ messages in thread
From: John Stultz @ 2016-08-17 19:18 UTC (permalink / raw)
  To: Chen Yu
  Cc: John Stultz, Thomas Gleixner, Rafael J. Wysocki, Janek Kozicki,
	Xunlei Pang, Zhang Rui, linux-kernel, linux-pm

It was reported that hibernation could fail on the 2nd attempt,
where the system hangs at hibernate() -> syscore_resume() ->
i8237A_resume() -> claim_dma_lock(), because the lock has
already been taken.

However, no other process actually tries to grab this lock on
the affected platform.

Further investigation showed that the problem is triggered by
setting /sys/power/pm_trace to 1 before the 1st hibernation.

Once pm_trace is enabled, the rtc becomes meaningless after
suspend, and some BIOSes adjust the 'invalid' tsc (e.g. earlier
than 1970) to the release date of the motherboard during the
POST stage. After resume, it may therefore seem that the system
slept for a very long time, due to the meaningless tsc or RTC
delta.

Then in timekeeping_resume -> tk_debug_account_sleep_time, if
bit 31 of the sleep time happens to be set, fls() returns 32
and we increment sleep_time_bin[32], which overwrites memory
past the end of the array.

As depicted by System.map:
ffffffff81c9d080 b sleep_time_bin
ffffffff81c9d100 B dma_spin_lock
the dma_spin_lock.val is set to 1, which caused this problem.

This patch adds a sanity check in tk_debug_account_sleep_time()
to ensure we don't index past the sleep_time_bin array.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Janek Kozicki <cosurgi@gmail.com>
Cc: Chen Yu <yu.c.chen@intel.com>
Cc: Xunlei Pang <xpang@redhat.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Reported-by: Janek Kozicki <cosurgi@gmail.com>
Reported-by: Chen Yu <yu.c.chen@intel.com>
[jstultz: Problem diagnosed and original patch by Chen Yu, I've
 solved the issue slightly differently, but borrowed his excellent
 explanation of the issue here.]
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 kernel/time/timekeeping_debug.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timekeeping_debug.c b/kernel/time/timekeeping_debug.c
index f6bd652..107310a6 100644
--- a/kernel/time/timekeeping_debug.c
+++ b/kernel/time/timekeeping_debug.c
@@ -23,7 +23,9 @@
 
 #include "timekeeping_internal.h"
 
-static unsigned int sleep_time_bin[32] = {0};
+#define NUM_BINS 32
+
+static unsigned int sleep_time_bin[NUM_BINS] = {0};
 
 static int tk_debug_show_sleep_time(struct seq_file *s, void *data)
 {
@@ -69,6 +71,9 @@ late_initcall(tk_debug_sleep_time_init);
 
 void tk_debug_account_sleep_time(struct timespec64 *t)
 {
-	sleep_time_bin[fls(t->tv_sec)]++;
+	/* Cap bin index so we don't overflow the array */
+	int bin = min(fls(t->tv_sec), NUM_BINS-1);
+
+	sleep_time_bin[bin]++;
 }
 
-- 
1.9.1


* Re: [PATCH] timekeeping: Cap array access in timekeeping_debug to protect against invalid sleep times
  2016-08-17 19:18 ` [PATCH] timekeeping: Cap array access in timekeeping_debug to protect against invalid sleep times John Stultz
@ 2016-08-18  1:09   ` Rafael J. Wysocki
  2016-08-18  7:31     ` Chen Yu
  0 siblings, 1 reply; 5+ messages in thread
From: Rafael J. Wysocki @ 2016-08-18  1:09 UTC (permalink / raw)
  To: John Stultz
  Cc: Chen Yu, Thomas Gleixner, Janek Kozicki, Xunlei Pang, Zhang Rui,
	linux-kernel, linux-pm

On Wednesday, August 17, 2016 12:18:50 PM John Stultz wrote:
> It was reported that hibernation could fail on the 2nd attempt,
> where the system hangs at hibernate() -> syscore_resume() ->
> i8237A_resume() -> claim_dma_lock(), because the lock has
> already been taken.
> 
> However there is actually no other process would like to grab
> this lock on that problematic platform.
> 
> Further investigation showed that the problem is triggered by
> setting /sys/power/pm_trace to 1 before the 1st hibernation.
> 
> Since once pm_trace is enabled, the rtc becomes unmeaningful
> after suspend, and meanwhile some BIOSes would like to adjust
> the 'invalid' tsc(e.g, smaller than 1970) to the release date
> of that motherboard during POST stage, thus after resumed, it
> may seem that the system had a significant long sleep time might
> due to meaningless tsc or RTC delta.
> 
> Then in timekeeping_resume -> tk_debug_account_sleep_time, if
> the bit31 of the sleep time happened to be set to 1, the fls
> returns 32 and then we add 1 to sleep_time_bin[32], which
> caused a memory overwritten.
> 
> As depicted by System.map:
> ffffffff81c9d080 b sleep_time_bin
> ffffffff81c9d100 B dma_spin_lock
> the dma_spin_lock.val is set to 1, which caused this problem.
> 
> This patch adds a sanity check in tk_debug_account_sleep_time()
> to ensure we don't index past the sleep_time_bin array.
> 
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
> Cc: Janek Kozicki <cosurgi@gmail.com>
> Cc: Chen Yu <yu.c.chen@intel.com>
> Cc: Xunlei Pang <xpang@redhat.com>
> Cc: Zhang Rui <rui.zhang@intel.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-pm@vger.kernel.org
> Reported-by: Janek Kozicki <cosurgi@gmail.com>
> Reported-by: Chen Yu <yu.c.chen@intel.com>
> [jstultz: Problem diagnosed and original patch by Chen Yu, I've
>  solved the issue slightly differently, but borrowed his excelent
>  explanation of of the issue here.]
> Signed-off-by: John Stultz <john.stultz@linaro.org>
> ---
>  kernel/time/timekeeping_debug.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/time/timekeeping_debug.c b/kernel/time/timekeeping_debug.c
> index f6bd652..107310a6 100644
> --- a/kernel/time/timekeeping_debug.c
> +++ b/kernel/time/timekeeping_debug.c
> @@ -23,7 +23,9 @@
>  
>  #include "timekeeping_internal.h"
>  
> -static unsigned int sleep_time_bin[32] = {0};
> +#define NUM_BINS 32
> +
> +static unsigned int sleep_time_bin[NUM_BINS] = {0};
>  
>  static int tk_debug_show_sleep_time(struct seq_file *s, void *data)
>  {
> @@ -69,6 +71,9 @@ late_initcall(tk_debug_sleep_time_init);
>  
>  void tk_debug_account_sleep_time(struct timespec64 *t)
>  {
> -	sleep_time_bin[fls(t->tv_sec)]++;
> +	/* Cap bin index so we don't overflow the array */
> +	int bin = min(fls(t->tv_sec), NUM_BINS-1);
> +
> +	sleep_time_bin[bin]++;
>  }
>  
> 


If pm_trace_enabled is set, we can (or maybe even should) just skip
timekeeping_inject_sleeptime() entirely in rtc_resume() at least, because
sleep_time is almost certainly bogus in that case, even if it doesn't
overflow.

Of course, the above is still needed then.
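
A rough user-space model of what I mean, not kernel code: struct
resume_state and resume_inject_sketch() are made-up names for illustration,
and only the pm_trace_enabled flag comes from the discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the resume-time decision; not kernel code. */
struct resume_state {
	bool pm_trace_enabled;	/* mirrors the flag mentioned above */
	int64_t injected_secs;	/* total sleep time accepted so far */
};

/* Returns true if the delta was injected, false if it was skipped. */
static bool resume_inject_sketch(struct resume_state *st, int64_t sleep_secs)
{
	/* With pm_trace on, the RTC no longer holds a meaningful time. */
	if (st->pm_trace_enabled)
		return false;

	assert(sleep_secs >= 0);
	st->injected_secs += sleep_secs;
	return true;
}
```

The point of the early return is that a bogus delta is dropped even when it
is small enough not to overflow anything.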

Thanks,
Rafael


* Re: [PATCH] timekeeping: Cap array access in timekeeping_debug to protect against invalid sleep times
  2016-08-18  1:09   ` Rafael J. Wysocki
@ 2016-08-18  7:31     ` Chen Yu
  0 siblings, 0 replies; 5+ messages in thread
From: Chen Yu @ 2016-08-18  7:31 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Thomas Gleixner, Janek Kozicki, Xunlei Pang, Zhang Rui,
	linux-kernel, linux-pm, John Stultz



On 2016-08-18 09:09, Rafael J. Wysocki wrote:
> If pm_trace_enabled is set, we can (or maybe even should) just skip
> timekeeping_inject_sleeptime() entirely in rtc_resume() at least, because
> sleep_time is almost certainly bogus in that case, even if it doesn't
> overflow.
>
> Of course, the above is still needed then.
>
> Thanks,
> Rafael
>
OK, I'll provide another version on top of John's patch that takes
pm_trace into account. Thanks.
Yu
