linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH] drivers: power: Add watchdog timer to catch drivers which lockup during suspend.
@ 2013-04-30 22:28 Zoran Markovic
  2013-04-30 22:28 ` [RFC PATCH] power: Add option to log time spent in suspend Zoran Markovic
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Zoran Markovic @ 2013-04-30 22:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-pm, Benoit Goby, Android Kernel Team, Colin Cross,
	Todd Poynor, San Mehat, John Stultz, Pavel Machek,
	Rafael J. Wysocki, Len Brown, Greg Kroah-Hartman, Zoran Markovic

From: Benoit Goby <benoit@android.com>

Below is a patch from android kernel that detects a driver suspend
lockup and captures dump in the kernel log. Please review and provide
comments.

Rather than hard-lock the kernel, dump the suspend thread stack and
BUG() when a driver takes too long to suspend.  The timeout is set to
12 seconds to be longer than the usbhid 10 second timeout.

Exclude from the watchdog the time spent waiting for children that
are resumed asynchronously and time every device, whether or not they
resumed synchronously.

Cc: Android Kernel Team <kernel-team@android.com>
Cc: Colin Cross <ccross@android.com>
Cc: Todd Poynor <toddpoynor@google.com>
Cc: San Mehat <san@google.com>
Cc: Benoit Goby <benoit@android.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Len Brown <len.brown@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Original-author: San Mehat <san@google.com>
Signed-off-by: Benoit Goby <benoit@android.com>
[zoran.markovic@linaro.org: Changed printk(KERN_EMERG,...) to pr_emerg(...),
tweaked commit message.]
Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
---
 drivers/base/power/main.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index 15beb50..eb70c0e 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -29,6 +29,8 @@
 #include <linux/async.h>
 #include <linux/suspend.h>
 #include <linux/cpuidle.h>
+#include <linux/timer.h>
+
 #include "../base.h"
 #include "power.h"
 
@@ -54,6 +56,12 @@ struct suspend_stats suspend_stats;
 static DEFINE_MUTEX(dpm_list_mtx);
 static pm_message_t pm_transition;
 
+static void dpm_drv_timeout(unsigned long data);
+struct dpm_drv_wd_data {
+	struct device *dev;
+	struct task_struct *tsk;
+};
+
 static int async_error;
 
 /**
@@ -663,6 +671,30 @@ static bool is_async(struct device *dev)
 }
 
 /**
+ *     dpm_drv_timeout - Driver suspend / resume watchdog handler
+ *     @data: struct device which timed out
+ *
+ *     Called when a driver has timed out suspending or resuming.
+ *     There's not much we can do here to recover so
+ *     BUG() out for a crash-dump
+ *
+ */
+static void dpm_drv_timeout(unsigned long data)
+{
+	struct dpm_drv_wd_data *wd_data = (void *)data;
+	struct device *dev = wd_data->dev;
+	struct task_struct *tsk = wd_data->tsk;
+
+	pr_emerg("**** DPM device timeout: %s (%s)\n", dev_name(dev),
+		(dev->driver ? dev->driver->name : "no driver"));
+
+	pr_emerg("dpm suspend stack:\n");
+	show_stack(tsk, NULL);
+
+	BUG();
+}
+
+/**
  * dpm_resume - Execute "resume" callbacks for non-sysdev devices.
  * @state: PM transition of the system being carried out.
  *
@@ -1053,6 +1085,8 @@ static int __device_suspend(struct device *dev, pm_message_t state, bool async)
 	pm_callback_t callback = NULL;
 	char *info = NULL;
 	int error = 0;
+	struct timer_list timer;
+	struct dpm_drv_wd_data data;
 
 	dpm_wait_for_children(dev, async);
 
@@ -1076,6 +1110,14 @@ static int __device_suspend(struct device *dev, pm_message_t state, bool async)
 	if (dev->power.syscore)
 		goto Complete;
 
+	data.dev = dev;
+	data.tsk = get_current();
+	init_timer_on_stack(&timer);
+	timer.expires = jiffies + HZ * 12;
+	timer.function = dpm_drv_timeout;
+	timer.data = (unsigned long)&data;
+	add_timer(&timer);
+
 	device_lock(dev);
 
 	if (dev->pm_domain) {
@@ -1131,6 +1173,9 @@ static int __device_suspend(struct device *dev, pm_message_t state, bool async)
 
 	device_unlock(dev);
 
+	del_timer_sync(&timer);
+	destroy_timer_on_stack(&timer);
+
  Complete:
 	complete_all(&dev->power.completion);
 	if (error)
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH] power: Add option to log time spent in suspend
  2013-04-30 22:28 [RFC PATCH] drivers: power: Add watchdog timer to catch drivers which lockup during suspend Zoran Markovic
@ 2013-04-30 22:28 ` Zoran Markovic
  2013-05-01  0:29   ` Pavel Machek
  2013-04-30 23:30 ` [RFC PATCH] drivers: power: Add watchdog timer to catch drivers which lockup during suspend Greg Kroah-Hartman
  2013-05-01  0:30 ` Pavel Machek
  2 siblings, 1 reply; 20+ messages in thread
From: Zoran Markovic @ 2013-04-30 22:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-pm, Colin Cross, Android Kernel Team, Todd Poynor,
	San Mehat, Benoit Goby, John Stultz, Pavel Machek,
	Rafael J. Wysocki, Len Brown, Zoran Markovic

From: Colin Cross <ccross@android.com>

Below is a patch from android kernel that maintains a histogram of
suspend times. Please review and provide feedback.

Prints the time spent in suspend in the kernel log, and keeps statistics
on the time spent in suspend in /sys/kernel/debug/suspend_time

Cc: Android Kernel Team <kernel-team@android.com>
Cc: Colin Cross <ccross@android.com>
Cc: Todd Poynor <toddpoynor@google.com>
Cc: San Mehat <san@google.com>
Cc: Benoit Goby <benoit@android.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Len Brown <len.brown@intel.com>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Todd Poynor <toddpoynor@google.com>
[zoran.markovic@linaro.org: Re-formatted suspend time table to better fit
expected values, tweaked commit message]
Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
---
 kernel/power/Kconfig        |    7 +++
 kernel/power/Makefile       |    1 +
 kernel/power/suspend_time.c |  111 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 119 insertions(+)
 create mode 100644 kernel/power/suspend_time.c

diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig
index 5dfdc9e..edc8bdd 100644
--- a/kernel/power/Kconfig
+++ b/kernel/power/Kconfig
@@ -274,3 +274,10 @@ config PM_GENERIC_DOMAINS_RUNTIME
 config CPU_PM
 	bool
 	depends on SUSPEND || CPU_IDLE
+
+config SUSPEND_TIME
+       bool "Log time spent in suspend"
+       ---help---
+         Prints the time spent in suspend in the kernel log, and
+         keeps statistics on the time spent in suspend in
+         /sys/kernel/debug/suspend_time
diff --git a/kernel/power/Makefile b/kernel/power/Makefile
index 29472bf..578e20e 100644
--- a/kernel/power/Makefile
+++ b/kernel/power/Makefile
@@ -11,5 +11,6 @@ obj-$(CONFIG_HIBERNATION)	+= hibernate.o snapshot.o swap.o user.o \
 				   block_io.o
 obj-$(CONFIG_PM_AUTOSLEEP)	+= autosleep.o
 obj-$(CONFIG_PM_WAKELOCKS)	+= wakelock.o
+obj-$(CONFIG_SUSPEND_TIME)     += suspend_time.o
 
 obj-$(CONFIG_MAGIC_SYSRQ)	+= poweroff.o
diff --git a/kernel/power/suspend_time.c b/kernel/power/suspend_time.c
new file mode 100644
index 0000000..a613ede
--- /dev/null
+++ b/kernel/power/suspend_time.c
@@ -0,0 +1,111 @@
+/*
+ * debugfs file to track time spent in suspend
+ *
+ * Copyright (c) 2011, Google, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/debugfs.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/seq_file.h>
+#include <linux/syscore_ops.h>
+#include <linux/time.h>
+
+static struct timespec suspend_time_before;
+static unsigned int time_in_suspend_bins[32];
+
+#ifdef CONFIG_DEBUG_FS
+static int suspend_time_debug_show(struct seq_file *s, void *data)
+{
+	unsigned int bin;
+	seq_printf(s, "      time (secs)        count\n");
+	seq_printf(s, "------------------------------\n");
+	for (bin = 0; bin < 32; bin++) {
+		if (time_in_suspend_bins[bin] == 0)
+			continue;
+		seq_printf(s, "%10u - %-10u %4u\n",
+			bin ? 1 << (bin - 1) : 0, 1 << bin,
+				time_in_suspend_bins[bin]);
+	}
+	return 0;
+}
+
+static int suspend_time_debug_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, suspend_time_debug_show, NULL);
+}
+
+static const struct file_operations suspend_time_debug_fops = {
+	.open		= suspend_time_debug_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int __init suspend_time_debug_init(void)
+{
+	struct dentry *d;
+
+	d = debugfs_create_file("suspend_time", 0444, NULL, NULL,
+		&suspend_time_debug_fops);
+	if (!d) {
+		pr_err("Failed to create suspend_time debug file\n");
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+late_initcall(suspend_time_debug_init);
+#endif
+
+static int suspend_time_syscore_suspend(void)
+{
+	read_persistent_clock(&suspend_time_before);
+
+	return 0;
+}
+
+static void suspend_time_syscore_resume(void)
+{
+	struct timespec after;
+
+	read_persistent_clock(&after);
+
+	after = timespec_sub(after, suspend_time_before);
+
+	time_in_suspend_bins[fls(after.tv_sec)]++;
+
+	pr_info("Suspended for %lu.%03lu seconds\n", after.tv_sec,
+		after.tv_nsec / NSEC_PER_MSEC);
+}
+
+static struct syscore_ops suspend_time_syscore_ops = {
+	.suspend = suspend_time_syscore_suspend,
+	.resume = suspend_time_syscore_resume,
+};
+
+static int suspend_time_syscore_init(void)
+{
+	register_syscore_ops(&suspend_time_syscore_ops);
+
+	return 0;
+}
+
+static void suspend_time_syscore_exit(void)
+{
+	unregister_syscore_ops(&suspend_time_syscore_ops);
+}
+module_init(suspend_time_syscore_init);
+module_exit(suspend_time_syscore_exit);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] drivers: power: Add watchdog timer to catch drivers which lockup during suspend.
  2013-04-30 22:28 [RFC PATCH] drivers: power: Add watchdog timer to catch drivers which lockup during suspend Zoran Markovic
  2013-04-30 22:28 ` [RFC PATCH] power: Add option to log time spent in suspend Zoran Markovic
@ 2013-04-30 23:30 ` Greg Kroah-Hartman
  2013-05-01  3:36   ` Colin Cross
  2013-05-01  0:30 ` Pavel Machek
  2 siblings, 1 reply; 20+ messages in thread
From: Greg Kroah-Hartman @ 2013-04-30 23:30 UTC (permalink / raw)
  To: Zoran Markovic
  Cc: linux-kernel, linux-pm, Benoit Goby, Android Kernel Team,
	Colin Cross, Todd Poynor, San Mehat, John Stultz, Pavel Machek,
	Rafael J. Wysocki, Len Brown

On Tue, Apr 30, 2013 at 03:28:33PM -0700, Zoran Markovic wrote:
> From: Benoit Goby <benoit@android.com>
> 
> Below is a patch from android kernel that detects a driver suspend
> lockup and captures dump in the kernel log. Please review and provide
> comments.

There's this really cool thing called a watchdog driver that does stuff
like this :)

> Rather than hard-lock the kernel, dump the suspend thread stack and
> BUG() when a driver takes too long to suspend.  The timeout is set to
> 12 seconds to be longer than the usbhid 10 second timeout.
> 
> Exclude from the watchdog the time spent waiting for children that
> are resumed asynchronously and time every device, whether or not they
> resumed synchronously.

No, don't add a driver-core-only timer, use the existing watchdog timers
if you are worried about the kernel locking up.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] power: Add option to log time spent in suspend
  2013-04-30 22:28 ` [RFC PATCH] power: Add option to log time spent in suspend Zoran Markovic
@ 2013-05-01  0:29   ` Pavel Machek
  2013-05-01  3:29     ` Colin Cross
  0 siblings, 1 reply; 20+ messages in thread
From: Pavel Machek @ 2013-05-01  0:29 UTC (permalink / raw)
  To: Zoran Markovic
  Cc: linux-kernel, linux-pm, Colin Cross, Android Kernel Team,
	Todd Poynor, San Mehat, Benoit Goby, John Stultz,
	Rafael J. Wysocki, Len Brown

Hi!

> From: Colin Cross <ccross@android.com>
> 
> Below is a patch from android kernel that maintains a histogram of
> suspend times. Please review and provide feedback.
> 
> Prints the time spent in suspend in the kernel log, and keeps statistics
> on the time spent in suspend in /sys/kernel/debug/suspend_time
> 
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Colin Cross <ccross@android.com>
> Cc: Todd Poynor <toddpoynor@google.com>
> Cc: San Mehat <san@google.com>
> Cc: Benoit Goby <benoit@android.com>
> Cc: John Stultz <john.stultz@linaro.org>
> Cc: Pavel Machek <pavel@ucw.cz>
> Cc: Rafael J. Wysocki <rjw@sisk.pl>
> Cc: Len Brown <len.brown@intel.com>
> Signed-off-by: Colin Cross <ccross@android.com>
> Signed-off-by: Todd Poynor <toddpoynor@google.com>
> [zoran.markovic@linaro.org: Re-formatted suspend time table to better fit
> expected values, tweaked commit message]
> Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>

> +++ b/kernel/power/suspend_time.c
> @@ -0,0 +1,111 @@
> +/*
> + * debugfs file to track time spent in suspend
> + *
> + * Copyright (c) 2011, Google, Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#include <linux/debugfs.h>
> +#include <linux/err.h>
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/seq_file.h>
> +#include <linux/syscore_ops.h>
> +#include <linux/time.h>
> +
> +static struct timespec suspend_time_before;
> +static unsigned int time_in_suspend_bins[32];
> +
> +#ifdef CONFIG_DEBUG_FS
> +static int suspend_time_debug_show(struct seq_file *s, void *data)
> +{
> +	unsigned int bin;
> +	seq_printf(s, "      time (secs)        count\n");
> +	seq_printf(s, "------------------------------\n");
> +	for (bin = 0; bin < 32; bin++) {
> +		if (time_in_suspend_bins[bin] == 0)
> +			continue;
> +		seq_printf(s, "%10u - %-10u %4u\n",
> +			bin ? 1 << (bin - 1) : 0, 1 << bin,
> +				time_in_suspend_bins[bin]);
> +	}
> +	return 0;
> +}
> +
> +static int suspend_time_debug_open(struct inode *inode, struct file *file)
> +{
> +	return single_open(file, suspend_time_debug_show, NULL);
> +}

Ok, whole new driver to get timing info... As we already have times in
printk, can't existing printks() get the same info? Maybe
Documentation/ file describing how to get that info from dmesg would
be enough?
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] drivers: power: Add watchdog timer to catch drivers which lockup during suspend.
  2013-04-30 22:28 [RFC PATCH] drivers: power: Add watchdog timer to catch drivers which lockup during suspend Zoran Markovic
  2013-04-30 22:28 ` [RFC PATCH] power: Add option to log time spent in suspend Zoran Markovic
  2013-04-30 23:30 ` [RFC PATCH] drivers: power: Add watchdog timer to catch drivers which lockup during suspend Greg Kroah-Hartman
@ 2013-05-01  0:30 ` Pavel Machek
  2013-05-01  3:39   ` Colin Cross
  2 siblings, 1 reply; 20+ messages in thread
From: Pavel Machek @ 2013-05-01  0:30 UTC (permalink / raw)
  To: Zoran Markovic
  Cc: linux-kernel, linux-pm, Benoit Goby, Android Kernel Team,
	Colin Cross, Todd Poynor, San Mehat, John Stultz,
	Rafael J. Wysocki, Len Brown, Greg Kroah-Hartman

Hi!

> Below is a patch from android kernel that detects a driver suspend
> lockup and captures dump in the kernel log. Please review and provide
> comments.
> 
> Rather than hard-lock the kernel, dump the suspend thread stack and
> BUG() when a driver takes too long to suspend.  The timeout is set to
> 12 seconds to be longer than the usbhid 10 second timeout.
> 
> Exclude from the watchdog the time spent waiting for children that
> are resumed asynchronously and time every device, whether or not they
> resumed synchronously.
> 
> Cc: Android Kernel Team <kernel-team@android.com>
> Cc: Colin Cross <ccross@android.com>
> Cc: Todd Poynor <toddpoynor@google.com>
> Cc: San Mehat <san@google.com>
> Cc: Benoit Goby <benoit@android.com>
> Cc: John Stultz <john.stultz@linaro.org>
> Cc: Pavel Machek <pavel@ucw.cz>
> Cc: Rafael J. Wysocki <rjw@sisk.pl>
> Cc: Len Brown <len.brown@intel.com>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Original-author: San Mehat <san@google.com>
> Signed-off-by: Benoit Goby <benoit@android.com>
> [zoran.markovic@linaro.org: Changed printk(KERN_EMERG,...) to pr_emerg(...),
> tweaked commit message.]
> Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
> ---
>  drivers/base/power/main.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
> 
> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
> index 15beb50..eb70c0e 100644
> --- a/drivers/base/power/main.c
> +++ b/drivers/base/power/main.c
> @@ -29,6 +29,8 @@
>  #include <linux/async.h>
>  #include <linux/suspend.h>
>  #include <linux/cpuidle.h>
> +#include <linux/timer.h>
> +
>  #include "../base.h"
>  #include "power.h"
>  
> @@ -54,6 +56,12 @@ struct suspend_stats suspend_stats;
>  static DEFINE_MUTEX(dpm_list_mtx);
>  static pm_message_t pm_transition;
>  
> +static void dpm_drv_timeout(unsigned long data);
> +struct dpm_drv_wd_data {
> +	struct device *dev;
> +	struct task_struct *tsk;
> +};
> +
>  static int async_error;
>  
>  /**
> @@ -663,6 +671,30 @@ static bool is_async(struct device *dev)
>  }
>  
>  /**
> + *     dpm_drv_timeout - Driver suspend / resume watchdog handler
> + *     @data: struct device which timed out
> + *
> + *     Called when a driver has timed out suspending or resuming.
> + *     There's not much we can do here to recover so
> + *     BUG() out for a crash-dump
> + *
> + */
> +static void dpm_drv_timeout(unsigned long data)
> +{
> +	struct dpm_drv_wd_data *wd_data = (void *)data;
> +	struct device *dev = wd_data->dev;
> +	struct task_struct *tsk = wd_data->tsk;
> +
> +	pr_emerg("**** DPM device timeout: %s (%s)\n", dev_name(dev),
> +		(dev->driver ? dev->driver->name : "no driver"));
> +
> +	pr_emerg("dpm suspend stack:\n");
> +	show_stack(tsk, NULL);
> +
> +	BUG();
> +}

So you:

dump stack of the suspend task

do BUG which
   dumps stack of current task
   kills current task

Current task may very well be idle task; in such case you kill the
machine. Sounds like you should be doing something else, like kill -9
instead of BUG()?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] power: Add option to log time spent in suspend
  2013-05-01  0:29   ` Pavel Machek
@ 2013-05-01  3:29     ` Colin Cross
  2013-05-02 12:27       ` Pavel Machek
  0 siblings, 1 reply; 20+ messages in thread
From: Colin Cross @ 2013-05-01  3:29 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Zoran Markovic, lkml, Linux PM list, Android Kernel Team,
	Todd Poynor, San Mehat, Benoit Goby, John Stultz,
	Rafael J. Wysocki, Len Brown

On Tue, Apr 30, 2013 at 5:29 PM, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!
>
>> From: Colin Cross <ccross@android.com>
>>
>> Below is a patch from android kernel that maintains a histogram of
>> suspend times. Please review and provide feedback.
>>
>> Prints the time spent in suspend in the kernel log, and keeps statistics
>> on the time spent in suspend in /sys/kernel/debug/suspend_time
>>
>> Cc: Android Kernel Team <kernel-team@android.com>
>> Cc: Colin Cross <ccross@android.com>
>> Cc: Todd Poynor <toddpoynor@google.com>
>> Cc: San Mehat <san@google.com>
>> Cc: Benoit Goby <benoit@android.com>
>> Cc: John Stultz <john.stultz@linaro.org>
>> Cc: Pavel Machek <pavel@ucw.cz>
>> Cc: Rafael J. Wysocki <rjw@sisk.pl>
>> Cc: Len Brown <len.brown@intel.com>
>> Signed-off-by: Colin Cross <ccross@android.com>
>> Signed-off-by: Todd Poynor <toddpoynor@google.com>
>> [zoran.markovic@linaro.org: Re-formatted suspend time table to better fit
>> expected values, tweaked commit message]
>> Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
>
>> +++ b/kernel/power/suspend_time.c
>> @@ -0,0 +1,111 @@
>> +/*
>> + * debugfs file to track time spent in suspend
>> + *
>> + * Copyright (c) 2011, Google, Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful, but WITHOUT
>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
>> + * more details.
>> + */
>> +
>> +#include <linux/debugfs.h>
>> +#include <linux/err.h>
>> +#include <linux/init.h>
>> +#include <linux/kernel.h>
>> +#include <linux/seq_file.h>
>> +#include <linux/syscore_ops.h>
>> +#include <linux/time.h>
>> +
>> +static struct timespec suspend_time_before;
>> +static unsigned int time_in_suspend_bins[32];
>> +
>> +#ifdef CONFIG_DEBUG_FS
>> +static int suspend_time_debug_show(struct seq_file *s, void *data)
>> +{
>> +     unsigned int bin;
>> +     seq_printf(s, "      time (secs)        count\n");
>> +     seq_printf(s, "------------------------------\n");
>> +     for (bin = 0; bin < 32; bin++) {
>> +             if (time_in_suspend_bins[bin] == 0)
>> +                     continue;
>> +             seq_printf(s, "%10u - %-10u %4u\n",
>> +                     bin ? 1 << (bin - 1) : 0, 1 << bin,
>> +                             time_in_suspend_bins[bin]);
>> +     }
>> +     return 0;
>> +}
>> +
>> +static int suspend_time_debug_open(struct inode *inode, struct file *file)
>> +{
>> +     return single_open(file, suspend_time_debug_show, NULL);
>> +}
>
> Ok, whole new driver to get timing info... As we already have times in
> printk, can't existing printks() get the same info? Maybe
> Documentation/ file describing how to get that info from dmesg would
> be enough?

If this were to be merged I assume it would be rewritten directly into
kernel/power/suspend.c instead of using syscore ops.  That was just
done to ease future merges while it was out of tree.

dmesg is not the solution, it's not a stable api, it would require
constant parsing to make sure you didn't miss data before it fell out
of the ringbuffer, and sched_clock often does not (and should not)
tick during suspend, so printk times don't show the time spent
suspended.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] drivers: power: Add watchdog timer to catch drivers which lockup during suspend.
  2013-04-30 23:30 ` [RFC PATCH] drivers: power: Add watchdog timer to catch drivers which lockup during suspend Greg Kroah-Hartman
@ 2013-05-01  3:36   ` Colin Cross
  2013-05-01  4:17     ` Greg Kroah-Hartman
  0 siblings, 1 reply; 20+ messages in thread
From: Colin Cross @ 2013-05-01  3:36 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Zoran Markovic, lkml, Linux PM list, Benoit Goby,
	Android Kernel Team, Todd Poynor, San Mehat, John Stultz,
	Pavel Machek, Rafael J. Wysocki, Len Brown

On Tue, Apr 30, 2013 at 4:30 PM, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> On Tue, Apr 30, 2013 at 03:28:33PM -0700, Zoran Markovic wrote:
>> From: Benoit Goby <benoit@android.com>
>>
>> Below is a patch from android kernel that detects a driver suspend
>> lockup and captures dump in the kernel log. Please review and provide
>> comments.
>
> There's this really cool thing called a watchdog driver that does stuff
> like this :)

If the watchdog driver worked in this case this patch wouldn't exist.

>> Rather than hard-lock the kernel, dump the suspend thread stack and
>> BUG() when a driver takes too long to suspend.  The timeout is set to
>> 12 seconds to be longer than the usbhid 10 second timeout.
>>
>> Exclude from the watchdog the time spent waiting for children that
>> are resumed asynchronously and time every device, whether or not they
>> resumed synchronously.
>
> No, don't add a driver-core-only timer, use the existing watchdog timers
> if you are worried about the kernel locking up.

The watchdog timers are useless here.  For one, they generally stop
when their driver suspend op is called, so you may not even have one
running when you lock up.  More importantly, the purpose of this patch
is to tell you which driver locked up and hopefully why, and the
watchdog driver will usually result in a silent reset.  This patch
will cause a stack trace of the driver suspend op that is blocking
suspend progress, even if that call does not happen in the suspend
thread.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] drivers: power: Add watchdog timer to catch drivers which lockup during suspend.
  2013-05-01  0:30 ` Pavel Machek
@ 2013-05-01  3:39   ` Colin Cross
  2013-05-01 10:56     ` Pavel Machek
  0 siblings, 1 reply; 20+ messages in thread
From: Colin Cross @ 2013-05-01  3:39 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Zoran Markovic, lkml, Linux PM list, Benoit Goby,
	Android Kernel Team, Todd Poynor, San Mehat, John Stultz,
	Rafael J. Wysocki, Len Brown, Greg Kroah-Hartman

On Tue, Apr 30, 2013 at 5:30 PM, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!
>
>> Below is a patch from android kernel that detects a driver suspend
>> lockup and captures dump in the kernel log. Please review and provide
>> comments.
>>
>> Rather than hard-lock the kernel, dump the suspend thread stack and
>> BUG() when a driver takes too long to suspend.  The timeout is set to
>> 12 seconds to be longer than the usbhid 10 second timeout.
>>
>> Exclude from the watchdog the time spent waiting for children that
>> are resumed asynchronously and time every device, whether or not they
>> resumed synchronously.
>>
>> Cc: Android Kernel Team <kernel-team@android.com>
>> Cc: Colin Cross <ccross@android.com>
>> Cc: Todd Poynor <toddpoynor@google.com>
>> Cc: San Mehat <san@google.com>
>> Cc: Benoit Goby <benoit@android.com>
>> Cc: John Stultz <john.stultz@linaro.org>
>> Cc: Pavel Machek <pavel@ucw.cz>
>> Cc: Rafael J. Wysocki <rjw@sisk.pl>
>> Cc: Len Brown <len.brown@intel.com>
>> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>> Original-author: San Mehat <san@google.com>
>> Signed-off-by: Benoit Goby <benoit@android.com>
>> [zoran.markovic@linaro.org: Changed printk(KERN_EMERG,...) to pr_emerg(...),
>> tweaked commit message.]
>> Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>
>> ---
>>  drivers/base/power/main.c |   45 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 45 insertions(+)
>>
>> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
>> index 15beb50..eb70c0e 100644
>> --- a/drivers/base/power/main.c
>> +++ b/drivers/base/power/main.c
>> @@ -29,6 +29,8 @@
>>  #include <linux/async.h>
>>  #include <linux/suspend.h>
>>  #include <linux/cpuidle.h>
>> +#include <linux/timer.h>
>> +
>>  #include "../base.h"
>>  #include "power.h"
>>
>> @@ -54,6 +56,12 @@ struct suspend_stats suspend_stats;
>>  static DEFINE_MUTEX(dpm_list_mtx);
>>  static pm_message_t pm_transition;
>>
>> +static void dpm_drv_timeout(unsigned long data);
>> +struct dpm_drv_wd_data {
>> +     struct device *dev;
>> +     struct task_struct *tsk;
>> +};
>> +
>>  static int async_error;
>>
>>  /**
>> @@ -663,6 +671,30 @@ static bool is_async(struct device *dev)
>>  }
>>
>>  /**
>> + *     dpm_drv_timeout - Driver suspend / resume watchdog handler
>> + *     @data: struct device which timed out
>> + *
>> + *     Called when a driver has timed out suspending or resuming.
>> + *     There's not much we can do here to recover so
>> + *     BUG() out for a crash-dump
>> + *
>> + */
>> +static void dpm_drv_timeout(unsigned long data)
>> +{
>> +     struct dpm_drv_wd_data *wd_data = (void *)data;
>> +     struct device *dev = wd_data->dev;
>> +     struct task_struct *tsk = wd_data->tsk;
>> +
>> +     pr_emerg("**** DPM device timeout: %s (%s)\n", dev_name(dev),
>> +             (dev->driver ? dev->driver->name : "no driver"));
>> +
>> +     pr_emerg("dpm suspend stack:\n");
>> +     show_stack(tsk, NULL);
>> +
>> +     BUG();
>> +}
>
> So you:
>
> dump stack of the suspend task
It dumps the stack of the suspend task if the suspend callback is run
synchronously, or the async task if the suspend op is run
asynchronously.

> do BUG which
>    dumps stack of current task
>    kills current task
>
> Current task may very well be idle task; in such case you kill the
> machine. Sounds like you should be doing something else, like kill -9
> instead of BUG()?

Not much else you can do, you are stuck part way into suspend with a
driver's suspend callback half executed.  All userspace tasks are
frozen, and the suspend task is blocked indefinitely.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] drivers: power: Add watchdog timer to catch drivers which lockup during suspend.
  2013-05-01  3:36   ` Colin Cross
@ 2013-05-01  4:17     ` Greg Kroah-Hartman
  2013-05-01  4:39       ` Colin Cross
  0 siblings, 1 reply; 20+ messages in thread
From: Greg Kroah-Hartman @ 2013-05-01  4:17 UTC (permalink / raw)
  To: Colin Cross
  Cc: Zoran Markovic, lkml, Linux PM list, Benoit Goby,
	Android Kernel Team, Todd Poynor, San Mehat, John Stultz,
	Pavel Machek, Rafael J. Wysocki, Len Brown

On Tue, Apr 30, 2013 at 08:36:21PM -0700, Colin Cross wrote:
> On Tue, Apr 30, 2013 at 4:30 PM, Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> > On Tue, Apr 30, 2013 at 03:28:33PM -0700, Zoran Markovic wrote:
> >> From: Benoit Goby <benoit@android.com>
> >>
> >> Below is a patch from android kernel that detects a driver suspend
> >> lockup and captures dump in the kernel log. Please review and provide
> >> comments.
> >
> > There's this really cool thing called a watchdog driver that does stuff
> > like this :)
> 
> If the watchdog driver worked in this case this patch wouldn't exist.

Great, let's fix the watchdog timer then :)

What's wrong with it?

> >> Rather than hard-lock the kernel, dump the suspend thread stack and
> >> BUG() when a driver takes too long to suspend.  The timeout is set to
> >> 12 seconds to be longer than the usbhid 10 second timeout.
> >>
> >> Exclude from the watchdog the time spent waiting for children that
> >> are resumed asynchronously and time every device, whether or not they
> >> resumed synchronously.
> >
> > No, don't add a driver-core-only timer, use the existing watchdog timers
> > if you are worried about the kernel locking up.
> 
> The watchdog timers are useless here.  For one, they generally stop
> when their driver suspend op is called, so you may not even have one
> running when you lock up.

But you can fix that, right?

> More importantly, the purpose of this patch is to tell you which
> driver locked up and hopefully why, and the watchdog driver will
> usually result in a silent reset.

I thought it was an option as to what the watchdog does when it
triggers.

> This patch will cause a stack trace of the driver suspend op that is
> blocking suspend progress, even if that call does not happen in the
> suspend thread.

But who can see this, the machine is now dead.

greg k-h

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] drivers: power: Add watchdog timer to catch drivers which lockup during suspend.
  2013-05-01  4:17     ` Greg Kroah-Hartman
@ 2013-05-01  4:39       ` Colin Cross
       [not found]         ` <CAK7N6voYXxJKWDwSj5T9Y2fKK+Y5JqN9Wm8Qoffi9N7nRnsYhw@mail.gmail.com>
  0 siblings, 1 reply; 20+ messages in thread
From: Colin Cross @ 2013-05-01  4:39 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Zoran Markovic, lkml, Linux PM list, Benoit Goby,
	Android Kernel Team, Todd Poynor, San Mehat, John Stultz,
	Pavel Machek, Rafael J. Wysocki, Len Brown

On Tue, Apr 30, 2013 at 9:17 PM, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> On Tue, Apr 30, 2013 at 08:36:21PM -0700, Colin Cross wrote:
>> On Tue, Apr 30, 2013 at 4:30 PM, Greg Kroah-Hartman
>> <gregkh@linuxfoundation.org> wrote:
>> > On Tue, Apr 30, 2013 at 03:28:33PM -0700, Zoran Markovic wrote:
>> >> From: Benoit Goby <benoit@android.com>
>> >>
>> >> Below is a patch from android kernel that detects a driver suspend
>> >> lockup and captures dump in the kernel log. Please review and provide
>> >> comments.
>> >
>> > There's this really cool thing called a watchdog driver that does stuff
>> > like this :)
>>
>> If the watchdog driver worked in this case this patch wouldn't exist.
>
> Great, let's fix the watchdog timer then :)
>
> What's wrong with it?
>
>> >> Rather than hard-lock the kernel, dump the suspend thread stack and
>> >> BUG() when a driver takes too long to suspend.  The timeout is set to
>> >> 12 seconds to be longer than the usbhid 10 second timeout.
>> >>
>> >> Exclude from the watchdog the time spent waiting for children that
>> >> are resumed asynchronously and time every device, whether or not they
>> >> resumed synchronously.
>> >
>> > No, don't add a driver-core-only timer, use the existing watchdog timers
>> > if you are worried about the kernel locking up.
>>
>> The watchdog timers are useless here.  For one, they generally stop
>> when their driver suspend op is called, so you may not even have one
>> running when you lock up.
>
> But you can fix that, right?

Ah, you're talking about the lockup detectors, and not drivers/watchdog.

The hardlockup detector can tell you if timer interrupts are not
firing, which is unaffected by this patch since the timer wouldn't
fire any way.  The softlockup detector could eventually tell you that
tasks were not being scheduled, but not why.  Even panic on softlockup
will only get you the stack trace of the current task, which will be
the locked up task if it is spinning, but is likely to be the idle
task if the suspend task is blocked on a wait_event.  This patch will
give the stack trace of the suspend operation that is blocked, even if
it is an asynchronous suspend callback.

>> More importantly, the purpose of this patch is to tell you which
>> driver locked up and hopefully why, and the watchdog driver will
>> usually result in a silent reset.
>
> I thought it was an option as to what the watchdog does when it
> triggers.
>
>> This patch will cause a stack trace of the driver suspend op that is
>> blocking suspend progress, even if that call does not happen in the
>> suspend thread.
>
> But who can see this, the machine is now dead.

I'm not sure what might still be working in this situation on x86, but
on ARM the machine is dead anyways.  Some random subset of drivers are
suspended, so you probably have no hardware watchdog, no console, no
video.  kexec on panic, kgdb on panic, console messages saved in
pstore, or jtag are the only options I know of.  This patch is very
useful in conjunction with pstore console.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] drivers: power: Add watchdog timer to catch drivers which lockup during suspend.
       [not found]         ` <CAK7N6voYXxJKWDwSj5T9Y2fKK+Y5JqN9Wm8Qoffi9N7nRnsYhw@mail.gmail.com>
@ 2013-05-01  5:14           ` Colin Cross
  0 siblings, 0 replies; 20+ messages in thread
From: Colin Cross @ 2013-05-01  5:14 UTC (permalink / raw)
  To: anish singh
  Cc: Greg Kroah-Hartman, Zoran Markovic, lkml, Linux PM list,
	Benoit Goby, Android Kernel Team, Todd Poynor, San Mehat,
	John Stultz, Pavel Machek, Rafael J. Wysocki, Len Brown

On Tue, Apr 30, 2013 at 9:57 PM, anish singh
<anish198519851985@gmail.com> wrote:
>
>
>
> On Wed, May 1, 2013 at 10:09 AM, Colin Cross <ccross@android.com> wrote:
>>
>> On Tue, Apr 30, 2013 at 9:17 PM, Greg Kroah-Hartman
>> <gregkh@linuxfoundation.org> wrote:
>> > On Tue, Apr 30, 2013 at 08:36:21PM -0700, Colin Cross wrote:
>> >> On Tue, Apr 30, 2013 at 4:30 PM, Greg Kroah-Hartman
>> >> <gregkh@linuxfoundation.org> wrote:
>> >> > On Tue, Apr 30, 2013 at 03:28:33PM -0700, Zoran Markovic wrote:
>> >> >> From: Benoit Goby <benoit@android.com>
>> >> >>
>> >> >> Below is a patch from android kernel that detects a driver suspend
>> >> >> lockup and captures dump in the kernel log. Please review and
>> >> >> provide
>> >> >> comments.
>> >> >
>> >> > There's this really cool thing called a watchdog driver that does
>> >> > stuff
>> >> > like this :)
>> >>
>> >> If the watchdog driver worked in this case this patch wouldn't exist.
>> >
>> > Great, let's fix the watchdog timer then :)
>> >
>> > What's wrong with it?
>> >
>> >> >> Rather than hard-lock the kernel, dump the suspend thread stack and
>> >> >> BUG() when a driver takes too long to suspend.  The timeout is set
>> >> >> to
>> >> >> 12 seconds to be longer than the usbhid 10 second timeout.
>> >> >>
>> >> >> Exclude from the watchdog the time spent waiting for children that
>> >> >> are resumed asynchronously and time every device, whether or not
>> >> >> they
>> >> >> resumed synchronously.
>> >> >
>> >> > No, don't add a driver-core-only timer, use the existing watchdog
>> >> > timers
>> >> > if you are worried about the kernel locking up.
>> >>
>> >> The watchdog timers are useless here.  For one, they generally stop
>> >> when their driver suspend op is called, so you may not even have one
>> >> running when you lock up.
>> >
>> > But you can fix that, right?
>>
>> Ah, you're talking about the lockup detectors, and not drivers/watchdog.
>>
>> The hardlockup detector can tell you if timer interrupts are not
>> firing, which is unaffected by this patch since the timer wouldn't
>> fire any way.  The softlockup detector could eventually tell you that
>
> I was wondering if timers don't fire then how is your dpm_drv_timeout
> function gets called?

That's what I meant - this patch doesn't do anything if timers are not firing.

>>
>> tasks were not being scheduled, but not why.  Even panic on softlockup
>> will only get you the stack trace of the current task, which will be
>> the locked up task if it is spinning, but is likely to be the idle
>> task if the suspend task is blocked on a wait_event.  This patch will
>> give the stack trace of the suspend operation that is blocked, even if
>> it is an asynchronous suspend callback.
>
> ......but not when timers itself is not firing right?
>>
>> ...but not when ti
>> >> More importantly, the purpose of this patch is to tell you which
>> >> driver locked up and hopefully why, and the watchdog driver will
>> >> usually result in a silent reset.
>> >
>> > I thought it was an option as to what the watchdog does when it
>> > triggers.
>> >
>> >> This patch will cause a stack trace of the driver suspend op that is
>> >> blocking suspend progress, even if that call does not happen in the
>> >> suspend thread.
>> >
>> > But who can see this, the machine is now dead.
>>
>> I'm not sure what might still be working in this situation on x86, but
>> on ARM the machine is dead anyways.  Some random subset of drivers are
>> suspended, so you probably have no hardware watchdog, no console, no
>> video.  kexec on panic, kgdb on panic, console messages saved in
>> pstore, or jtag are the only options I know of.  This patch is very
>> useful in conjunction with pstore console.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] drivers: power: Add watchdog timer to catch drivers which lockup during suspend.
  2013-05-01  3:39   ` Colin Cross
@ 2013-05-01 10:56     ` Pavel Machek
  2013-05-01 16:10       ` Colin Cross
  0 siblings, 1 reply; 20+ messages in thread
From: Pavel Machek @ 2013-05-01 10:56 UTC (permalink / raw)
  To: Colin Cross
  Cc: Zoran Markovic, lkml, Linux PM list, Benoit Goby,
	Android Kernel Team, Todd Poynor, San Mehat, John Stultz,
	Rafael J. Wysocki, Len Brown, Greg Kroah-Hartman

Hi!

> >> @@ -663,6 +671,30 @@ static bool is_async(struct device *dev)
> >>  }
> >>
> >>  /**
> >> + *     dpm_drv_timeout - Driver suspend / resume watchdog handler
> >> + *     @data: struct device which timed out
> >> + *
> >> + *     Called when a driver has timed out suspending or resuming.
> >> + *     There's not much we can do here to recover so
> >> + *     BUG() out for a crash-dump
> >> + *
> >> + */
> >> +static void dpm_drv_timeout(unsigned long data)
> >> +{
> >> +     struct dpm_drv_wd_data *wd_data = (void *)data;
> >> +     struct device *dev = wd_data->dev;
> >> +     struct task_struct *tsk = wd_data->tsk;
> >> +
> >> +     pr_emerg("**** DPM device timeout: %s (%s)\n", dev_name(dev),
> >> +             (dev->driver ? dev->driver->name : "no driver"));
> >> +
> >> +     pr_emerg("dpm suspend stack:\n");
> >> +     show_stack(tsk, NULL);
> >> +
> >> +     BUG();
> >> +}
> >
> > So you:
> >
> > dump stack of the suspend task
> It dumps the stack of the suspend task if the suspend callback is run
> synchronously, or the async task if the suspend op is run
> asynchronously.

Lets call that [a]suspend task.

> > do BUG which
> >    dumps stack of current task
> >    kills current task
> >
> > Current task may very well be idle task; in such case you kill the
> > machine. Sounds like you should be doing something else, like kill -9
> > instead of BUG()?
> 
> Not much else you can do, you are stuck part way into suspend with a
> driver's suspend callback half executed.  All userspace tasks are
> frozen, and the suspend task is blocked indefinitely.

Yes, there's better option. Attempt killing the [a]suspend task,
instead of killing the current task.

Try putting mdelay(100000) into suspend path. Your patch will do the
wrong thing in that case (actually turning debuggable problem into
undebuggable one).
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] drivers: power: Add watchdog timer to catch drivers which lockup during suspend.
  2013-05-01 10:56     ` Pavel Machek
@ 2013-05-01 16:10       ` Colin Cross
  2013-05-01 16:24         ` Greg Kroah-Hartman
  2013-05-02 12:30         ` Pavel Machek
  0 siblings, 2 replies; 20+ messages in thread
From: Colin Cross @ 2013-05-01 16:10 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Zoran Markovic, lkml, Linux PM list, Benoit Goby,
	Android Kernel Team, Todd Poynor, San Mehat, John Stultz,
	Rafael J. Wysocki, Len Brown, Greg Kroah-Hartman

On Wed, May 1, 2013 at 3:56 AM, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!
>
>> >> @@ -663,6 +671,30 @@ static bool is_async(struct device *dev)
>> >>  }
>> >>
>> >>  /**
>> >> + *     dpm_drv_timeout - Driver suspend / resume watchdog handler
>> >> + *     @data: struct device which timed out
>> >> + *
>> >> + *     Called when a driver has timed out suspending or resuming.
>> >> + *     There's not much we can do here to recover so
>> >> + *     BUG() out for a crash-dump
>> >> + *
>> >> + */
>> >> +static void dpm_drv_timeout(unsigned long data)
>> >> +{
>> >> +     struct dpm_drv_wd_data *wd_data = (void *)data;
>> >> +     struct device *dev = wd_data->dev;
>> >> +     struct task_struct *tsk = wd_data->tsk;
>> >> +
>> >> +     pr_emerg("**** DPM device timeout: %s (%s)\n", dev_name(dev),
>> >> +             (dev->driver ? dev->driver->name : "no driver"));
>> >> +
>> >> +     pr_emerg("dpm suspend stack:\n");
>> >> +     show_stack(tsk, NULL);
>> >> +
>> >> +     BUG();
>> >> +}
>> >
>> > So you:
>> >
>> > dump stack of the suspend task
>> It dumps the stack of the suspend task if the suspend callback is run
>> synchronously, or the async task if the suspend op is run
>> asynchronously.
>
> Lets call that [a]suspend task.
>
>> > do BUG which
>> >    dumps stack of current task
>> >    kills current task
>> >
>> > Current task may very well be idle task; in such case you kill the
>> > machine. Sounds like you should be doing something else, like kill -9
>> > instead of BUG()?
>>
>> Not much else you can do, you are stuck part way into suspend with a
>> driver's suspend callback half executed.  All userspace tasks are
>> frozen, and the suspend task is blocked indefinitely.
>
> Yes, there's better option. Attempt killing the [a]suspend task,
> instead of killing the current task.

That will leave you in a completely undefined state.  If you just kill
the task, you are likely to kill the synchronous suspend task, which
is the task that would resume your drivers and unfreeze tasks.  That
will leave you with no userspace tasks running, and much of your
hardware suspended.  How is that a useful result?  If you somehow
respawn a resume thread to resume whatever hardware you can and
unfreeze tasks, you still have the hardware that was suspending when
it was killed in a bad state, and probably has locks held, so you're
just going to deadlock or crash soon after.

> Try putting mdelay(100000) into suspend path. Your patch will do the
> wrong thing in that case (actually turning debuggable problem into
> undebuggable one).

I'm not saying this patch as is is right for everyone (it probably at
least needs to be configurable to be turned off, change the delay, and
change the panic to just a stack trace), but from a mobile perspective
this patch is far more debuggable than without this patch.  We work
very hard to make sure that panic's are highly debuggable, in fact we
often prefer panics over any other behavior when the device is in a
bad state, because it immediately gets the user's device working again
while still giving us useful information in our automatic log
collection.

With an mdelay(100000) in the suspend path, users in our debug device
pool are likely to just pull the battery because their screen won't
turn on, in which case I get no debugging information.  With this
patch, the device will automatically reboot due to the panic, and I
will get captured logs after reboot that show a stack trace ending
with
mdelay, which tells me exactly where to look for this mdelay(100000).

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] drivers: power: Add watchdog timer to catch drivers which lockup during suspend.
  2013-05-01 16:10       ` Colin Cross
@ 2013-05-01 16:24         ` Greg Kroah-Hartman
  2013-05-02 12:30         ` Pavel Machek
  1 sibling, 0 replies; 20+ messages in thread
From: Greg Kroah-Hartman @ 2013-05-01 16:24 UTC (permalink / raw)
  To: Colin Cross
  Cc: Pavel Machek, Zoran Markovic, lkml, Linux PM list, Benoit Goby,
	Android Kernel Team, Todd Poynor, San Mehat, John Stultz,
	Rafael J. Wysocki, Len Brown

On Wed, May 01, 2013 at 09:10:49AM -0700, Colin Cross wrote:
> I'm not saying this patch as is is right for everyone (it probably at
> least needs to be configurable to be turned off, change the delay, and
> change the panic to just a stack trace),

Those changes would be nice.

> but from a mobile perspective
> this patch is far more debuggable than without this patch.  We work
> very hard to make sure that panic's are highly debuggable, in fact we
> often prefer panics over any other behavior when the device is in a
> bad state, because it immediately gets the user's device working again
> while still giving us useful information in our automatic log
> collection.
> 
> With an mdelay(100000) in the suspend path, users in our debug device
> pool are likely to just pull the battery because their screen won't
> turn on, in which case I get no debugging information.  With this
> patch, the device will automatically reboot due to the panic, and I
> will get captured logs after reboot that show a stack trace ending
> with mdelay, which tells me exactly where to look for this
> mdelay(100000).

All of that information would be _great_ to have in the changelog for
the patch, as it explains exactly why you need this.  {hint}

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] power: Add option to log time spent in suspend
  2013-05-01  3:29     ` Colin Cross
@ 2013-05-02 12:27       ` Pavel Machek
  2013-05-02 18:29         ` Colin Cross
  0 siblings, 1 reply; 20+ messages in thread
From: Pavel Machek @ 2013-05-02 12:27 UTC (permalink / raw)
  To: Colin Cross
  Cc: Zoran Markovic, lkml, Linux PM list, Android Kernel Team,
	Todd Poynor, San Mehat, Benoit Goby, John Stultz,
	Rafael J. Wysocki, Len Brown

Hi!

> >> +#include <linux/debugfs.h>
> >> +#include <linux/err.h>
> >> +#include <linux/init.h>
> >> +#include <linux/kernel.h>
> >> +#include <linux/seq_file.h>
> >> +#include <linux/syscore_ops.h>
> >> +#include <linux/time.h>
> >> +
> >> +static struct timespec suspend_time_before;
> >> +static unsigned int time_in_suspend_bins[32];
> >> +
> >> +#ifdef CONFIG_DEBUG_FS
> >> +static int suspend_time_debug_show(struct seq_file *s, void *data)
> >> +{
> >> +     unsigned int bin;
> >> +     seq_printf(s, "      time (secs)        count\n");
> >> +     seq_printf(s, "------------------------------\n");
> >> +     for (bin = 0; bin < 32; bin++) {
> >> +             if (time_in_suspend_bins[bin] == 0)
> >> +                     continue;
> >> +             seq_printf(s, "%10u - %-10u %4u\n",
> >> +                     bin ? 1 << (bin - 1) : 0, 1 << bin,
> >> +                             time_in_suspend_bins[bin]);
> >> +     }
> >> +     return 0;
> >> +}
> >> +
> >> +static int suspend_time_debug_open(struct inode *inode, struct file *file)
> >> +{
> >> +     return single_open(file, suspend_time_debug_show, NULL);
> >> +}
> >
> > Ok, whole new driver to get timing info... As we already have times in
> > printk, can't existing printks() get the same info? Maybe
> > Documentation/ file describing how to get that info from dmesg would
> > be enough?
> 
> If this were to be merged I assume it would be rewritten directly into
> kernel/power/suspend.c instead of using syscore ops.  That was just
> done to ease future merges while it was out of tree.
> 
> dmesg is not the solution, it's not a stable api, it would require
> constant parsing to make sure you didn't miss data before it fell out
> of the ringbuffer, and sched_clock often does not (and should not)
> tick during suspend, so printk times don't show the time spent
> suspended.

If timing info printed is invalid, that should be fixed.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] drivers: power: Add watchdog timer to catch drivers which lockup during suspend.
  2013-05-01 16:10       ` Colin Cross
  2013-05-01 16:24         ` Greg Kroah-Hartman
@ 2013-05-02 12:30         ` Pavel Machek
  2013-05-02 18:25           ` Colin Cross
  1 sibling, 1 reply; 20+ messages in thread
From: Pavel Machek @ 2013-05-02 12:30 UTC (permalink / raw)
  To: Colin Cross
  Cc: Zoran Markovic, lkml, Linux PM list, Benoit Goby,
	Android Kernel Team, Todd Poynor, San Mehat, John Stultz,
	Rafael J. Wysocki, Len Brown, Greg Kroah-Hartman

Hi!

> >> > do BUG which
> >> >    dumps stack of current task
> >> >    kills current task
> >> >
> >> > Current task may very well be idle task; in such case you kill the
> >> > machine. Sounds like you should be doing something else, like kill -9
> >> > instead of BUG()?
> >>
> >> Not much else you can do, you are stuck part way into suspend with a
> >> driver's suspend callback half executed.  All userspace tasks are
> >> frozen, and the suspend task is blocked indefinitely.
> >
> > Yes, there's better option. Attempt killing the [a]suspend task,
> > instead of killing the current task.
> 
> That will leave you in a completely undefined state.  If you just kill
> the task, you are likely to kill the synchronous suspend task, which
> is the task that would resume your drivers and unfreeze tasks.  That
> will leave you with no userspace tasks running, and much of your
> hardware suspended.  How is that a useful result?  If you somehow

So instead you kill random task? (BUG() from timer kills pretty much
random task, right?)

If you want to do panic(), do panic().
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] drivers: power: Add watchdog timer to catch drivers which lockup during suspend.
  2013-05-02 12:30         ` Pavel Machek
@ 2013-05-02 18:25           ` Colin Cross
  0 siblings, 0 replies; 20+ messages in thread
From: Colin Cross @ 2013-05-02 18:25 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Zoran Markovic, lkml, Linux PM list, Benoit Goby,
	Android Kernel Team, Todd Poynor, San Mehat, John Stultz,
	Rafael J. Wysocki, Len Brown, Greg Kroah-Hartman

On Thu, May 2, 2013 at 5:30 AM, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!
>
>> >> > do BUG which
>> >> >    dumps stack of current task
>> >> >    kills current task
>> >> >
>> >> > Current task may very well be idle task; in such case you kill the
>> >> > machine. Sounds like you should be doing something else, like kill -9
>> >> > instead of BUG()?
>> >>
>> >> Not much else you can do, you are stuck part way into suspend with a
>> >> driver's suspend callback half executed.  All userspace tasks are
>> >> frozen, and the suspend task is blocked indefinitely.
>> >
>> > Yes, there's better option. Attempt killing the [a]suspend task,
>> > instead of killing the current task.
>>
>> That will leave you in a completely undefined state.  If you just kill
>> the task, you are likely to kill the synchronous suspend task, which
>> is the task that would resume your drivers and unfreeze tasks.  That
>> will leave you with no userspace tasks running, and much of your
>> hardware suspended.  How is that a useful result?  If you somehow
>
> So instead you kill random task? (BUG() from timer kills pretty much
> random task, right?)
>
> If you want to do panic(), do panic().

At least on ARM a BUG() in an interrupt or softirq always results in a
panic, but this can be switched to directly call panic.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] power: Add option to log time spent in suspend
  2013-05-02 12:27       ` Pavel Machek
@ 2013-05-02 18:29         ` Colin Cross
  2013-05-02 18:58           ` John Stultz
  0 siblings, 1 reply; 20+ messages in thread
From: Colin Cross @ 2013-05-02 18:29 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Zoran Markovic, lkml, Linux PM list, Android Kernel Team,
	Todd Poynor, San Mehat, Benoit Goby, John Stultz,
	Rafael J. Wysocki, Len Brown

On Thu, May 2, 2013 at 5:27 AM, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!
>
>> >> +#include <linux/debugfs.h>
>> >> +#include <linux/err.h>
>> >> +#include <linux/init.h>
>> >> +#include <linux/kernel.h>
>> >> +#include <linux/seq_file.h>
>> >> +#include <linux/syscore_ops.h>
>> >> +#include <linux/time.h>
>> >> +
>> >> +static struct timespec suspend_time_before;
>> >> +static unsigned int time_in_suspend_bins[32];
>> >> +
>> >> +#ifdef CONFIG_DEBUG_FS
>> >> +static int suspend_time_debug_show(struct seq_file *s, void *data)
>> >> +{
>> >> +     unsigned int bin;
>> >> +     seq_printf(s, "      time (secs)        count\n");
>> >> +     seq_printf(s, "------------------------------\n");
>> >> +     for (bin = 0; bin < 32; bin++) {
>> >> +             if (time_in_suspend_bins[bin] == 0)
>> >> +                     continue;
>> >> +             seq_printf(s, "%10u - %-10u %4u\n",
>> >> +                     bin ? 1 << (bin - 1) : 0, 1 << bin,
>> >> +                             time_in_suspend_bins[bin]);
>> >> +     }
>> >> +     return 0;
>> >> +}
>> >> +
>> >> +static int suspend_time_debug_open(struct inode *inode, struct file *file)
>> >> +{
>> >> +     return single_open(file, suspend_time_debug_show, NULL);
>> >> +}
>> >
>> > Ok, whole new driver to get timing info... As we already have times in
>> > printk, can't existing printks() get the same info? Maybe
>> > Documentation/ file describing how to get that info from dmesg would
>> > be enough?
>>
>> If this were to be merged I assume it would be rewritten directly into
>> kernel/power/suspend.c instead of using syscore ops.  That was just
>> done to ease future merges while it was out of tree.
>>
>> dmesg is not the solution, it's not a stable api, it would require
>> constant parsing to make sure you didn't miss data before it fell out
>> of the ringbuffer, and sched_clock often does not (and should not)
>> tick during suspend, so printk times don't show the time spent
>> suspended.
>
> If timing info printed is invalid, that should be fixed.

It's not invalid, its just not measuring the time that is useful here.
 printk is measuring something similar to (but not exactly the same
as) CLOCK_MONOTONIC, but this is measuring something similar to
CLOCK_BOOTTIME.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] power: Add option to log time spent in suspend
  2013-05-02 18:29         ` Colin Cross
@ 2013-05-02 18:58           ` John Stultz
  2013-05-02 19:11             ` Colin Cross
  0 siblings, 1 reply; 20+ messages in thread
From: John Stultz @ 2013-05-02 18:58 UTC (permalink / raw)
  To: Colin Cross
  Cc: Pavel Machek, Zoran Markovic, lkml, Linux PM list,
	Android Kernel Team, Todd Poynor, San Mehat, Benoit Goby,
	Rafael J. Wysocki, Len Brown

On 05/02/2013 11:29 AM, Colin Cross wrote:
> On Thu, May 2, 2013 at 5:27 AM, Pavel Machek <pavel@ucw.cz> wrote:
>>
>>> dmesg is not the solution, it's not a stable api, it would require
>>> constant parsing to make sure you didn't miss data before it fell out
>>> of the ringbuffer, and sched_clock often does not (and should not)
>>> tick during suspend, so printk times don't show the time spent
>>> suspended.
>> If timing info printed is invalid, that should be fixed.
> It's not invalid, its just not measuring the time that is useful here.
>   printk is measuring something similar to (but not exactly the same
> as) CLOCK_MONOTONIC, but this is measuring something similar to
> CLOCK_BOOTTIME.

One thing I'm curious about with this patch, since we measure and keep 
track of suspend time in the timekeeping core, there may be enough 
justification to adding this sort of debugging info the timekeeping core 
itself.

Although what is being tracked with this driver isn't just total time in 
suspend, but looks like more of a logarithmic histogram of suspend 
times. Is there any context you can provide as to why that particular 
format was chosen (which would could be added to help improve the commit 
message)?

thanks
-john


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH] power: Add option to log time spent in suspend
  2013-05-02 18:58           ` John Stultz
@ 2013-05-02 19:11             ` Colin Cross
  0 siblings, 0 replies; 20+ messages in thread
From: Colin Cross @ 2013-05-02 19:11 UTC (permalink / raw)
  To: John Stultz
  Cc: Pavel Machek, Zoran Markovic, lkml, Linux PM list,
	Android Kernel Team, Todd Poynor, San Mehat, Benoit Goby,
	Rafael J. Wysocki, Len Brown

On Thu, May 2, 2013 at 11:58 AM, John Stultz <john.stultz@linaro.org> wrote:
> On 05/02/2013 11:29 AM, Colin Cross wrote:
>>
>> On Thu, May 2, 2013 at 5:27 AM, Pavel Machek <pavel@ucw.cz> wrote:
>>>
>>>
>>>> dmesg is not the solution, it's not a stable api, it would require
>>>> constant parsing to make sure you didn't miss data before it fell out
>>>> of the ringbuffer, and sched_clock often does not (and should not)
>>>> tick during suspend, so printk times don't show the time spent
>>>> suspended.
>>>
>>> If timing info printed is invalid, that should be fixed.
>>
>> It's not invalid, its just not measuring the time that is useful here.
>>   printk is measuring something similar to (but not exactly the same
>> as) CLOCK_MONOTONIC, but this is measuring something similar to
>> CLOCK_BOOTTIME.
>
>
> One thing I'm curious about with this patch, since we measure and keep track
> of suspend time in the timekeeping core, there may be enough justification
> to adding this sort of debugging info the timekeeping core itself.
>
> Although what is being tracked with this driver isn't just total time in
> suspend, but looks like more of a logarithmic histogram of suspend times. Is
> there any context you can provide as to why that particular format was
> chosen (which would could be added to help improve the commit message)?

The idea was to keep a minimal amount of data while still being able
to see how often the device is waking up.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2013-05-02 19:11 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-04-30 22:28 [RFC PATCH] drivers: power: Add watchdog timer to catch drivers which lockup during suspend Zoran Markovic
2013-04-30 22:28 ` [RFC PATCH] power: Add option to log time spent in suspend Zoran Markovic
2013-05-01  0:29   ` Pavel Machek
2013-05-01  3:29     ` Colin Cross
2013-05-02 12:27       ` Pavel Machek
2013-05-02 18:29         ` Colin Cross
2013-05-02 18:58           ` John Stultz
2013-05-02 19:11             ` Colin Cross
2013-04-30 23:30 ` [RFC PATCH] drivers: power: Add watchdog timer to catch drivers which lockup during suspend Greg Kroah-Hartman
2013-05-01  3:36   ` Colin Cross
2013-05-01  4:17     ` Greg Kroah-Hartman
2013-05-01  4:39       ` Colin Cross
     [not found]         ` <CAK7N6voYXxJKWDwSj5T9Y2fKK+Y5JqN9Wm8Qoffi9N7nRnsYhw@mail.gmail.com>
2013-05-01  5:14           ` Colin Cross
2013-05-01  0:30 ` Pavel Machek
2013-05-01  3:39   ` Colin Cross
2013-05-01 10:56     ` Pavel Machek
2013-05-01 16:10       ` Colin Cross
2013-05-01 16:24         ` Greg Kroah-Hartman
2013-05-02 12:30         ` Pavel Machek
2013-05-02 18:25           ` Colin Cross

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).