linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] watchdog: fix for lockup detector breakage on resume
@ 2012-04-27 18:10 Sameer Nanda
  2012-04-27 21:12 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Sameer Nanda @ 2012-04-27 18:10 UTC (permalink / raw)
  To: mingo, peterz, len.brown, pavel, rjw, akpm, dzickus, msb
  Cc: linux-kernel, linux-pm, olofj, Sameer Nanda

On the suspend/resume path the boot CPU does not go though an
offline->online transition.  This breaks the NMI detector
post-resume since it depends on PMU state that is lost when
the system gets suspended.

Fix this by forcing a CPU offline->online transition for the
lockup detector on the boot CPU during resume.

Signed-off-by: Sameer Nanda <snanda@chromium.org>
---
To provide more context, we enable NMI watchdog on
Chrome OS.  We have seen several reports of systems freezing
up completely which indicated that the NMI watchdog was not
firing for some reason.

Debugging further, we found a simple way of repro'ing system
freezes -- issuing the command 'tasket 1 sh -c "echo nmilockup > /proc/breakme"'
after the system has been suspended/resumed one or more times.

With this patch in place, the system freeze result in panics,
as expected.  These panics provide a nice stack trace for us
to debug the actual issue causing the freeze.


 include/linux/sched.h  |    4 ++++
 kernel/power/suspend.c |    3 +++
 kernel/watchdog.c      |   16 ++++++++++++++++
 3 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 81a173c..118cc38 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -317,6 +317,7 @@ extern int proc_dowatchdog_thresh(struct ctl_table *table, int write,
 				  size_t *lenp, loff_t *ppos);
 extern unsigned int  softlockup_panic;
 void lockup_detector_init(void);
+void lockup_detector_bootcpu_resume(void);
 #else
 static inline void touch_softlockup_watchdog(void)
 {
@@ -330,6 +331,9 @@ static inline void touch_all_softlockup_watchdogs(void)
 static inline void lockup_detector_init(void)
 {
 }
+static inline void lockup_detector_bootcpu_resume(void)
+{
+}
 #endif
 
 #ifdef CONFIG_DETECT_HUNG_TASK
diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index 396d262..0d262a8 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -177,6 +177,9 @@ static int suspend_enter(suspend_state_t state, bool *wakeup)
 	arch_suspend_enable_irqs();
 	BUG_ON(irqs_disabled());
 
+	/* Kick the lockup detector */
+	lockup_detector_bootcpu_resume();
+
  Enable_cpus:
 	enable_nonboot_cpus();
 
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index df30ee0..dd2ac93 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -585,6 +585,22 @@ static struct notifier_block __cpuinitdata cpu_nfb = {
 	.notifier_call = cpu_callback
 };
 
+void lockup_detector_bootcpu_resume(void)
+{
+	void *cpu = (void *)(long)smp_processor_id();
+
+	/*
+	 * On the suspend/resume path the boot CPU does not go though the
+	 * offline->online transition. This breaks the NMI detector post
+	 * resume. Force an offline->online transition for the boot CPU on
+	 * resume.
+	 */
+	cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
+	cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
+
+	return;
+}
+
 void __init lockup_detector_init(void)
 {
 	void *cpu = (void *)(long)smp_processor_id();
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] watchdog: fix for lockup detector breakage on resume
  2012-04-27 18:10 [PATCH] watchdog: fix for lockup detector breakage on resume Sameer Nanda
@ 2012-04-27 21:12 ` Andrew Morton
  2012-04-27 21:33   ` Rafael J. Wysocki
  2012-04-27 21:40   ` Sameer Nanda
  2012-04-30  6:12 ` Srivatsa S. Bhat
  2012-05-01 17:22 ` [PATCH v2] " Sameer Nanda
  2 siblings, 2 replies; 14+ messages in thread
From: Andrew Morton @ 2012-04-27 21:12 UTC (permalink / raw)
  To: Sameer Nanda
  Cc: mingo, peterz, len.brown, pavel, rjw, dzickus, msb, linux-kernel,
	linux-pm, olofj

On Fri, 27 Apr 2012 11:10:40 -0700
Sameer Nanda <snanda@chromium.org> wrote:

> On the suspend/resume path the boot CPU does not go though an
> offline->online transition.  This breaks the NMI detector
> post-resume since it depends on PMU state that is lost when
> the system gets suspended.
> 
> Fix this by forcing a CPU offline->online transition for the
> lockup detector on the boot CPU during resume.
> 
> Signed-off-by: Sameer Nanda <snanda@chromium.org>
> ---
> To provide more context, we enable NMI watchdog on
> Chrome OS.  We have seen several reports of systems freezing
> up completely which indicated that the NMI watchdog was not
> firing for some reason.
> 
> Debugging further, we found a simple way of repro'ing system
> freezes -- issuing the command 'tasket 1 sh -c "echo nmilockup > /proc/breakme"'
> after the system has been suspended/resumed one or more times.
> 
> With this patch in place, the system freeze result in panics,
> as expected.  These panics provide a nice stack trace for us
> to debug the actual issue causing the freeze.
> 
> ...
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -317,6 +317,7 @@ extern int proc_dowatchdog_thresh(struct ctl_table *table, int write,
>  				  size_t *lenp, loff_t *ppos);
>  extern unsigned int  softlockup_panic;
>  void lockup_detector_init(void);
> +void lockup_detector_bootcpu_resume(void);
>  #else
>  static inline void touch_softlockup_watchdog(void)
>  {
> @@ -330,6 +331,9 @@ static inline void touch_all_softlockup_watchdogs(void)
>  static inline void lockup_detector_init(void)
>  {
>  }
> +static inline void lockup_detector_bootcpu_resume(void)
> +{
> +}
>  #endif
>  
>  #ifdef CONFIG_DETECT_HUNG_TASK
> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> index 396d262..0d262a8 100644
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -177,6 +177,9 @@ static int suspend_enter(suspend_state_t state, bool *wakeup)
>  	arch_suspend_enable_irqs();
>  	BUG_ON(irqs_disabled());
>  
> +	/* Kick the lockup detector */
> +	lockup_detector_bootcpu_resume();
> +
>   Enable_cpus:
>  	enable_nonboot_cpus();
>  
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index df30ee0..dd2ac93 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -585,6 +585,22 @@ static struct notifier_block __cpuinitdata cpu_nfb = {
>  	.notifier_call = cpu_callback
>  };
>  
> +void lockup_detector_bootcpu_resume(void)
> +{
> +	void *cpu = (void *)(long)smp_processor_id();
> +
> +	/*
> +	 * On the suspend/resume path the boot CPU does not go though the
> +	 * offline->online transition. This breaks the NMI detector post
> +	 * resume. Force an offline->online transition for the boot CPU on
> +	 * resume.
> +	 */
> +	cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
> +	cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
> +
> +	return;
> +}

I have issues with the comment ;) It describes some old bug which isn't
there any more and which nobody cares about.  A better comment would
simply describe the function in the usual fashion.  Something like
this:

--- a/kernel/watchdog.c~nmi-watchdog-fix-for-lockup-detector-breakage-on-resume-fix
+++ a/kernel/watchdog.c
@@ -597,20 +597,17 @@ static struct notifier_block __cpuinitda
 	.notifier_call = cpu_callback
 };
 
+/*
+ * On entry to suspend we force an offline->online transition on the boot CPU so
+ * that PMU state is available to that CPU when it comes back online after
+ * resume.  This information is required for restarting the NMI watchdog.
+ */
 void lockup_detector_bootcpu_resume(void)
 {
 	void *cpu = (void *)(long)smp_processor_id();
 
-	/*
-	 * On the suspend/resume path the boot CPU does not go though the
-	 * offline->online transition. This breaks the NMI detector post
-	 * resume. Force an offline->online transition for the boot CPU on
-	 * resume.
-	 */
 	cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
 	cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
-
-	return;
 }
 
 void __init lockup_detector_init(void)
_


But I'm not sure how accurate it is.  Is it true that the PMU data was
required for starting the NMI hardware?


Also, this is all dead code if CONFIG_SUSPEND=n, so how about

--- a/include/linux/sched.h~nmi-watchdog-fix-for-lockup-detector-breakage-on-resume-fix-fix
+++ a/include/linux/sched.h
@@ -317,7 +317,6 @@ extern int proc_dowatchdog_thresh(struct
 				  size_t *lenp, loff_t *ppos);
 extern unsigned int  softlockup_panic;
 void lockup_detector_init(void);
-void lockup_detector_bootcpu_resume(void);
 #else
 static inline void touch_softlockup_watchdog(void)
 {
@@ -331,6 +330,11 @@ static inline void touch_all_softlockup_
 static inline void lockup_detector_init(void)
 {
 }
+#endif
+
+#if defined(CONFIG_LOCKUP_DETECTOR) && defined(CONFIG_SUSPEND)
+void lockup_detector_bootcpu_resume(void);
+#else
 static inline void lockup_detector_bootcpu_resume(void)
 {
 }
--- a/kernel/watchdog.c~nmi-watchdog-fix-for-lockup-detector-breakage-on-resume-fix-fix
+++ a/kernel/watchdog.c
@@ -597,6 +597,7 @@ static struct notifier_block __cpuinitda
 	.notifier_call = cpu_callback
 };
 
+#ifdef CONFIG_SUSPEND
 /*
  * On entry to suspend we force an offline->online transition on the boot CPU so
  * that PMU state is available to that CPU when it comes back online after
@@ -609,6 +610,7 @@ void lockup_detector_bootcpu_resume(void
 	cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
 	cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
 }
+#endif
 
 void __init lockup_detector_init(void)
 {
_


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] watchdog: fix for lockup detector breakage on resume
  2012-04-27 21:12 ` Andrew Morton
@ 2012-04-27 21:33   ` Rafael J. Wysocki
  2012-04-27 21:40   ` Sameer Nanda
  1 sibling, 0 replies; 14+ messages in thread
From: Rafael J. Wysocki @ 2012-04-27 21:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Sameer Nanda, mingo, peterz, len.brown, pavel, dzickus, msb,
	linux-kernel, linux-pm, olofj

On Friday, April 27, 2012, Andrew Morton wrote:
> On Fri, 27 Apr 2012 11:10:40 -0700
> Sameer Nanda <snanda@chromium.org> wrote:
> 
> > On the suspend/resume path the boot CPU does not go though an
> > offline->online transition.  This breaks the NMI detector
> > post-resume since it depends on PMU state that is lost when
> > the system gets suspended.
> > 
> > Fix this by forcing a CPU offline->online transition for the
> > lockup detector on the boot CPU during resume.
> > 
> > Signed-off-by: Sameer Nanda <snanda@chromium.org>
> > ---
> > To provide more context, we enable NMI watchdog on
> > Chrome OS.  We have seen several reports of systems freezing
> > up completely which indicated that the NMI watchdog was not
> > firing for some reason.
> > 
> > Debugging further, we found a simple way of repro'ing system
> > freezes -- issuing the command 'tasket 1 sh -c "echo nmilockup > /proc/breakme"'
> > after the system has been suspended/resumed one or more times.
> > 
> > With this patch in place, the system freeze result in panics,
> > as expected.  These panics provide a nice stack trace for us
> > to debug the actual issue causing the freeze.
> > 
> > ...
> >
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -317,6 +317,7 @@ extern int proc_dowatchdog_thresh(struct ctl_table *table, int write,
> >  				  size_t *lenp, loff_t *ppos);
> >  extern unsigned int  softlockup_panic;
> >  void lockup_detector_init(void);
> > +void lockup_detector_bootcpu_resume(void);
> >  #else
> >  static inline void touch_softlockup_watchdog(void)
> >  {
> > @@ -330,6 +331,9 @@ static inline void touch_all_softlockup_watchdogs(void)
> >  static inline void lockup_detector_init(void)
> >  {
> >  }
> > +static inline void lockup_detector_bootcpu_resume(void)
> > +{
> > +}
> >  #endif
> >  
> >  #ifdef CONFIG_DETECT_HUNG_TASK
> > diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> > index 396d262..0d262a8 100644
> > --- a/kernel/power/suspend.c
> > +++ b/kernel/power/suspend.c
> > @@ -177,6 +177,9 @@ static int suspend_enter(suspend_state_t state, bool *wakeup)
> >  	arch_suspend_enable_irqs();
> >  	BUG_ON(irqs_disabled());
> >  
> > +	/* Kick the lockup detector */
> > +	lockup_detector_bootcpu_resume();
> > +
> >   Enable_cpus:
> >  	enable_nonboot_cpus();
> >  
> > diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> > index df30ee0..dd2ac93 100644
> > --- a/kernel/watchdog.c
> > +++ b/kernel/watchdog.c
> > @@ -585,6 +585,22 @@ static struct notifier_block __cpuinitdata cpu_nfb = {
> >  	.notifier_call = cpu_callback
> >  };
> >  
> > +void lockup_detector_bootcpu_resume(void)
> > +{
> > +	void *cpu = (void *)(long)smp_processor_id();
> > +
> > +	/*
> > +	 * On the suspend/resume path the boot CPU does not go though the
> > +	 * offline->online transition. This breaks the NMI detector post
> > +	 * resume. Force an offline->online transition for the boot CPU on
> > +	 * resume.
> > +	 */
> > +	cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
> > +	cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
> > +
> > +	return;
> > +}
> 
> I have issues with the comment ;) It describes some old bug which isn't
> there any more and which nobody cares about.  A better comment would
> simply describe the function in the usual fashion.  Something like
> this:
> 
> --- a/kernel/watchdog.c~nmi-watchdog-fix-for-lockup-detector-breakage-on-resume-fix
> +++ a/kernel/watchdog.c
> @@ -597,20 +597,17 @@ static struct notifier_block __cpuinitda
>  	.notifier_call = cpu_callback
>  };
>  
> +/*
> + * On entry to suspend we force an offline->online transition on the boot CPU so
> + * that PMU state is available to that CPU when it comes back online after
> + * resume.  This information is required for restarting the NMI watchdog.
> + */
>  void lockup_detector_bootcpu_resume(void)
>  {
>  	void *cpu = (void *)(long)smp_processor_id();
>  
> -	/*
> -	 * On the suspend/resume path the boot CPU does not go though the
> -	 * offline->online transition. This breaks the NMI detector post
> -	 * resume. Force an offline->online transition for the boot CPU on
> -	 * resume.
> -	 */
>  	cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
>  	cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
> -
> -	return;
>  }
>  
>  void __init lockup_detector_init(void)
> _
> 
> 
> But I'm not sure how accurate it is.  Is it true that the PMU data was
> required for starting the NMI hardware?
> 
> 
> Also, this is all dead code if CONFIG_SUSPEND=n, so how about

FWIW, looks good to me.

Thanks,
Rafael


> --- a/include/linux/sched.h~nmi-watchdog-fix-for-lockup-detector-breakage-on-resume-fix-fix
> +++ a/include/linux/sched.h
> @@ -317,7 +317,6 @@ extern int proc_dowatchdog_thresh(struct
>  				  size_t *lenp, loff_t *ppos);
>  extern unsigned int  softlockup_panic;
>  void lockup_detector_init(void);
> -void lockup_detector_bootcpu_resume(void);
>  #else
>  static inline void touch_softlockup_watchdog(void)
>  {
> @@ -331,6 +330,11 @@ static inline void touch_all_softlockup_
>  static inline void lockup_detector_init(void)
>  {
>  }
> +#endif
> +
> +#if defined(CONFIG_LOCKUP_DETECTOR) && defined(CONFIG_SUSPEND)
> +void lockup_detector_bootcpu_resume(void);
> +#else
>  static inline void lockup_detector_bootcpu_resume(void)
>  {
>  }
> --- a/kernel/watchdog.c~nmi-watchdog-fix-for-lockup-detector-breakage-on-resume-fix-fix
> +++ a/kernel/watchdog.c
> @@ -597,6 +597,7 @@ static struct notifier_block __cpuinitda
>  	.notifier_call = cpu_callback
>  };
>  
> +#ifdef CONFIG_SUSPEND
>  /*
>   * On entry to suspend we force an offline->online transition on the boot CPU so
>   * that PMU state is available to that CPU when it comes back online after
> @@ -609,6 +610,7 @@ void lockup_detector_bootcpu_resume(void
>  	cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
>  	cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
>  }
> +#endif
>  
>  void __init lockup_detector_init(void)
>  {
> _
> 
> 
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] watchdog: fix for lockup detector breakage on resume
  2012-04-27 21:12 ` Andrew Morton
  2012-04-27 21:33   ` Rafael J. Wysocki
@ 2012-04-27 21:40   ` Sameer Nanda
  2012-04-27 22:03     ` Andrew Morton
  1 sibling, 1 reply; 14+ messages in thread
From: Sameer Nanda @ 2012-04-27 21:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mingo, peterz, len.brown, pavel, rjw, dzickus, msb, linux-kernel,
	linux-pm, olofj

On Fri, Apr 27, 2012 at 2:12 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Fri, 27 Apr 2012 11:10:40 -0700
> Sameer Nanda <snanda@chromium.org> wrote:
>
>> On the suspend/resume path the boot CPU does not go though an
>> offline->online transition.  This breaks the NMI detector
>> post-resume since it depends on PMU state that is lost when
>> the system gets suspended.
>>
>> Fix this by forcing a CPU offline->online transition for the
>> lockup detector on the boot CPU during resume.
>>
>> Signed-off-by: Sameer Nanda <snanda@chromium.org>
>> ---
>> To provide more context, we enable NMI watchdog on
>> Chrome OS.  We have seen several reports of systems freezing
>> up completely which indicated that the NMI watchdog was not
>> firing for some reason.
>>
>> Debugging further, we found a simple way of repro'ing system
>> freezes -- issuing the command 'tasket 1 sh -c "echo nmilockup > /proc/breakme"'
>> after the system has been suspended/resumed one or more times.
>>
>> With this patch in place, the system freeze result in panics,
>> as expected.  These panics provide a nice stack trace for us
>> to debug the actual issue causing the freeze.
>>
>> ...
>>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -317,6 +317,7 @@ extern int proc_dowatchdog_thresh(struct ctl_table *table, int write,
>>                                 size_t *lenp, loff_t *ppos);
>>  extern unsigned int  softlockup_panic;
>>  void lockup_detector_init(void);
>> +void lockup_detector_bootcpu_resume(void);
>>  #else
>>  static inline void touch_softlockup_watchdog(void)
>>  {
>> @@ -330,6 +331,9 @@ static inline void touch_all_softlockup_watchdogs(void)
>>  static inline void lockup_detector_init(void)
>>  {
>>  }
>> +static inline void lockup_detector_bootcpu_resume(void)
>> +{
>> +}
>>  #endif
>>
>>  #ifdef CONFIG_DETECT_HUNG_TASK
>> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
>> index 396d262..0d262a8 100644
>> --- a/kernel/power/suspend.c
>> +++ b/kernel/power/suspend.c
>> @@ -177,6 +177,9 @@ static int suspend_enter(suspend_state_t state, bool *wakeup)
>>       arch_suspend_enable_irqs();
>>       BUG_ON(irqs_disabled());
>>
>> +     /* Kick the lockup detector */
>> +     lockup_detector_bootcpu_resume();
>> +
>>   Enable_cpus:
>>       enable_nonboot_cpus();
>>
>> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>> index df30ee0..dd2ac93 100644
>> --- a/kernel/watchdog.c
>> +++ b/kernel/watchdog.c
>> @@ -585,6 +585,22 @@ static struct notifier_block __cpuinitdata cpu_nfb = {
>>       .notifier_call = cpu_callback
>>  };
>>
>> +void lockup_detector_bootcpu_resume(void)
>> +{
>> +     void *cpu = (void *)(long)smp_processor_id();
>> +
>> +     /*
>> +      * On the suspend/resume path the boot CPU does not go though the
>> +      * offline->online transition. This breaks the NMI detector post
>> +      * resume. Force an offline->online transition for the boot CPU on
>> +      * resume.
>> +      */
>> +     cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
>> +     cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
>> +
>> +     return;
>> +}
>
> I have issues with the comment ;) It describes some old bug which isn't
> there any more and which nobody cares about.  A better comment would
> simply describe the function in the usual fashion.  Something like
> this:
>
> --- a/kernel/watchdog.c~nmi-watchdog-fix-for-lockup-detector-breakage-on-resume-fix
> +++ a/kernel/watchdog.c
> @@ -597,20 +597,17 @@ static struct notifier_block __cpuinitda
>        .notifier_call = cpu_callback
>  };
>
> +/*
> + * On entry to suspend we force an offline->online transition on the boot CPU so
> + * that PMU state is available to that CPU when it comes back online after
> + * resume.  This information is required for restarting the NMI watchdog.

This call actually happens on "exit from suspend" or "entry into
resume" processing so
how about something like:

On exit from suspend we force an offline->online transition on the boot CPU so
that the PMU state that was lost while in suspended state gets set up properly
for the boot CPU.  This information is required for restarting the NMI watchdog.

> + */
>  void lockup_detector_bootcpu_resume(void)
>  {
>        void *cpu = (void *)(long)smp_processor_id();
>
> -       /*
> -        * On the suspend/resume path the boot CPU does not go though the
> -        * offline->online transition. This breaks the NMI detector post
> -        * resume. Force an offline->online transition for the boot CPU on
> -        * resume.
> -        */
>        cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
>        cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
> -
> -       return;
>  }
>
>  void __init lockup_detector_init(void)
> _
>
>
> But I'm not sure how accurate it is.  Is it true that the PMU data was
> required for starting the NMI hardware?
>
>
> Also, this is all dead code if CONFIG_SUSPEND=n, so how about
>
> --- a/include/linux/sched.h~nmi-watchdog-fix-for-lockup-detector-breakage-on-resume-fix-fix
> +++ a/include/linux/sched.h
> @@ -317,7 +317,6 @@ extern int proc_dowatchdog_thresh(struct
>                                  size_t *lenp, loff_t *ppos);
>  extern unsigned int  softlockup_panic;
>  void lockup_detector_init(void);
> -void lockup_detector_bootcpu_resume(void);
>  #else
>  static inline void touch_softlockup_watchdog(void)
>  {
> @@ -331,6 +330,11 @@ static inline void touch_all_softlockup_
>  static inline void lockup_detector_init(void)
>  {
>  }
> +#endif
> +
> +#if defined(CONFIG_LOCKUP_DETECTOR) && defined(CONFIG_SUSPEND)
> +void lockup_detector_bootcpu_resume(void);
> +#else
>  static inline void lockup_detector_bootcpu_resume(void)
>  {
>  }
> --- a/kernel/watchdog.c~nmi-watchdog-fix-for-lockup-detector-breakage-on-resume-fix-fix
> +++ a/kernel/watchdog.c
> @@ -597,6 +597,7 @@ static struct notifier_block __cpuinitda
>        .notifier_call = cpu_callback
>  };
>
> +#ifdef CONFIG_SUSPEND
>  /*
>  * On entry to suspend we force an offline->online transition on the boot CPU so
>  * that PMU state is available to that CPU when it comes back online after
> @@ -609,6 +610,7 @@ void lockup_detector_bootcpu_resume(void
>        cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
>        cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
>  }
> +#endif
>
>  void __init lockup_detector_init(void)
>  {
> _
>



-- 
Sameer

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] watchdog: fix for lockup detector breakage on resume
  2012-04-27 21:40   ` Sameer Nanda
@ 2012-04-27 22:03     ` Andrew Morton
  2012-04-27 22:20       ` Sameer Nanda
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2012-04-27 22:03 UTC (permalink / raw)
  To: Sameer Nanda
  Cc: mingo, peterz, len.brown, pavel, rjw, dzickus, msb, linux-kernel,
	linux-pm, olofj

On Fri, 27 Apr 2012 14:40:01 -0700
Sameer Nanda <snanda@chromium.org> wrote:

> On Fri, Apr 27, 2012 at 2:12 PM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> 
> This call actually happens on "exit from suspend" or "entry into
> resume" processing so
> how about something like:
> 
> On exit from suspend we force an offline->online transition on the boot CPU so
> that the PMU state that was lost while in suspended state gets set up properly
> for the boot CPU.  This information is required for restarting the NMI watchdog.

OK, let's use that.

> > Also, this is all dead code if CONFIG_SUSPEND=n, so how about
> >

You had no comments about this?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] watchdog: fix for lockup detector breakage on resume
  2012-04-27 22:03     ` Andrew Morton
@ 2012-04-27 22:20       ` Sameer Nanda
  0 siblings, 0 replies; 14+ messages in thread
From: Sameer Nanda @ 2012-04-27 22:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mingo, peterz, len.brown, pavel, rjw, dzickus, msb, linux-kernel,
	linux-pm, olofj

On Fri, Apr 27, 2012 at 3:03 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Fri, 27 Apr 2012 14:40:01 -0700
> Sameer Nanda <snanda@chromium.org> wrote:
>
>> On Fri, Apr 27, 2012 at 2:12 PM, Andrew Morton
>> <akpm@linux-foundation.org> wrote:
>>
>> This call actually happens on "exit from suspend" or "entry into
>> resume" processing so
>> how about something like:
>>
>> On exit from suspend we force an offline->online transition on the boot CPU so
>> that the PMU state that was lost while in suspended state gets set up properly
>> for the boot CPU.  This information is required for restarting the NMI watchdog.
>
> OK, let's use that.
>
>> > Also, this is all dead code if CONFIG_SUSPEND=n, so how about
>> >
>
> You had no comments about this?
Looks good to me.

Thanks for including this in the mm tree!

-- 
Sameer

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] watchdog: fix for lockup detector breakage on resume
  2012-04-27 18:10 [PATCH] watchdog: fix for lockup detector breakage on resume Sameer Nanda
  2012-04-27 21:12 ` Andrew Morton
@ 2012-04-30  6:12 ` Srivatsa S. Bhat
  2012-04-30 13:05   ` Don Zickus
  2012-04-30 21:10   ` Sameer Nanda
  2012-05-01 17:22 ` [PATCH v2] " Sameer Nanda
  2 siblings, 2 replies; 14+ messages in thread
From: Srivatsa S. Bhat @ 2012-04-30  6:12 UTC (permalink / raw)
  To: Sameer Nanda
  Cc: mingo, peterz, len.brown, pavel, rjw, akpm, dzickus, msb,
	linux-kernel, linux-pm, olofj

On 04/27/2012 11:40 PM, Sameer Nanda wrote:

> On the suspend/resume path the boot CPU does not go though an
> offline->online transition.  This breaks the NMI detector
> post-resume since it depends on PMU state that is lost when
> the system gets suspended.
> 
> Fix this by forcing a CPU offline->online transition for the
> lockup detector on the boot CPU during resume.
> 
> Signed-off-by: Sameer Nanda <snanda@chromium.org>
> ---
> To provide more context, we enable NMI watchdog on
> Chrome OS.  We have seen several reports of systems freezing
> up completely which indicated that the NMI watchdog was not
> firing for some reason.
> 
> Debugging further, we found a simple way of repro'ing system
> freezes -- issuing the command 'tasket 1 sh -c "echo nmilockup > /proc/breakme"'
> after the system has been suspended/resumed one or more times.
> 
> With this patch in place, the system freeze result in panics,
> as expected.  These panics provide a nice stack trace for us
> to debug the actual issue causing the freeze.
> 
> 
>  include/linux/sched.h  |    4 ++++
>  kernel/power/suspend.c |    3 +++
>  kernel/watchdog.c      |   16 ++++++++++++++++
>  3 files changed, 23 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 81a173c..118cc38 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -317,6 +317,7 @@ extern int proc_dowatchdog_thresh(struct ctl_table *table, int write,
>  				  size_t *lenp, loff_t *ppos);
>  extern unsigned int  softlockup_panic;
>  void lockup_detector_init(void);
> +void lockup_detector_bootcpu_resume(void);
>  #else
>  static inline void touch_softlockup_watchdog(void)
>  {
> @@ -330,6 +331,9 @@ static inline void touch_all_softlockup_watchdogs(void)
>  static inline void lockup_detector_init(void)
>  {
>  }
> +static inline void lockup_detector_bootcpu_resume(void)
> +{
> +}
>  #endif
> 
>  #ifdef CONFIG_DETECT_HUNG_TASK
> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> index 396d262..0d262a8 100644
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -177,6 +177,9 @@ static int suspend_enter(suspend_state_t state, bool *wakeup)
>  	arch_suspend_enable_irqs();
>  	BUG_ON(irqs_disabled());
> 
> +	/* Kick the lockup detector */
> +	lockup_detector_bootcpu_resume();
> +
>   Enable_cpus:
>  	enable_nonboot_cpus();
> 
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index df30ee0..dd2ac93 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -585,6 +585,22 @@ static struct notifier_block __cpuinitdata cpu_nfb = {
>  	.notifier_call = cpu_callback
>  };
> 
> +void lockup_detector_bootcpu_resume(void)
> +{
> +	void *cpu = (void *)(long)smp_processor_id();
> +
> +	/*
> +	 * On the suspend/resume path the boot CPU does not go though the
> +	 * offline->online transition. This breaks the NMI detector post
> +	 * resume. Force an offline->online transition for the boot CPU on
> +	 * resume.
> +	 */
> +	cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
> +	cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
> +


I have a couple of comments about this:

1. Strictly speaking, we should be using the _FROZEN variants here (since the
tasks are still frozen).

Like, cpu_callback(&cpu_nfb, CPU_DEAD_FROZEN, cpu);
and   cpu_callback(&cpu_nfb, CPU_ONLINE_FROZEN, cpu);

Right now, since the same action is taken for either variant (ie., with or without
_FROZEN), it really doesn't matter. But still, good to be on the safer side no?

2. Why are we skipping the CPU_UP_PREPARE_FROZEN callback?

3. How about hibernation? We don't hit this problem there?

> +	return;
> +}
> +
>  void __init lockup_detector_init(void)
>  {
>  	void *cpu = (void *)(long)smp_processor_id();



Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] watchdog: fix for lockup detector breakage on resume
  2012-04-30  6:12 ` Srivatsa S. Bhat
@ 2012-04-30 13:05   ` Don Zickus
  2012-04-30 21:10   ` Sameer Nanda
  1 sibling, 0 replies; 14+ messages in thread
From: Don Zickus @ 2012-04-30 13:05 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Sameer Nanda, mingo, peterz, len.brown, pavel, rjw, akpm, msb,
	linux-kernel, linux-pm, olofj

On Mon, Apr 30, 2012 at 11:42:13AM +0530, Srivatsa S. Bhat wrote:
> > +void lockup_detector_bootcpu_resume(void)
> > +{
> > +	void *cpu = (void *)(long)smp_processor_id();
> > +
> > +	/*
> > +	 * On the suspend/resume path the boot CPU does not go though the
> > +	 * offline->online transition. This breaks the NMI detector post
> > +	 * resume. Force an offline->online transition for the boot CPU on
> > +	 * resume.
> > +	 */
> > +	cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
> > +	cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
> > +
> 
> 
> I have a couple of comments about this:
> 
> 1. Strictly speaking, we should be using the _FROZEN variants here (since the
> tasks are still frozen).
> 
> Like, cpu_callback(&cpu_nfb, CPU_DEAD_FROZEN, cpu);
> and   cpu_callback(&cpu_nfb, CPU_ONLINE_FROZEN, cpu);
> 
> Right now, since the same action is taken for either variant (ie., with or without
> _FROZEN), it really doesn't matter. But still, good to be on the safer side no?
> 
> 2. Why are we skipping the CPU_UP_PREPARE_FROZEN callback?
> 
> 3. How about hibernation? We don't hit this problem there?

Hi,

I have similar concerns as this patch seems kinda like a hack.  OTOH I
don't know all the available hooks for the suspend/resume paths.  I would
have assumed there was a special case call for the boot cpu to shutdown or
at least disable its services.  Wouldn't a lot of other tasks run into
similar problems as the watchdog?  I don't think the watchdog does
anything special that requires a special hook into the suspend path.

What do other hardware timers do on the suspend path?

Cheers,
Don

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] watchdog: fix for lockup detector breakage on resume
  2012-04-30  6:12 ` Srivatsa S. Bhat
  2012-04-30 13:05   ` Don Zickus
@ 2012-04-30 21:10   ` Sameer Nanda
  2012-05-01 17:25     ` Sameer Nanda
  2012-05-02 13:14     ` Srivatsa S. Bhat
  1 sibling, 2 replies; 14+ messages in thread
From: Sameer Nanda @ 2012-04-30 21:10 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: mingo, peterz, len.brown, pavel, rjw, akpm, dzickus, msb,
	linux-kernel, linux-pm, olofj

On Sun, Apr 29, 2012 at 11:12 PM, Srivatsa S. Bhat
<srivatsa.bhat@linux.vnet.ibm.com> wrote:
> On 04/27/2012 11:40 PM, Sameer Nanda wrote:
>
>> On the suspend/resume path the boot CPU does not go though an
>> offline->online transition.  This breaks the NMI detector
>> post-resume since it depends on PMU state that is lost when
>> the system gets suspended.
>>
>> Fix this by forcing a CPU offline->online transition for the
>> lockup detector on the boot CPU during resume.
>>
>> Signed-off-by: Sameer Nanda <snanda@chromium.org>
>> ---
>> To provide more context, we enable NMI watchdog on
>> Chrome OS.  We have seen several reports of systems freezing
>> up completely which indicated that the NMI watchdog was not
>> firing for some reason.
>>
>> Debugging further, we found a simple way of repro'ing system
>> freezes -- issuing the command 'tasket 1 sh -c "echo nmilockup > /proc/breakme"'
>> after the system has been suspended/resumed one or more times.
>>
>> With this patch in place, the system freeze result in panics,
>> as expected.  These panics provide a nice stack trace for us
>> to debug the actual issue causing the freeze.
>>
>>
>>  include/linux/sched.h  |    4 ++++
>>  kernel/power/suspend.c |    3 +++
>>  kernel/watchdog.c      |   16 ++++++++++++++++
>>  3 files changed, 23 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 81a173c..118cc38 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -317,6 +317,7 @@ extern int proc_dowatchdog_thresh(struct ctl_table *table, int write,
>>                                 size_t *lenp, loff_t *ppos);
>>  extern unsigned int  softlockup_panic;
>>  void lockup_detector_init(void);
>> +void lockup_detector_bootcpu_resume(void);
>>  #else
>>  static inline void touch_softlockup_watchdog(void)
>>  {
>> @@ -330,6 +331,9 @@ static inline void touch_all_softlockup_watchdogs(void)
>>  static inline void lockup_detector_init(void)
>>  {
>>  }
>> +static inline void lockup_detector_bootcpu_resume(void)
>> +{
>> +}
>>  #endif
>>
>>  #ifdef CONFIG_DETECT_HUNG_TASK
>> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
>> index 396d262..0d262a8 100644
>> --- a/kernel/power/suspend.c
>> +++ b/kernel/power/suspend.c
>> @@ -177,6 +177,9 @@ static int suspend_enter(suspend_state_t state, bool *wakeup)
>>       arch_suspend_enable_irqs();
>>       BUG_ON(irqs_disabled());
>>
>> +     /* Kick the lockup detector */
>> +     lockup_detector_bootcpu_resume();
>> +
>>   Enable_cpus:
>>       enable_nonboot_cpus();
>>
>> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>> index df30ee0..dd2ac93 100644
>> --- a/kernel/watchdog.c
>> +++ b/kernel/watchdog.c
>> @@ -585,6 +585,22 @@ static struct notifier_block __cpuinitdata cpu_nfb = {
>>       .notifier_call = cpu_callback
>>  };
>>
>> +void lockup_detector_bootcpu_resume(void)
>> +{
>> +     void *cpu = (void *)(long)smp_processor_id();
>> +
>> +     /*
>> +      * On the suspend/resume path the boot CPU does not go though the
>> +      * offline->online transition. This breaks the NMI detector post
>> +      * resume. Force an offline->online transition for the boot CPU on
>> +      * resume.
>> +      */
>> +     cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
>> +     cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
>> +
>
>
> I have a couple of comments about this:
>
> 1. Strictly speaking, we should be using the _FROZEN variants here (since the
> tasks are still frozen).
>
> Like, cpu_callback(&cpu_nfb, CPU_DEAD_FROZEN, cpu);
> and   cpu_callback(&cpu_nfb, CPU_ONLINE_FROZEN, cpu);
>
> Right now, since the same action is taken for either variant (ie., with or without
> _FROZEN), it really doesn't matter. But still, good to be on the safer side no?

Agreed that the _FROZEN counterparts are a better fit here since the
tasks are still frozen.  Let me make this change.

>
> 2. Why are we skipping the CPU_UP_PREPARE_FROZEN callback?

Mainly because the hrtimer_init has already been done at kernel init
time.  But, this seems to be a good idea since the non-boot CPUs do
transition through the CPU_UP_PREPARE_FROZEN phase on the way up
during resume so it makes sense to keep the boot CPU path symmetrical.

Let me make this change also.

>
> 3. How about hibernation? We don't hit this problem there?

I am not too familiar with hibernation path and don't have a setup to
test it either so can't really answer this one.

>
>> +     return;
>> +}
>> +
>>  void __init lockup_detector_init(void)
>>  {
>>       void *cpu = (void *)(long)smp_processor_id();
>
>
>
> Regards,
> Srivatsa S. Bhat
>



-- 
Sameer

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2] watchdog: fix for lockup detector breakage on resume
  2012-04-27 18:10 [PATCH] watchdog: fix for lockup detector breakage on resume Sameer Nanda
  2012-04-27 21:12 ` Andrew Morton
  2012-04-30  6:12 ` Srivatsa S. Bhat
@ 2012-05-01 17:22 ` Sameer Nanda
  2012-05-07  3:24   ` Anshuman Khandual
  2 siblings, 1 reply; 14+ messages in thread
From: Sameer Nanda @ 2012-05-01 17:22 UTC (permalink / raw)
  To: mingo, peterz, len.brown, pavel, rjw, akpm, dzickus, msb
  Cc: linux-kernel, linux-pm, olofj, Sameer Nanda, Srivatsa S. Bhat

On the suspend/resume path the boot CPU does not go though an
offline->online transition.  This breaks the NMI detector
post-resume since it depends on PMU state that is lost when
the system gets suspended.

Fix this by forcing a CPU offline->online transition for the
lockup detector on the boot CPU during resume.

Signed-off-by: Sameer Nanda <snanda@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

To provide more context, we enable NMI watchdog on
Chrome OS.  We have seen several reports of systems freezing
up completely which indicated that the NMI watchdog was not
firing for some reason.

Debugging further, we found a simple way of repro'ing system
freezes -- issuing the command 'tasket 1 sh -c "echo nmilockup > /proc/breakme"'
after the system has been suspended/resumed one or more times.

With this patch in place, the system freeze result in panics,
as expected.  These panics provide a nice stack trace for us
to debug the actual issue causing the freeze.

 include/linux/sched.h  |    8 ++++++++
 kernel/power/suspend.c |    3 +++
 kernel/watchdog.c      |   17 +++++++++++++++++
 3 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 81a173c..44f6046 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -332,6 +332,14 @@ static inline void lockup_detector_init(void)
 }
 #endif
 
+#if defined(CONFIG_LOCKUP_DETECTOR) && defined(CONFIG_SUSPEND)
+void lockup_detector_bootcpu_resume(void);
+#else
+static inline void lockup_detector_bootcpu_resume(void)
+{
+}
+#endif
+
 #ifdef CONFIG_DETECT_HUNG_TASK
 extern unsigned int  sysctl_hung_task_panic;
 extern unsigned long sysctl_hung_task_check_count;
diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index 396d262..0d262a8 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -177,6 +177,9 @@ static int suspend_enter(suspend_state_t state, bool *wakeup)
 	arch_suspend_enable_irqs();
 	BUG_ON(irqs_disabled());
 
+	/* Kick the lockup detector */
+	lockup_detector_bootcpu_resume();
+
  Enable_cpus:
 	enable_nonboot_cpus();
 
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index df30ee0..dae2482 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -585,6 +585,23 @@ static struct notifier_block __cpuinitdata cpu_nfb = {
 	.notifier_call = cpu_callback
 };
 
+#ifdef CONFIG_SUSPEND
+/*
+ * On exit from suspend we force an offline->online transition on the boot CPU
+ * so that the PMU state that was lost while in suspended state gets set up
+ * properly for the boot CPU.  This information is required for restarting the
+ * NMI watchdog.
+ */
+void lockup_detector_bootcpu_resume(void)
+{
+	void *cpu = (void *)(long)smp_processor_id();
+
+	cpu_callback(&cpu_nfb, CPU_DEAD_FROZEN, cpu);
+	cpu_callback(&cpu_nfb, CPU_UP_PREPARE_FROZEN, cpu);
+	cpu_callback(&cpu_nfb, CPU_ONLINE_FROZEN, cpu);
+}
+#endif
+
 void __init lockup_detector_init(void)
 {
 	void *cpu = (void *)(long)smp_processor_id();
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] watchdog: fix for lockup detector breakage on resume
  2012-04-30 21:10   ` Sameer Nanda
@ 2012-05-01 17:25     ` Sameer Nanda
  2012-05-02 13:14     ` Srivatsa S. Bhat
  1 sibling, 0 replies; 14+ messages in thread
From: Sameer Nanda @ 2012-05-01 17:25 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: mingo, peterz, len.brown, pavel, rjw, akpm, dzickus, msb,
	linux-kernel, linux-pm, olofj

On Mon, Apr 30, 2012 at 2:10 PM, Sameer Nanda <snanda@chromium.org> wrote:
> On Sun, Apr 29, 2012 at 11:12 PM, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> On 04/27/2012 11:40 PM, Sameer Nanda wrote:
>>
>>> On the suspend/resume path the boot CPU does not go though an
>>> offline->online transition.  This breaks the NMI detector
>>> post-resume since it depends on PMU state that is lost when
>>> the system gets suspended.
>>>
>>> Fix this by forcing a CPU offline->online transition for the
>>> lockup detector on the boot CPU during resume.
>>>
>>> Signed-off-by: Sameer Nanda <snanda@chromium.org>
>>> ---
>>> To provide more context, we enable NMI watchdog on
>>> Chrome OS.  We have seen several reports of systems freezing
>>> up completely which indicated that the NMI watchdog was not
>>> firing for some reason.
>>>
>>> Debugging further, we found a simple way of repro'ing system
>>> freezes -- issuing the command 'tasket 1 sh -c "echo nmilockup > /proc/breakme"'
>>> after the system has been suspended/resumed one or more times.
>>>
>>> With this patch in place, the system freeze result in panics,
>>> as expected.  These panics provide a nice stack trace for us
>>> to debug the actual issue causing the freeze.
>>>
>>>
>>>  include/linux/sched.h  |    4 ++++
>>>  kernel/power/suspend.c |    3 +++
>>>  kernel/watchdog.c      |   16 ++++++++++++++++
>>>  3 files changed, 23 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index 81a173c..118cc38 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -317,6 +317,7 @@ extern int proc_dowatchdog_thresh(struct ctl_table *table, int write,
>>>                                 size_t *lenp, loff_t *ppos);
>>>  extern unsigned int  softlockup_panic;
>>>  void lockup_detector_init(void);
>>> +void lockup_detector_bootcpu_resume(void);
>>>  #else
>>>  static inline void touch_softlockup_watchdog(void)
>>>  {
>>> @@ -330,6 +331,9 @@ static inline void touch_all_softlockup_watchdogs(void)
>>>  static inline void lockup_detector_init(void)
>>>  {
>>>  }
>>> +static inline void lockup_detector_bootcpu_resume(void)
>>> +{
>>> +}
>>>  #endif
>>>
>>>  #ifdef CONFIG_DETECT_HUNG_TASK
>>> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
>>> index 396d262..0d262a8 100644
>>> --- a/kernel/power/suspend.c
>>> +++ b/kernel/power/suspend.c
>>> @@ -177,6 +177,9 @@ static int suspend_enter(suspend_state_t state, bool *wakeup)
>>>       arch_suspend_enable_irqs();
>>>       BUG_ON(irqs_disabled());
>>>
>>> +     /* Kick the lockup detector */
>>> +     lockup_detector_bootcpu_resume();
>>> +
>>>   Enable_cpus:
>>>       enable_nonboot_cpus();
>>>
>>> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>>> index df30ee0..dd2ac93 100644
>>> --- a/kernel/watchdog.c
>>> +++ b/kernel/watchdog.c
>>> @@ -585,6 +585,22 @@ static struct notifier_block __cpuinitdata cpu_nfb = {
>>>       .notifier_call = cpu_callback
>>>  };
>>>
>>> +void lockup_detector_bootcpu_resume(void)
>>> +{
>>> +     void *cpu = (void *)(long)smp_processor_id();
>>> +
>>> +     /*
>>> +      * On the suspend/resume path the boot CPU does not go though the
>>> +      * offline->online transition. This breaks the NMI detector post
>>> +      * resume. Force an offline->online transition for the boot CPU on
>>> +      * resume.
>>> +      */
>>> +     cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
>>> +     cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
>>> +
>>
>>
>> I have a couple of comments about this:
>>
>> 1. Strictly speaking, we should be using the _FROZEN variants here (since the
>> tasks are still frozen).
>>
>> Like, cpu_callback(&cpu_nfb, CPU_DEAD_FROZEN, cpu);
>> and   cpu_callback(&cpu_nfb, CPU_ONLINE_FROZEN, cpu);
>>
>> Right now, since the same action is taken for either variant (ie., with or without
>> _FROZEN), it really doesn't matter. But still, good to be on the safer side no?
>
> Agreed that the _FROZEN counterparts are a better fit here since the
> tasks are still frozen.  Let me make this change.
>
>>
>> 2. Why are we skipping the CPU_UP_PREPARE_FROZEN callback?
>
> Mainly because the hrtimer_init has already been done at kernel init
> time.  But, this seems to be a good idea since the non-boot CPUs do
> transition through the CPU_UP_PREPARE_FROZEN phase on the way up
> during resume so it makes sense to keep the boot CPU path symmetrical.
>
> Let me make this change also.

Just sent the updated patch incorporating these two changes as well as
the earlier feedback from akpm.

>
>>
>> 3. How about hibernation? We don't hit this problem there?
>
> I am not too familiar with hibernation path and don't have a setup to
> test it either so can't really answer this one.
>
>>
>>> +     return;
>>> +}
>>> +
>>>  void __init lockup_detector_init(void)
>>>  {
>>>       void *cpu = (void *)(long)smp_processor_id();
>>
>>
>>
>> Regards,
>> Srivatsa S. Bhat
>>
>
>
>
> --
> Sameer



-- 
Sameer

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] watchdog: fix for lockup detector breakage on resume
  2012-04-30 21:10   ` Sameer Nanda
  2012-05-01 17:25     ` Sameer Nanda
@ 2012-05-02 13:14     ` Srivatsa S. Bhat
  1 sibling, 0 replies; 14+ messages in thread
From: Srivatsa S. Bhat @ 2012-05-02 13:14 UTC (permalink / raw)
  To: Sameer Nanda
  Cc: mingo, peterz, len.brown, pavel, rjw, akpm, dzickus, msb,
	linux-kernel, linux-pm, olofj

On 05/01/2012 02:40 AM, Sameer Nanda wrote:

> On Sun, Apr 29, 2012 at 11:12 PM, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>> On 04/27/2012 11:40 PM, Sameer Nanda wrote:
>>
>>> On the suspend/resume path the boot CPU does not go though an
>>> offline->online transition.  This breaks the NMI detector
>>> post-resume since it depends on PMU state that is lost when
>>> the system gets suspended.
>>>
>>> Fix this by forcing a CPU offline->online transition for the
>>> lockup detector on the boot CPU during resume.
>>>
>>> Signed-off-by: Sameer Nanda <snanda@chromium.org>
>>> ---
>>> To provide more context, we enable NMI watchdog on
>>> Chrome OS.  We have seen several reports of systems freezing
>>> up completely which indicated that the NMI watchdog was not
>>> firing for some reason.
>>>
>>> Debugging further, we found a simple way of repro'ing system
>>> freezes -- issuing the command 'tasket 1 sh -c "echo nmilockup > /proc/breakme"'
>>> after the system has been suspended/resumed one or more times.
>>>
>>> With this patch in place, the system freeze result in panics,
>>> as expected.  These panics provide a nice stack trace for us
>>> to debug the actual issue causing the freeze.
>>>
>>>
>>>  include/linux/sched.h  |    4 ++++
>>>  kernel/power/suspend.c |    3 +++
>>>  kernel/watchdog.c      |   16 ++++++++++++++++
>>>  3 files changed, 23 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index 81a173c..118cc38 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -317,6 +317,7 @@ extern int proc_dowatchdog_thresh(struct ctl_table *table, int write,
>>>                                 size_t *lenp, loff_t *ppos);
>>>  extern unsigned int  softlockup_panic;
>>>  void lockup_detector_init(void);
>>> +void lockup_detector_bootcpu_resume(void);
>>>  #else
>>>  static inline void touch_softlockup_watchdog(void)
>>>  {
>>> @@ -330,6 +331,9 @@ static inline void touch_all_softlockup_watchdogs(void)
>>>  static inline void lockup_detector_init(void)
>>>  {
>>>  }
>>> +static inline void lockup_detector_bootcpu_resume(void)
>>> +{
>>> +}
>>>  #endif
>>>
>>>  #ifdef CONFIG_DETECT_HUNG_TASK
>>> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
>>> index 396d262..0d262a8 100644
>>> --- a/kernel/power/suspend.c
>>> +++ b/kernel/power/suspend.c
>>> @@ -177,6 +177,9 @@ static int suspend_enter(suspend_state_t state, bool *wakeup)
>>>       arch_suspend_enable_irqs();
>>>       BUG_ON(irqs_disabled());
>>>
>>> +     /* Kick the lockup detector */
>>> +     lockup_detector_bootcpu_resume();
>>> +
>>>   Enable_cpus:
>>>       enable_nonboot_cpus();
>>>
>>> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>>> index df30ee0..dd2ac93 100644
>>> --- a/kernel/watchdog.c
>>> +++ b/kernel/watchdog.c
>>> @@ -585,6 +585,22 @@ static struct notifier_block __cpuinitdata cpu_nfb = {
>>>       .notifier_call = cpu_callback
>>>  };
>>>
>>> +void lockup_detector_bootcpu_resume(void)
>>> +{
>>> +     void *cpu = (void *)(long)smp_processor_id();
>>> +
>>> +     /*
>>> +      * On the suspend/resume path the boot CPU does not go though the
>>> +      * offline->online transition. This breaks the NMI detector post
>>> +      * resume. Force an offline->online transition for the boot CPU on
>>> +      * resume.
>>> +      */
>>> +     cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
>>> +     cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
>>> +
>>
>>
>> I have a couple of comments about this:
>>


[...]

>>
>> 3. How about hibernation? We don't hit this problem there?
> 
> I am not too familiar with hibernation path and don't have a setup to
> test it either so can't really answer this one.
> 


I tested this issue with hibernation, and I found that this problem doesn't occur
in that case! And since the paths (atleast related to CPU hotplug) are identical
in both the cases (suspend/hibernation), I decided to dig a bit more to understand
what is really happening underneath, just in case there is more to this problem
than what your patch addresses, or if the root cause is something different...

Anyway, I found a couple of things (see below), though I couldn't put them
together to fully explain the observed events.

1. perf_event_exit_cpu(boot-cpu) present in hibernation, missing in suspend path:
   ------------------------------------------------------------------------------

   In the case of hibernation, we ultimately power off the machine (noticeably
   different from suspend). And all the functions that lead to poweroff/halt/
   reboot, end up calling the callbacks registered with the reboot_notifier.
   (kernel/sys.c)

   And perf registers a callback to this, (perf_reboot(), kernel/events/core.c)
   which does:
	for_each_online_cpu(cpu)
		perf_event_exit_cpu(cpu);

   Note that we are doing this for *all* cpus, including the boot cpu. (In fact,
   if we come here from hibernate path, the only cpu online at the point is the
   boot cpu.)

   So calling perf_event_exit_cpu() for the boot cpu is one of the things that
   we miss in the suspend path. For the non-boot cpus, the CPU hotplug callback
   perf_cpu_notify() will call this function anyway.

   [By the way, I couldn't figure out where perf_event_init_cpu() is done for
   the boot cpu in the restore path of hibernation.]

2. perf_cpu_notify() has higher priority than watchdog's cpu_callback():
   --------------------------------------------------------------------

   Apart from the priority difference, the former listens to CPU_DOWN_PREPARE
   (early phase of CPU offline), but the latter listens to CPU_DEAD (later
   phase of CPU offline).

   And what all this means is that, for non-boot cpus, we call
   perf_event_exit_cpu() first, and only then call watchdog_disable()->
   watchdog_nmi_disable()->perf_event_disable().

   I don't know the implication of that ordering, but it did look a bit weird..


Do the above observations ring a bell to anyone?

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2] watchdog: fix for lockup detector breakage on resume
  2012-05-01 17:22 ` [PATCH v2] " Sameer Nanda
@ 2012-05-07  3:24   ` Anshuman Khandual
  2012-06-08 21:44     ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: Anshuman Khandual @ 2012-05-07  3:24 UTC (permalink / raw)
  To: Sameer Nanda
  Cc: mingo, peterz, len.brown, pavel, rjw, akpm, dzickus, msb,
	linux-kernel, linux-pm, olofj, Srivatsa S. Bhat

On Tuesday 01 May 2012 10:52 PM, Sameer Nanda wrote:

> On the suspend/resume path the boot CPU does not go though an
> offline->online transition.  This breaks the NMI detector
> post-resume since it depends on PMU state that is lost when
> the system gets suspended.


We should not have allowed the PMU to go with events counting on it across the suspend/resume transition
and find out that the state has been lost. This patch solves the problem of the NMI detector as we restart the
counter again when the boot cpu comes back online during resume. But the original cause (PMU going with
counters into the suspend state) which triggered this problem is still there. May be we should have called
perf_event_exit() on the boot cpu before going into the suspend state. 

Regards
Anshuman


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2] watchdog: fix for lockup detector breakage on resume
  2012-05-07  3:24   ` Anshuman Khandual
@ 2012-06-08 21:44     ` Andrew Morton
  0 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2012-06-08 21:44 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Sameer Nanda, mingo, peterz, len.brown, pavel, rjw, dzickus, msb,
	linux-kernel, linux-pm, olofj, Srivatsa S. Bhat

On Mon, 07 May 2012 08:54:57 +0530
Anshuman Khandual <khandual@linux.vnet.ibm.com> wrote:

> On Tuesday 01 May 2012 10:52 PM, Sameer Nanda wrote:
> 
> > On the suspend/resume path the boot CPU does not go though an
> > offline->online transition.  This breaks the NMI detector
> > post-resume since it depends on PMU state that is lost when
> > the system gets suspended.
> 
> 
> We should not have allowed the PMU to go with events counting on it across the suspend/resume transition
> and find out that the state has been lost. This patch solves the problem of the NMI detector as we restart the
> counter again when the boot cpu comes back online during resume. But the original cause (PMU going with
> counters into the suspend state) which triggered this problem is still there. May be we should have called
> perf_event_exit() on the boot cpu before going into the suspend state. 
> 

That sounds like a nicer solution.

An implementation would be nice ;) I'll keep the original patch on life
support until we get all this nailed down.


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2012-06-08 21:44 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-27 18:10 [PATCH] watchdog: fix for lockup detector breakage on resume Sameer Nanda
2012-04-27 21:12 ` Andrew Morton
2012-04-27 21:33   ` Rafael J. Wysocki
2012-04-27 21:40   ` Sameer Nanda
2012-04-27 22:03     ` Andrew Morton
2012-04-27 22:20       ` Sameer Nanda
2012-04-30  6:12 ` Srivatsa S. Bhat
2012-04-30 13:05   ` Don Zickus
2012-04-30 21:10   ` Sameer Nanda
2012-05-01 17:25     ` Sameer Nanda
2012-05-02 13:14     ` Srivatsa S. Bhat
2012-05-01 17:22 ` [PATCH v2] " Sameer Nanda
2012-05-07  3:24   ` Anshuman Khandual
2012-06-08 21:44     ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).