* [PATCH] softlockup: fix problem with long kernel pauses from kgdb @ 2009-07-27 20:03 Jason Wessel 2009-07-27 20:18 ` Peter Zijlstra 0 siblings, 1 reply; 9+ messages in thread From: Jason Wessel @ 2009-07-27 20:03 UTC (permalink / raw) To: Ingo Molnar; +Cc: lkml, Deng, Dongdong, peterz Ingo, Given that you are the maintainer of kernel/softlockup.c, I am seeking advice as to how to properly fix this problem. The short version of the problem is: * CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y * Attach to kgdb after boot * Wait for 90 seconds * Execute a continue in gdb * You receive a warning about softlockup The patch that follows is a lengthy analysis on the issue, but the question here is what is the right way to fix this? It seems that a provision is required in order to get the clock synced up prior to touching the watch dog. It was not clear that it was a good idea to unconditionally call the sched_clock_tick() from the softlockup touch code. Your input on this issue is greatly appreciated. Thanks, Jason. ---------- From: "Dongdong Deng" <Dongdong.Deng@windriver.com> Subject: [PATCH] softlockup: add sched_clock_tick() to avoid kernel warning on kgdb resume When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set and the sched_clock() was gets the time from hardware, such as from TSC, kgdb often causes softlock warning messages on resuming or detaching from a debug session. Sequence of events in the problem case: 1) "cpu sched clock" and "hardware time" are at 100 seconds prior to a call to kgdb_handle_exception() 2) Debugger waits in kgdb_handle_exception() for 80 seconds and on exit the following is called ... touch_softlockup_watchdog() --> __raw_get_cpu_var(touch_timestamp) = 0; 3) "cpu sched clock" = 100s (it didn't be updated, because the interrupt was disabled in kgdb) but the "hardware time" = 180s 4) The first timer interrupt after resuming from kgdb_handle_exception updates the watchdog from the "cpu sched clock" update_process_times() { ... run_local_timers() --> softlockup_tick() --> check (touch_timestamp == 0) (it is "YES" here, we have set "touch_timestamp = 0" at kgdb) --> __touch_softlockup_watchdog() ***(A)--> reset "touch_timestamp" to "get_timestamp()" (Here, the "touch_timestamp" will still be set to 100s.) ... scheduler_tick() ***(B)--> sched_clock_tick() (update "cpu sched clock" to "hardware time" = 180s) ... } 5) The Second timer interrupt handler appears to have a large jump and trips the softlockup warning. update_process_times() { ... run_local_timers() --> softlockup_tick() --> "cpu sched clock" - "touch_timestamp" = 180s-100s > 60s --> printk "soft lockup error messages" ... } note: ***(A) reset "touch_timestamp" to "get_timestamp(this_cpu)" Why "touch_timestamp" will be 100s, not 180s ? We enable "CONFIG_HAVE_UNSTABLE_SCHED_CLOCK", so the call trace of get_timestamp() is: get_timestamp(this_cpu) -->cpu_clock(this_cpu) -->sched_clock_cpu(this_cpu) -->__update_sched_clock(sched_clock_data, now) 22 static u64 __update_sched_clock(struct sched_clock_data *scd, u64 = now) 23 { 24 s64 delta = now - scd->tick_raw; 25 u64 clock, min_clock, max_clock; 26 27 WARN_ON_ONCE(!irqs_disabled()); 28 29 if (unlikely(delta < 0)) 30 delta = 0; 31 32 clock = scd->tick_gtod + delta; 33 34 min_clock = wrap_max(scd->tick_gtod, scd->clock); 35 max_clock = wrap_max(scd->clock, scd->tick_gtod + TICK_NSEC); 36 37 clock = wrap_max(clock, min_clock); 38 clock = wrap_min(clock, max_clock); 39 40 scd->clock = clock; 41 42 return scd->clock; 43 } Please pay attention to __update_sched_clock() function, it uses the GTOD tick value to create a window to filter crazy "now" values. So if "now" values is too big for sched_clock_data, it will be ignored. as the data of "step 3)", "now" = sched_clock() = "hardware time" = 180s and sched_clock_data = 100s. 180s is too big for 100s, it will be ignored. That's why the touch_timestamp will be set to 100s, not 180s. The fix is to simply invoke sched_clock_tick() to update "cpu sched clock" on exit from kgdb_handle_exception. Signed-off-by: Dongdong Deng <Dongdong.Deng@windriver.com> Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: peterz@infradead.org --- kernel/softlockup.c | 3 +++ 1 file changed, 3 insertions(+) --- a/kernel/softlockup.c +++ b/kernel/softlockup.c @@ -118,6 +118,9 @@ void softlockup_tick(void) } if (touch_timestamp == 0) { + /* If the time stamp was touched externally make sure the + * scheduler tick is up to date as well */ + sched_clock_tick(); __touch_softlockup_watchdog(); return; } ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] softlockup: fix problem with long kernel pauses from kgdb 2009-07-27 20:03 [PATCH] softlockup: fix problem with long kernel pauses from kgdb Jason Wessel @ 2009-07-27 20:18 ` Peter Zijlstra 2009-07-27 21:25 ` Jason Wessel 2009-07-28 15:05 ` Jason Wessel 0 siblings, 2 replies; 9+ messages in thread From: Peter Zijlstra @ 2009-07-27 20:18 UTC (permalink / raw) To: Jason Wessel; +Cc: Ingo Molnar, lkml, Deng, Dongdong On Mon, 2009-07-27 at 15:03 -0500, Jason Wessel wrote: > > The fix is to simply invoke sched_clock_tick() to update "cpu sched > clock" on exit from kgdb_handle_exception. Is that a regular IRQ context, or is that NMI context? > Signed-off-by: Dongdong Deng <Dongdong.Deng@windriver.com> > Signed-off-by: Jason Wessel <jason.wessel@windriver.com> > Cc: Ingo Molnar <mingo@elte.hu> > Cc: peterz@infradead.org > --- > kernel/softlockup.c | 3 +++ > 1 file changed, 3 insertions(+) > > --- a/kernel/softlockup.c > +++ b/kernel/softlockup.c > @@ -118,6 +118,9 @@ void softlockup_tick(void) > } > > if (touch_timestamp == 0) { > + /* If the time stamp was touched externally make sure the > + * scheduler tick is up to date as well */ > + sched_clock_tick(); > __touch_softlockup_watchdog(); > return; > } > Aside from the funny comment style (please fix) the fix does look sensible. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] softlockup: fix problem with long kernel pauses from kgdb 2009-07-27 20:18 ` Peter Zijlstra @ 2009-07-27 21:25 ` Jason Wessel 2009-07-28 15:05 ` Jason Wessel 1 sibling, 0 replies; 9+ messages in thread From: Jason Wessel @ 2009-07-27 21:25 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Ingo Molnar, lkml, Deng, Dongdong Peter Zijlstra wrote: > On Mon, 2009-07-27 at 15:03 -0500, Jason Wessel wrote: > >> The fix is to simply invoke sched_clock_tick() to update "cpu sched >> clock" on exit from kgdb_handle_exception. >> > > Is that a regular IRQ context, or is that NMI context? > > The patch text there is incorrect (which I can fix), because there were 3 other versions of this particular patch. The version of this patch that tried to use sched_clock_tick() from the exception context hung the system because there was lock contention for CPUs that were not in the running state. In this version, the sched_clock_tick() added in the softlockup_tick(), which did not result in a hard hang. I am pretty sure that softlockup_tick() runs as a part of the timer interrupt processing from run_local_timers(), so there was no further need to turn off the interrupts. Jason. >> Signed-off-by: Dongdong Deng <Dongdong.Deng@windriver.com> >> Signed-off-by: Jason Wessel <jason.wessel@windriver.com> >> Cc: Ingo Molnar <mingo@elte.hu> >> Cc: peterz@infradead.org >> --- >> kernel/softlockup.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> --- a/kernel/softlockup.c >> +++ b/kernel/softlockup.c >> @@ -118,6 +118,9 @@ void softlockup_tick(void) >> } >> >> if (touch_timestamp == 0) { >> + /* If the time stamp was touched externally make sure the >> + * scheduler tick is up to date as well */ >> + sched_clock_tick(); >> __touch_softlockup_watchdog(); >> return; >> } >> >> > > Aside from the funny comment style (please fix) the fix does look > sensible. > No problem. Thanks, Jason. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] softlockup: fix problem with long kernel pauses from kgdb 2009-07-27 20:18 ` Peter Zijlstra 2009-07-27 21:25 ` Jason Wessel @ 2009-07-28 15:05 ` Jason Wessel 2009-08-04 14:16 ` Ingo Molnar 1 sibling, 1 reply; 9+ messages in thread From: Jason Wessel @ 2009-07-28 15:05 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Ingo Molnar, lkml, Deng, Dongdong Peter Zijlstra wrote: > On Mon, 2009-07-27 at 15:03 -0500, Jason Wessel wrote: >> The fix is to simply invoke sched_clock_tick() to update "cpu sched >> clock" on exit from kgdb_handle_exception. > > Is that a regular IRQ context, or is that NMI context? > >> Signed-off-by: Dongdong Deng <Dongdong.Deng@windriver.com> >> Signed-off-by: Jason Wessel <jason.wessel@windriver.com> >> Cc: Ingo Molnar <mingo@elte.hu> >> Cc: peterz@infradead.org >> --- >> kernel/softlockup.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> --- a/kernel/softlockup.c >> +++ b/kernel/softlockup.c >> @@ -118,6 +118,9 @@ void softlockup_tick(void) >> } >> >> if (touch_timestamp == 0) { >> + /* If the time stamp was touched externally make sure the >> + * scheduler tick is up to date as well */ >> + sched_clock_tick(); >> __touch_softlockup_watchdog(); >> return; >> } >> > > Aside from the funny comment style (please fix) the fix does look > sensible. It turns out that further testing of this patch shows a regression in the ability to detect certain lockups. It is a direct result of the way the scheduling code makes use of the touch_softlockup_watchdog(). With the above proposed patch the tick was getting updated after a resume, but was also getting updated with the run_timers(), and if that happened before the softlockup tick, no softlockup would get reported (note that I was using some test code to induce softlockups). The patch below is a bit more invasive but solves the problem by allowing kgdb to request that the sched cpu clock is updated only when returning from a state where we know we need to force the update. Would this change be an acceptable solution to allow kgdb to peacefully exist with the softlockup code? Thanks, Jason. ----- From: Jason Wessel <jason.wessel@windriver.com> Subject: [PATCH] softlockup: add sched_clock_tick() to avoid kernel warning on kgdb resume When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set sched_clock() gets the time from hardware, such as from TSC. In this configuration kgdb will report a softlock warning messages on resuming or detaching from a debug session. Sequence of events in the problem case: 1) "cpu sched clock" and "hardware time" are at 100 sec prior to a call to kgdb_handle_exception() 2) Debugger waits in kgdb_handle_exception() for 80 sec and on exit the following is called ... touch_softlockup_watchdog() --> __raw_get_cpu_var(touch_timestamp) = 0; 3) "cpu sched clock" = 100s (it was not updated, because the interrupt was disabled in kgdb) but the "hardware time" = 180 sec 4) The first timer interrupt after resuming from kgdb_handle_exception updates the watchdog from the "cpu sched clock" update_process_times() { ... run_local_timers() --> softlockup_tick() --> check (touch_timestamp == 0) (it is "YES" here, we have set "touch_timestamp = 0" at kgdb) --> __touch_softlockup_watchdog() ***(A)--> reset "touch_timestamp" to "get_timestamp()" (Here, the "touch_timestamp" will still be set to 100s.) ... scheduler_tick() ***(B)--> sched_clock_tick() (update "cpu sched clock" to "hardware time" = 180s) ... } 5) The Second timer interrupt handler appears to have a large jump and trips the softlockup warning. update_process_times() { ... run_local_timers() --> softlockup_tick() --> "cpu sched clock" - "touch_timestamp" = 180s-100s > 60s --> printk "soft lockup error messages" ... } note: ***(A) reset "touch_timestamp" to "get_timestamp(this_cpu)" Why "touch_timestamp" is 100 sec, instead of 180 sec? With the CONFIG_HAVE_UNSTABLE_SCHED_CLOCK" set the call trace of get_timestamp() is: get_timestamp(this_cpu) -->cpu_clock(this_cpu) -->sched_clock_cpu(this_cpu) -->__update_sched_clock(sched_clock_data, now) The __update_sched_clock() function uses the GTOD tick value to create a window to normalize the "now" values. So if "now" values is too big for sched_clock_data, it will be ignored. The fix is to invoke sched_clock_tick() to update "cpu sched clock" in order to recover from this state. This is done by introducing the function touch_softlockup_watchdog_sync(), which allows kgdb to request that the sched clock is updated when the watchdog thread runs the first time after a resume from kgdb. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Dongdong Deng <Dongdong.Deng@windriver.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: peterz@infradead.org --- include/linux/sched.h | 4 ++++ kernel/kgdb.c | 6 +++--- kernel/softlockup.c | 16 ++++++++++++++++ 3 files changed, 23 insertions(+), 3 deletions(-) --- a/kernel/softlockup.c +++ b/kernel/softlockup.c @@ -79,6 +79,14 @@ void touch_softlockup_watchdog(void) } EXPORT_SYMBOL(touch_softlockup_watchdog); +static int softlock_touch_sync[NR_CPUS]; + +void touch_softlockup_watchdog_sync(void) +{ + softlock_touch_sync[raw_smp_processor_id()] = 1; + __raw_get_cpu_var(touch_timestamp) = 0; +} + void touch_all_softlockup_watchdogs(void) { int cpu; @@ -118,6 +126,14 @@ void softlockup_tick(void) } if (touch_timestamp == 0) { + if (unlikely(softlock_touch_sync[this_cpu])) { + /* + * If the time stamp was touched atomically + * make sure the scheduler tick is up to date. + */ + softlock_touch_sync[this_cpu] = 0; + sched_clock_tick(); + } __touch_softlockup_watchdog(); return; } --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -300,6 +300,7 @@ extern void sched_show_task(struct task_ #ifdef CONFIG_DETECT_SOFTLOCKUP extern void softlockup_tick(void); extern void touch_softlockup_watchdog(void); +extern void touch_softlockup_watchdog_sync(void); extern void touch_all_softlockup_watchdogs(void); extern int proc_dosoftlockup_thresh(struct ctl_table *table, int write, struct file *filp, void __user *buffer, @@ -313,6 +314,9 @@ static inline void softlockup_tick(void) static inline void touch_softlockup_watchdog(void) { } +static inline void touch_softlockup_watchdog_sync(void) +{ +} static inline void touch_all_softlockup_watchdogs(void) { } --- a/kernel/kgdb.c +++ b/kernel/kgdb.c @@ -590,7 +590,7 @@ static void kgdb_wait(struct pt_regs *re /* Signal the primary CPU that we are done: */ atomic_set(&cpu_in_kgdb[cpu], 0); - touch_softlockup_watchdog(); + touch_softlockup_watchdog_sync(); clocksource_touch_watchdog(); local_irq_restore(flags); } @@ -1433,7 +1433,7 @@ acquirelock: atomic_read(&kgdb_cpu_doing_single_step) != cpu) { atomic_set(&kgdb_active, -1); - touch_softlockup_watchdog(); + touch_softlockup_watchdog_sync(); clocksource_touch_watchdog(); local_irq_restore(flags); @@ -1526,7 +1526,7 @@ acquirelock: kgdb_restore: /* Free kgdb_active */ atomic_set(&kgdb_active, -1); - touch_softlockup_watchdog(); + touch_softlockup_watchdog_sync(); clocksource_touch_watchdog(); local_irq_restore(flags); ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] softlockup: fix problem with long kernel pauses from kgdb 2009-07-28 15:05 ` Jason Wessel @ 2009-08-04 14:16 ` Ingo Molnar 2009-08-04 14:53 ` [PATCH] softlockup: fix problem with long kernel pauses fromkgdb Jason Wessel 2009-09-26 3:01 ` Yong Zhang 0 siblings, 2 replies; 9+ messages in thread From: Ingo Molnar @ 2009-08-04 14:16 UTC (permalink / raw) To: Jason Wessel, Thomas Gleixner; +Cc: Peter Zijlstra, lkml, Deng, Dongdong * Jason Wessel <jason.wessel@windriver.com> wrote: > Peter Zijlstra wrote: > > On Mon, 2009-07-27 at 15:03 -0500, Jason Wessel wrote: > >> The fix is to simply invoke sched_clock_tick() to update "cpu sched > >> clock" on exit from kgdb_handle_exception. > > > > Is that a regular IRQ context, or is that NMI context? > > > >> Signed-off-by: Dongdong Deng <Dongdong.Deng@windriver.com> > >> Signed-off-by: Jason Wessel <jason.wessel@windriver.com> > >> Cc: Ingo Molnar <mingo@elte.hu> > >> Cc: peterz@infradead.org > >> --- > >> kernel/softlockup.c | 3 +++ > >> 1 file changed, 3 insertions(+) > >> > >> --- a/kernel/softlockup.c > >> +++ b/kernel/softlockup.c > >> @@ -118,6 +118,9 @@ void softlockup_tick(void) > >> } > >> > >> if (touch_timestamp == 0) { > >> + /* If the time stamp was touched externally make sure the > >> + * scheduler tick is up to date as well */ > >> + sched_clock_tick(); > >> __touch_softlockup_watchdog(); > >> return; > >> } > >> > > > > Aside from the funny comment style (please fix) the fix does look > > sensible. > > It turns out that further testing of this patch shows a regression in > the ability to detect certain lockups. It is a direct result of the > way the scheduling code makes use of the touch_softlockup_watchdog(). > With the above proposed patch the tick was getting updated after a > resume, but was also getting updated with the run_timers(), and if > that happened before the softlockup tick, no softlockup would get > reported (note that I was using some test code to induce softlockups). > > The patch below is a bit more invasive but solves the problem by > allowing kgdb to request that the sched cpu clock is updated only when > returning from a state where we know we need to force the update. > > Would this change be an acceptable solution to allow kgdb to > peacefully exist with the softlockup code? > > Thanks, > Jason. > > > ----- > From: Jason Wessel <jason.wessel@windriver.com> > Subject: [PATCH] softlockup: add sched_clock_tick() to avoid kernel warning on kgdb resume > > When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set sched_clock() gets the > time from hardware, such as from TSC. In this configuration kgdb will > report a softlock warning messages on resuming or detaching from a > debug session. > > Sequence of events in the problem case: > > 1) "cpu sched clock" and "hardware time" are at 100 sec prior > to a call to kgdb_handle_exception() > > 2) Debugger waits in kgdb_handle_exception() for 80 sec and on exit > the following is called ... touch_softlockup_watchdog() --> > __raw_get_cpu_var(touch_timestamp) = 0; > > 3) "cpu sched clock" = 100s (it was not updated, because the interrupt > was disabled in kgdb) but the "hardware time" = 180 sec > > 4) The first timer interrupt after resuming from kgdb_handle_exception > updates the watchdog from the "cpu sched clock" > > update_process_times() { ... run_local_timers() --> softlockup_tick() > --> check (touch_timestamp == 0) (it is "YES" here, we have set > "touch_timestamp = 0" at kgdb) --> __touch_softlockup_watchdog() > ***(A)--> reset "touch_timestamp" to "get_timestamp()" (Here, the > "touch_timestamp" will still be set to 100s.) ... > > scheduler_tick() ***(B)--> sched_clock_tick() (update "cpu sched > clock" to "hardware time" = 180s) ... } > > 5) The Second timer interrupt handler appears to have a large jump and > trips the softlockup warning. > > update_process_times() { ... run_local_timers() --> softlockup_tick() > --> "cpu sched clock" - "touch_timestamp" = 180s-100s > 60s --> printk > "soft lockup error messages" ... } > > note: ***(A) reset "touch_timestamp" to "get_timestamp(this_cpu)" > > Why "touch_timestamp" is 100 sec, instead of 180 sec? > > With the CONFIG_HAVE_UNSTABLE_SCHED_CLOCK" set the call trace of > get_timestamp() is: > > get_timestamp(this_cpu) -->cpu_clock(this_cpu) > -->sched_clock_cpu(this_cpu) -->__update_sched_clock(sched_clock_data, > now) > > The __update_sched_clock() function uses the GTOD tick value to create > a window to normalize the "now" values. So if "now" values is too big > for sched_clock_data, it will be ignored. > > The fix is to invoke sched_clock_tick() to update "cpu sched clock" in > order to recover from this state. This is done by introducing the > function touch_softlockup_watchdog_sync(), which allows kgdb to > request that the sched clock is updated when the watchdog thread runs > the first time after a resume from kgdb. > > Signed-off-by: Jason Wessel <jason.wessel@windriver.com> > Signed-off-by: Dongdong Deng <Dongdong.Deng@windriver.com> > Cc: Ingo Molnar <mingo@elte.hu> > Cc: peterz@infradead.org > > --- > include/linux/sched.h | 4 ++++ > kernel/kgdb.c | 6 +++--- > kernel/softlockup.c | 16 ++++++++++++++++ > 3 files changed, 23 insertions(+), 3 deletions(-) > > --- a/kernel/softlockup.c > +++ b/kernel/softlockup.c > @@ -79,6 +79,14 @@ void touch_softlockup_watchdog(void) > } > EXPORT_SYMBOL(touch_softlockup_watchdog); > > +static int softlock_touch_sync[NR_CPUS]; > + > +void touch_softlockup_watchdog_sync(void) > +{ > + softlock_touch_sync[raw_smp_processor_id()] = 1; > + __raw_get_cpu_var(touch_timestamp) = 0; > +} > + > void touch_all_softlockup_watchdogs(void) > { > int cpu; > @@ -118,6 +126,14 @@ void softlockup_tick(void) > } > > if (touch_timestamp == 0) { > + if (unlikely(softlock_touch_sync[this_cpu])) { > + /* > + * If the time stamp was touched atomically > + * make sure the scheduler tick is up to date. > + */ > + softlock_touch_sync[this_cpu] = 0; > + sched_clock_tick(); > + } Hm, this looks quite ugly. Peter, Thomas, can you think of a cleaner solution? Ingo ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] softlockup: fix problem with long kernel pauses fromkgdb 2009-08-04 14:16 ` Ingo Molnar @ 2009-08-04 14:53 ` Jason Wessel 2009-08-04 14:59 ` Ingo Molnar 2009-09-26 3:01 ` Yong Zhang 1 sibling, 1 reply; 9+ messages in thread From: Jason Wessel @ 2009-08-04 14:53 UTC (permalink / raw) To: Ingo Molnar; +Cc: Thomas Gleixner, Peter Zijlstra, lkml, Deng, Dongdong Ingo Molnar wrote: > * Jason Wessel <jason.wessel@windriver.com> wrote: > >> ----- >> From: Jason Wessel <jason.wessel@windriver.com> >> Subject: [PATCH] softlockup: add sched_clock_tick() to avoid kernel warning on kgdb resume >> >> When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set sched_clock() gets the >> time from hardware, such as from TSC. In this configuration kgdb will >> report a softlock warning messages on resuming or detaching from a >> debug session. >> > > Hm, this looks quite ugly. Peter, Thomas, can you think of a cleaner > solution? > Below was a more specific test case I received from Dongdong Deng which did not require kgdb. The test case is not something to merge, it is just a simple module to build to demonstrate the problem outside of kgdb. The patch I submitted might have been on the "ugly" side, but there was not an obvious way to solve the problem without making changes in kernel/sched*. I opted for something that was entirely self contained to the softlockup code. Jason. ============= test case ================ /* Information about this test case. * ------------------------------------- * 1: CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y * * 2: export clocksource_touch_watchdog for this test case to build. * * * diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c * index 7466cb8..c81f9a3 100644 * --- a/kernel/time/clocksource.c * +++ b/kernel/time/clocksource.c * @@ -319,6 +319,7 @@ void clocksource_touch_watchdog(void) * { * clocksource_resume_watchdog(); * } * +EXPORT_SYMBOL(clocksource_touch_watchdog); * * 3: build following module, insmod and observe the softlockup * warning or watch your system reboot if you have reboot * on softlockup set. * */ #include <linux/module.h> #include <linux/init.h> #include <linux/irqflags.h> #include <linux/delay.h> #include <linux/sched.h> #include <linux/clocksource.h> static int example_init(void) { unsigned long flags; unsigned i; printk(KERN_NOTICE"softlockup test init\n"); local_irq_save(flags); for (i = 0; i < 70; i++) mdelay(1000); touch_softlockup_watchdog(); clocksource_touch_watchdog(); local_irq_restore(flags); return 0; } static void example_exit(void) { printk(KERN_NOTICE"softlockup test exit\n"); } module_init(example_init); module_exit(example_exit); MODULE_LICENSE("GPL"); ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] softlockup: fix problem with long kernel pauses fromkgdb 2009-08-04 14:53 ` [PATCH] softlockup: fix problem with long kernel pauses fromkgdb Jason Wessel @ 2009-08-04 14:59 ` Ingo Molnar 2009-08-21 12:42 ` [PATCH] softlockup: fix problem with long kernel pauses from kgdb DDD 0 siblings, 1 reply; 9+ messages in thread From: Ingo Molnar @ 2009-08-04 14:59 UTC (permalink / raw) To: Jason Wessel; +Cc: Thomas Gleixner, Peter Zijlstra, lkml, Deng, Dongdong * Jason Wessel <jason.wessel@windriver.com> wrote: > Ingo Molnar wrote: > > * Jason Wessel <jason.wessel@windriver.com> wrote: > > > >> ----- > >> From: Jason Wessel <jason.wessel@windriver.com> > >> Subject: [PATCH] softlockup: add sched_clock_tick() to avoid kernel warning on kgdb resume > >> > >> When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set sched_clock() gets the > >> time from hardware, such as from TSC. In this configuration kgdb will > >> report a softlock warning messages on resuming or detaching from a > >> debug session. > >> > > > > Hm, this looks quite ugly. Peter, Thomas, can you think of a > > cleaner solution? > > > > Below was a more specific test case I received from Dongdong Deng > which did not require kgdb. The test case is not something to > merge, it is just a simple module to build to demonstrate the > problem outside of kgdb. > > The patch I submitted might have been on the "ugly" side, but > there was not an obvious way to solve the problem without making > changes in kernel/sched*. I opted for something that was entirely > self contained to the softlockup code. it's not really the fault of your patch really - all these integrations along dynticks, scheduler time and softlockup feel a bit awkward. Wondering whether there's some nicer way. Ingo ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] softlockup: fix problem with long kernel pauses from kgdb 2009-08-04 14:59 ` Ingo Molnar @ 2009-08-21 12:42 ` DDD 0 siblings, 0 replies; 9+ messages in thread From: DDD @ 2009-08-21 12:42 UTC (permalink / raw) To: Jason Wessel, Ingo Molnar; +Cc: Thomas Gleixner, Peter Zijlstra, lkml Hi Jason & Ingo, If we apply Peter's patch which was send out just a moment ago, I believe this issue will gone. :-) Thanks, Dongdong [PATCH -rt] timer: delay waking softirqs from the jiffy tick ... ... Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- kernel/timer.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/kernel/timer.c b/kernel/timer.c index 8137cce..96ac1b4 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -1221,10 +1221,10 @@ void update_process_times(int user_tick) /* Note: this timer irq context must be accounted for as well. */ account_process_tick(p, user_tick); + scheduler_tick(); run_local_timers(); if (rcu_pending(cpu)) rcu_check_callbacks(cpu, user_tick); - scheduler_tick(); run_posix_cpu_timers(p); } On Tue, 2009-08-04 at 16:59 +0200, Ingo Molnar wrote: > * Jason Wessel <jason.wessel@windriver.com> wrote: > > > Ingo Molnar wrote: > > > * Jason Wessel <jason.wessel@windriver.com> wrote: > > > > > >> ----- > > >> From: Jason Wessel <jason.wessel@windriver.com> > > >> Subject: [PATCH] softlockup: add sched_clock_tick() to avoid kernel warning on kgdb resume > > >> > > >> When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set sched_clock() gets the > > >> time from hardware, such as from TSC. In this configuration kgdb will > > >> report a softlock warning messages on resuming or detaching from a > > >> debug session. > > >> > > > > > > Hm, this looks quite ugly. Peter, Thomas, can you think of a > > > cleaner solution? > > > > > > > Below was a more specific test case I received from Dongdong Deng > > which did not require kgdb. The test case is not something to > > merge, it is just a simple module to build to demonstrate the > > problem outside of kgdb. > > > > The patch I submitted might have been on the "ugly" side, but > > there was not an obvious way to solve the problem without making > > changes in kernel/sched*. I opted for something that was entirely > > self contained to the softlockup code. > > it's not really the fault of your patch really - all these > integrations along dynticks, scheduler time and softlockup feel a > bit awkward. Wondering whether there's some nicer way. > > Ingo ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH] softlockup: fix problem with long kernel pauses from kgdb 2009-08-04 14:16 ` Ingo Molnar 2009-08-04 14:53 ` [PATCH] softlockup: fix problem with long kernel pauses fromkgdb Jason Wessel @ 2009-09-26 3:01 ` Yong Zhang 1 sibling, 0 replies; 9+ messages in thread From: Yong Zhang @ 2009-09-26 3:01 UTC (permalink / raw) To: Ingo Molnar Cc: Jason Wessel, Thomas Gleixner, Peter Zijlstra, lkml, Deng, Dongdong [-- Attachment #1: Type: text/plain, Size: 7147 bytes --] On Tue, Aug 4, 2009 at 10:16 PM, Ingo Molnar <mingo@elte.hu> wrote: > > * Jason Wessel <jason.wessel@windriver.com> wrote: > >> Peter Zijlstra wrote: >> > On Mon, 2009-07-27 at 15:03 -0500, Jason Wessel wrote: >> >> The fix is to simply invoke sched_clock_tick() to update "cpu sched >> >> clock" on exit from kgdb_handle_exception. >> > >> > Is that a regular IRQ context, or is that NMI context? >> > >> >> Signed-off-by: Dongdong Deng <Dongdong.Deng@windriver.com> >> >> Signed-off-by: Jason Wessel <jason.wessel@windriver.com> >> >> Cc: Ingo Molnar <mingo@elte.hu> >> >> Cc: peterz@infradead.org >> >> --- >> >> kernel/softlockup.c | 3 +++ >> >> 1 file changed, 3 insertions(+) >> >> >> >> --- a/kernel/softlockup.c >> >> +++ b/kernel/softlockup.c >> >> @@ -118,6 +118,9 @@ void softlockup_tick(void) >> >> } >> >> >> >> if (touch_timestamp == 0) { >> >> + /* If the time stamp was touched externally make sure the >> >> + * scheduler tick is up to date as well */ >> >> + sched_clock_tick(); >> >> __touch_softlockup_watchdog(); >> >> return; >> >> } >> >> >> > >> > Aside from the funny comment style (please fix) the fix does look >> > sensible. >> >> It turns out that further testing of this patch shows a regression in >> the ability to detect certain lockups. It is a direct result of the >> way the scheduling code makes use of the touch_softlockup_watchdog(). >> With the above proposed patch the tick was getting updated after a >> resume, but was also getting updated with the run_timers(), and if >> that happened before the softlockup tick, no softlockup would get >> reported (note that I was using some test code to induce softlockups). >> >> The patch below is a bit more invasive but solves the problem by >> allowing kgdb to request that the sched cpu clock is updated only when >> returning from a state where we know we need to force the update. >> >> Would this change be an acceptable solution to allow kgdb to >> peacefully exist with the softlockup code? >> >> Thanks, >> Jason. >> >> >> ----- >> From: Jason Wessel <jason.wessel@windriver.com> >> Subject: [PATCH] softlockup: add sched_clock_tick() to avoid kernel warning on kgdb resume >> >> When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set sched_clock() gets the >> time from hardware, such as from TSC. In this configuration kgdb will >> report a softlock warning messages on resuming or detaching from a >> debug session. >> >> Sequence of events in the problem case: >> >> 1) "cpu sched clock" and "hardware time" are at 100 sec prior >> to a call to kgdb_handle_exception() >> >> 2) Debugger waits in kgdb_handle_exception() for 80 sec and on exit >> the following is called ... touch_softlockup_watchdog() --> >> __raw_get_cpu_var(touch_timestamp) = 0; >> >> 3) "cpu sched clock" = 100s (it was not updated, because the interrupt >> was disabled in kgdb) but the "hardware time" = 180 sec >> >> 4) The first timer interrupt after resuming from kgdb_handle_exception >> updates the watchdog from the "cpu sched clock" >> >> update_process_times() { ... run_local_timers() --> softlockup_tick() >> --> check (touch_timestamp == 0) (it is "YES" here, we have set >> "touch_timestamp = 0" at kgdb) --> __touch_softlockup_watchdog() >> ***(A)--> reset "touch_timestamp" to "get_timestamp()" (Here, the >> "touch_timestamp" will still be set to 100s.) ... >> >> scheduler_tick() ***(B)--> sched_clock_tick() (update "cpu sched >> clock" to "hardware time" = 180s) ... } >> >> 5) The Second timer interrupt handler appears to have a large jump and >> trips the softlockup warning. >> >> update_process_times() { ... run_local_timers() --> softlockup_tick() >> --> "cpu sched clock" - "touch_timestamp" = 180s-100s > 60s --> printk >> "soft lockup error messages" ... } >> >> note: ***(A) reset "touch_timestamp" to "get_timestamp(this_cpu)" >> >> Why "touch_timestamp" is 100 sec, instead of 180 sec? >> >> With the CONFIG_HAVE_UNSTABLE_SCHED_CLOCK" set the call trace of >> get_timestamp() is: >> >> get_timestamp(this_cpu) -->cpu_clock(this_cpu) >> -->sched_clock_cpu(this_cpu) -->__update_sched_clock(sched_clock_data, >> now) >> >> The __update_sched_clock() function uses the GTOD tick value to create >> a window to normalize the "now" values. So if "now" values is too big >> for sched_clock_data, it will be ignored. >> >> The fix is to invoke sched_clock_tick() to update "cpu sched clock" in >> order to recover from this state. This is done by introducing the >> function touch_softlockup_watchdog_sync(), which allows kgdb to >> request that the sched clock is updated when the watchdog thread runs >> the first time after a resume from kgdb. >> >> Signed-off-by: Jason Wessel <jason.wessel@windriver.com> >> Signed-off-by: Dongdong Deng <Dongdong.Deng@windriver.com> >> Cc: Ingo Molnar <mingo@elte.hu> >> Cc: peterz@infradead.org >> >> --- >> include/linux/sched.h | 4 ++++ >> kernel/kgdb.c | 6 +++--- >> kernel/softlockup.c | 16 ++++++++++++++++ >> 3 files changed, 23 insertions(+), 3 deletions(-) >> >> --- a/kernel/softlockup.c >> +++ b/kernel/softlockup.c >> @@ -79,6 +79,14 @@ void touch_softlockup_watchdog(void) >> } >> EXPORT_SYMBOL(touch_softlockup_watchdog); >> >> +static int softlock_touch_sync[NR_CPUS]; >> + >> +void touch_softlockup_watchdog_sync(void) >> +{ >> + softlock_touch_sync[raw_smp_processor_id()] = 1; >> + __raw_get_cpu_var(touch_timestamp) = 0; >> +} >> + >> void touch_all_softlockup_watchdogs(void) >> { >> int cpu; >> @@ -118,6 +126,14 @@ void softlockup_tick(void) >> } >> >> if (touch_timestamp == 0) { >> + if (unlikely(softlock_touch_sync[this_cpu])) { >> + /* >> + * If the time stamp was touched atomically >> + * make sure the scheduler tick is up to date. >> + */ >> + softlock_touch_sync[this_cpu] = 0; >> + sched_clock_tick(); >> + } > > Hm, this looks quite ugly. Peter, Thomas, can you think of a cleaner > solution? > We need not to sync up sched_clock in softlockup_tick, it can just return and sync up the timestamp in the next tick. This will not touch what should be done by dynticks, scheduler time and so on and keep softlockup_tick clean. I have rewrote the patch based on above, please check the attachment. Thanks, Yong > Ingo > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > [-- Attachment #2: 0001-softlockup-introduce-touch_softlockup_watchdog_sync.patch --] [-- Type: text/x-patch, Size: 3305 bytes --] From 1821356596bdd28dfc69f708a6929394da351001 Mon Sep 17 00:00:00 2001 From: Yong Zhang <yong.zhang0@gmail.com> Date: Sat, 26 Sep 2009 10:37:14 +0800 Subject: [PATCH] softlockup: introduce touch_softlockup_watchdog_sync() In case of system halted for a long time, system will do time sync up in the first incoming tick. But if we also touch softlockup_tick timestamp at the same time, softlockup_tick could keep a old timestamp and this will lead annoying softlockup warning. So move softlockup syncing it's timestamp to the next tick. Signed-off-by: Jason Wessel <jason.wessel@windriver.com> Signed-off-by: Dongdong Deng <Dongdong.Deng@windriver.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <http://lkml.org/lkml/2009/7/27/317> Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> --- include/linux/sched.h | 4 ++++ kernel/softlockup.c | 18 ++++++++++++++++++ 2 files changed, 22 insertions(+), 0 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 75e6e60..1aa4574 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -307,6 +307,7 @@ extern void sched_show_task(struct task_struct *p); #ifdef CONFIG_DETECT_SOFTLOCKUP extern void softlockup_tick(void); extern void touch_softlockup_watchdog(void); +extern void touch_softlockup_watchdog_sync(void); extern void touch_all_softlockup_watchdogs(void); extern int proc_dosoftlockup_thresh(struct ctl_table *table, int write, void __user *buffer, @@ -320,6 +321,9 @@ static inline void softlockup_tick(void) static inline void touch_softlockup_watchdog(void) { } +static inline void touch_softlockup_watchdog_sync(void) +{ +} static inline void touch_all_softlockup_watchdogs(void) { } diff --git a/kernel/softlockup.c b/kernel/softlockup.c index 81324d1..628d531 100644 --- a/kernel/softlockup.c +++ b/kernel/softlockup.c @@ -25,6 +25,7 @@ static DEFINE_SPINLOCK(print_lock); static DEFINE_PER_CPU(unsigned long, touch_timestamp); static DEFINE_PER_CPU(unsigned long, print_timestamp); static DEFINE_PER_CPU(struct task_struct *, watchdog_task); +static DEFINE_PER_CPU(bool, softlock_touch_sync); static int __read_mostly did_panic; int __read_mostly softlockup_thresh = 60; @@ -79,6 +80,13 @@ void touch_softlockup_watchdog(void) } EXPORT_SYMBOL(touch_softlockup_watchdog); +void touch_softlockup_watchdog_sync(void) +{ + __raw_get_cpu_var(touch_timestamp) = 0; + __raw_get_cpu_var(softlock_touch_sync) = true; +} +EXPORT_SYMBOL(touch_softlockup_watchdog_sync); + void touch_all_softlockup_watchdogs(void) { int cpu; @@ -117,6 +125,15 @@ void softlockup_tick(void) return; } + /* + * If the scheduler tick is not up to date. Skip this check + * and wait for the timestamp to to synced in the next tick + */ + if (unlikely(per_cpu(softlock_touch_sync, this_cpu))) { + per_cpu(softlock_touch_sync, this_cpu) = false; + return; + } + if (touch_timestamp == 0) { __touch_softlockup_watchdog(); return; @@ -216,6 +233,7 @@ cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu) return NOTIFY_BAD; } per_cpu(touch_timestamp, hotcpu) = 0; + per_cpu(softlock_touch_sync, hotcpu) = false; per_cpu(watchdog_task, hotcpu) = p; kthread_bind(p, hotcpu); break; -- 1.6.0.4 ^ permalink raw reply related [flat|nested] 9+ messages in thread
end of thread, other threads:[~2009-09-26 3:09 UTC | newest] Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-07-27 20:03 [PATCH] softlockup: fix problem with long kernel pauses from kgdb Jason Wessel 2009-07-27 20:18 ` Peter Zijlstra 2009-07-27 21:25 ` Jason Wessel 2009-07-28 15:05 ` Jason Wessel 2009-08-04 14:16 ` Ingo Molnar 2009-08-04 14:53 ` [PATCH] softlockup: fix problem with long kernel pauses fromkgdb Jason Wessel 2009-08-04 14:59 ` Ingo Molnar 2009-08-21 12:42 ` [PATCH] softlockup: fix problem with long kernel pauses from kgdb DDD 2009-09-26 3:01 ` Yong Zhang
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).