From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758213Ab0DPBrS (ORCPT ); Thu, 15 Apr 2010 21:47:18 -0400 Received: from mail-ww0-f46.google.com ([74.125.82.46]:50112 "EHLO mail-ww0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758185Ab0DPBrP (ORCPT ); Thu, 15 Apr 2010 21:47:15 -0400 DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; b=iBrsCeaOgt/ACoTrI0+vDzcw0dKtJnYj5YYCX98KzW0pKH31IN106Hl9SYWauqbPMc dyRVDoP5xHW34KwxmwTmVrjxftl4FpryQ0Zj+XFhMQB6Le6t+jtq2GPS4XBqnh7oLyZX fIoouJ8FXn1QbpWYLnIgZTYIKiRBWiPLKsl28= Date: Fri, 16 Apr 2010 03:47:14 +0200 From: Frederic Weisbecker To: Don Zickus Cc: mingo@elte.hu, peterz@infradead.org, gorcunov@gmail.com, aris@redhat.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH v2] [watchdog] combine nmi_watchdog and softlockup Message-ID: <20100416014712.GC15570@nowhere> References: <1271366710-17468-1-git-send-email-dzickus@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1271366710-17468-1-git-send-email-dzickus@redhat.com> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Apr 15, 2010 at 05:25:10PM -0400, Don Zickus wrote: > The new nmi_watchdog (which uses the perf event subsystem) is very > similar in structure to the softlockup detector. Using Ingo's suggestion, > I combined the two functionalities into one file, kernel/watchdog.c. > > Now both the nmi_watchdog (or hardlockup detector) and softlockup detector > sit on top of the perf event subsystem, which is run every 60 seconds or so > to see if there are any lockups. > > To detect hardlockups, cpus not responding to interrupts, I implemented an > hrtimer that runs 5 times for every perf event overflow event. If that stops > counting on a cpu, then the cpu is most likely in trouble. > > To detect softlockups, tasks not yielding to the scheduler, I used the > previous kthread idea that now gets kicked every time the hrtimer fires. > If the kthread isn't being scheduled neither is anyone else and the > warning is printed to the console. > > I tested this on x86_64 and both the softlockup and hardlockup paths work. > > V2: > - cleaned up the Kconfig and softlockup combination > - surrounded hardlockup cases with #ifdef CONFIG_PERF_EVENTS_NMI > - seperated out the softlockup case from perf event subsystem > - re-arranged the enabling/disabling nmi watchdog from proc space > - added cpumasks for hardlockup failure cases > - removed fallback to soft events if no PMU exists for hard events > > TODO: > - figure out how to make an arch-agnostic clock2cycles call (if possible) > to feed into perf events as a sample period In fact we also have the sample_freq thing that let you run against a frequency (events per sec) rather than a sample period. We do some calculations that recompute the actual sample period on top of this frequency in a regular base as the events come. It's rather unfortunate we can't have 0.xxxx frequencies but may be we can work that out by adding a freq_unscale field, which would basically produce the frequency like this: freq = event->attr.sample_freq * (10 ^ -event->attr.freq_unscale) This is not trivial though, so let's rather focus on the real matter for now :) > diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c > index e8b78a0..79425f9 100644 > --- a/arch/x86/kernel/apic/hw_nmi.c > +++ b/arch/x86/kernel/apic/hw_nmi.c > @@ -89,7 +89,7 @@ int hw_nmi_is_cpu_stuck(struct pt_regs *regs) > > u64 hw_nmi_get_sample_period(void) > { > - return cpu_khz * 1000; > + return (u64)(cpu_khz) * 1000 * 60; > } > > #ifdef ARCH_HAS_NMI_WATCHDOG > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 6f7bba9..83be6d7 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -338,6 +338,12 @@ extern int proc_dohung_task_timeout_secs(struct ctl_table *table, int write, > size_t *lenp, loff_t *ppos); > #endif > > +#ifdef CONFIG_NMI_WATCHDOG > +extern int proc_dowatchdog_thresh(struct ctl_table *table, int write, > + void __user *buffer, > + size_t *lenp, loff_t *ppos); > +#endif > + > /* Attach to any functions which should be ignored in wchan output. */ > #define __sched __attribute__((__section__(".sched.text"))) > > diff --git a/init/Kconfig b/init/Kconfig > index 7331a16..0b83612 100644 > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -948,6 +948,7 @@ config PERF_USE_VMALLOC > > config PERF_EVENTS_NMI > bool > + depends on PERF_EVENTS > help > Arch has support for nmi_watchdog That looks too general. It's more about the fact the arch supports cpu cycle events and generates NMIs on overflow. > diff --git a/kernel/Makefile b/kernel/Makefile > index 8a5abe5..56ba99d 100644 > --- a/kernel/Makefile > +++ b/kernel/Makefile > @@ -75,9 +75,11 @@ obj-$(CONFIG_GCOV_KERNEL) += gcov/ > obj-$(CONFIG_AUDIT_TREE) += audit_tree.o > obj-$(CONFIG_KPROBES) += kprobes.o > obj-$(CONFIG_KGDB) += kgdb.o > +ifneq ($(CONFIG_NMI_WATCHDOG),y) > obj-$(CONFIG_DETECT_SOFTLOCKUP) += softlockup.o > -obj-$(CONFIG_NMI_WATCHDOG) += nmi_watchdog.o > +endif I'm confused, do we have two versions of the softlockup detector now? You should drop the older one. > obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o > +obj-$(CONFIG_NMI_WATCHDOG) += watchdog.o > obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ > obj-$(CONFIG_SECCOMP) += seccomp.o > obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index ac72c9e..2165b22 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -704,6 +704,15 @@ static struct ctl_table kern_table[] = { > .mode = 0644, > .proc_handler = proc_nmi_enabled, > }, > + { > + .procname = "watchdog_thresh", > + .data = &softlockup_thresh, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = proc_dowatchdog_thresh, > + .extra1 = &neg_one, > + .extra2 = &sixty, > + }, > #endif > #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86) && !defined(CONFIG_NMI_WATCHDOG) > { > diff --git a/kernel/watchdog.c b/kernel/watchdog.c > new file mode 100644 > index 0000000..fad5b1a > --- /dev/null > +++ b/kernel/watchdog.c > @@ -0,0 +1,570 @@ > +/* > + * Detect Hard/Soft Lockups using the NMI Well, softlockups detection doesn't use NMI. > +/* > + * Returns seconds, approximately. We don't need nanosecond > + * resolution, and we don't need to waste time with a big divide when > + * 2^30ns == 1.074s. > + */ > +static unsigned long get_timestamp(int this_cpu) > +{ > + return cpu_clock(this_cpu) >> 30LL; /* 2^30 ~= 10^9 */ > +} > + > +static unsigned long get_sample_period(void) > +{ > + /* > + * convert softlockup_thresh from seconds to ns > + * the divide by 5 is to give hrtimer 5 chances to > + * increment before the hardlockup detector generates > + * a warning > + */ > + return softlockup_thresh / 5 * NSEC_PER_SEC; > +} > + > +/* Commands for resetting the watchdog */ > +static void __touch_watchdog(void) > +{ > + int this_cpu = raw_smp_processor_id(); > + > + __raw_get_cpu_var(watchdog_touch_ts) = get_timestamp(this_cpu); Please use __get_cpu_var() instead so that we keep the preempt disabled check from smp_processor_id() > +} > + > +void touch_watchdog(void) > +{ > + __raw_get_cpu_var(watchdog_touch_ts) = 0; Same here. > +} > +EXPORT_SYMBOL(touch_watchdog); > + > +void touch_all_watchdog(void) > +{ > + int cpu; > + > + for_each_online_cpu(cpu) > + per_cpu(watchdog_touch_ts, cpu) = 0; > +} > + > +/* deprecated functions */ > +void touch_nmi_watchdog(void) > +{ > + touch_watchdog(); > +} > +EXPORT_SYMBOL(touch_nmi_watchdog); > + > +void touch_all_nmi_watchdog(void) > +{ > + touch_all_watchdog(); > +} > + > +void touch_softlockup_watchdog(void) > +{ > + touch_watchdog(); > +} > + > +void touch_all_softlockup_watchdogs(void) > +{ > + touch_all_watchdog(); > +} > + > +void softlockup_tick(void) > +{ > +} > +/* end of deprecated functions */ You should replace the call sites directly. > +#ifdef CONFIG_PERF_EVENTS_NMI > +/* watchdog detector functions */ > +static int is_hardlockup(int cpu) > +{ > + unsigned long hrint = per_cpu(hrtimer_interrupts, cpu); > + > + if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint) > + return 1; > + > + per_cpu(hrtimer_interrupts_saved, cpu) = hrint; > + return 0; > +} > +#endif > + > +static int is_softlockup(unsigned long touch_ts, int cpu) > +{ > + unsigned long now = get_timestamp(cpu); > + > + /* Warn about unreasonable delays: */ > + if (now > (touch_ts + softlockup_thresh)) > + return now - touch_ts; > + > + return 0; > +} > + > +static int > +watchdog_panic(struct notifier_block *this, unsigned long event, void *ptr) > +{ > + did_panic = 1; > + > + return NOTIFY_DONE; > +} > + > +static struct notifier_block panic_block = { > + .notifier_call = watchdog_panic, > +}; > + > +#ifdef CONFIG_PERF_EVENTS_NMI > +struct perf_event_attr wd_hw_attr = { > + .type = PERF_TYPE_HARDWARE, > + .config = PERF_COUNT_HW_CPU_CYCLES, > + .size = sizeof(struct perf_event_attr), > + .pinned = 1, > + .disabled = 1, > +}; > + > +struct perf_event_attr wd_sw_attr = { > + .type = PERF_TYPE_SOFTWARE, > + .config = PERF_COUNT_SW_CPU_CLOCK, > + .size = sizeof(struct perf_event_attr), > + .pinned = 1, > + .disabled = 1, > +}; Why do you keep the software clock, I don't see how it can be useful to detect hardlockups. It triggers in a regular irq not an NMI. > + > +/* Callback function for perf event subsystem */ > +void watchdog_overflow_callback(struct perf_event *event, int nmi, > + struct perf_sample_data *data, > + struct pt_regs *regs) > +{ > + int this_cpu = smp_processor_id(); > + unsigned long touch_ts = per_cpu(watchdog_touch_ts, this_cpu); > + > + if (touch_ts == 0) { > + __touch_watchdog(); > + return; > + } > + > + /* check for a hardlockup > + * This is done by making sure our timer interrupt > + * is incrementing. The timer interrupt should have > + * fired multiple times before we overflow'd. If it hasn't > + * then this is a good indication the cpu is stuck > + */ > + if (is_hardlockup(this_cpu)) { > + /* only print hardlockups once */ > + if (cpumask_test_cpu(this_cpu, to_cpumask(hardlockup_mask))) > + return; > + > + if (hardlockup_panic) > + panic("Watchdog detected hard LOCKUP on cpu %d", this_cpu); > + else > + WARN(1, "Watchdog detected hard LOCKUP on cpu %d", this_cpu); > + > + cpumask_set_cpu(this_cpu, to_cpumask(hardlockup_mask)); May be have an arch spin lock there to update your cpu mask safely. > + return; > + } > + > + cpumask_clear_cpu(this_cpu, to_cpumask(hardlockup_mask)); Hmm...this is probably not necessary. > + return; > +} > +#endif /* CONFIG_PERF_EVENTS_NMI */ > + > +/* watchdog kicker functions */ > +static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) > +{ > + int this_cpu = smp_processor_id(); > + unsigned long touch_ts = per_cpu(watchdog_touch_ts, this_cpu); > + struct pt_regs *regs = get_irq_regs(); > + int duration; > + > +#ifdef CONFIG_PERF_EVENTS_NMI > + /* kick the hardlockup detector */ > + __get_cpu_var(hrtimer_interrupts)++; > +#endif Please avoid such ifdefs in the middle of the code. It's better to gather hardlockup matters in a single ifdef block: #ifdef CONFIG_PERF_EVENTS_NMI define your functions here #else define (if needed) your off case functions here (static inline stubs) #endif > + > + /* kick the softlockup detector */ > + wake_up_process(__get_cpu_var(softlockup_watchdog)); > + > + /* .. and repeat */ > + hrtimer_forward_now(hrtimer, ns_to_ktime(get_sample_period())); > + > + if (touch_ts == 0) { > + __touch_watchdog(); > + return HRTIMER_RESTART; > + } > + > + /* check for a softlockup > + * This is done by making sure a high priority task is > + * being scheduled. The task touches the watchdog to > + * indicate it is getting cpu time. If it hasn't then > + * this is a good indication some task is hogging the cpu > + */ > + duration = is_softlockup(touch_ts, this_cpu); > + if (duration) { > + printk(KERN_ERR "BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n", > + this_cpu, duration, > + current->comm, task_pid_nr(current)); > + print_modules(); > + print_irqtrace_events(current); > + if (regs) > + show_regs(regs); > + else > + dump_stack(); > + > + if (softlockup_panic) > + panic("softlockup: hung tasks"); You probably want a backtrace cpu mask here as well (but better don't use the same than the hardlockup thing) > + } > + > + return HRTIMER_RESTART; > +} > + > + > +/* > + * The watchdog thread - touches the timestamp. > + */ > +static int watchdog(void *__bind_cpu) > +{ > + struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 }; > + struct hrtimer *hrtimer = &per_cpu(watchdog_hrtimer, (unsigned long)__bind_cpu); > + > + sched_setscheduler(current, SCHED_FIFO, ¶m); > + > + /* initialize timestamp */ > + __touch_watchdog(); > + > + /* kick off the timer for the hardlockup detector */ > + /* done here because hrtimer_start can only pin to smp_processor_id() */ > + hrtimer_start(hrtimer, ns_to_ktime(get_sample_period()), > + HRTIMER_MODE_REL_PINNED); > + > + set_current_state(TASK_INTERRUPTIBLE); > + /* > + * Run briefly once per second to reset the softlockup timestamp. > + * If this gets delayed for more than 60 seconds then the > + * debug-printout triggers in softlockup_tick(). > + */ > + while (!kthread_should_stop()) { > + __touch_watchdog(); > + schedule(); > + > + if (kthread_should_stop()) > + break; > + > + set_current_state(TASK_INTERRUPTIBLE); > + } > + __set_current_state(TASK_RUNNING); > + > + return 0; > +} > + > + > +/* prepare/enable/disable routines */ > +static int watchdog_prepare_cpu(int cpu) > +{ > + struct hrtimer *hrtimer = &per_cpu(watchdog_hrtimer, cpu); > + > + BUG_ON(per_cpu(softlockup_watchdog, cpu)); Please warn here instead, nothing seriously dangerous is going to happen. > + hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); > + hrtimer->function = watchdog_timer_fn; > + > + return 0; > +} > + > +static int watchdog_enable(int cpu) > +{ > + struct task_struct *p = per_cpu(softlockup_watchdog, cpu); > +#ifdef CONFIG_PERF_EVENTS_NMI > + struct perf_event_attr *wd_attr; > + struct perf_event *event = per_cpu(watchdog_ev, cpu); > + > + /* is it already setup and enabled? */ > + if (event && event->state > PERF_EVENT_STATE_OFF) > + goto out; > + > + /* it is setup but not enabled */ > + if (event != NULL) > + goto out_enable; > + > + /* Try to register using hardware perf events */ > + wd_attr = &wd_hw_attr; > + wd_attr->sample_period = hw_nmi_get_sample_period(); > + event = perf_event_create_kernel_counter(wd_attr, cpu, -1, watchdog_overflow_callback); > + if (!IS_ERR(event)) { > + printk(KERN_INFO "NMI watchdog enabled, takes one hw-pmu counter.\n"); > + goto out_save; > + } > + > + printk(KERN_ERR "NMI watchdog failed to create perf event on cpu%i: %p\n", cpu, event); > + return -1; > + > + /* success path */ > +out_save: > + per_cpu(watchdog_ev, cpu) = event; > +out_enable: > + perf_event_enable(per_cpu(watchdog_ev, cpu)); > +out: > +#endif /* CONFIG_PERF_EVENTS_NMI */ Especially such kind of idef in a function plus goto in the middle, we really want to avoid that. You want a watchdog_nmi_enable() call instead that does nothing if !CONFIG_PERF_EVENTS_NMI. > + > + /* create the watchdog thread */ > + if (!p) { > + p = kthread_create(watchdog, (void *)(unsigned long)cpu, "watchdog/%d", cpu); > + if (IS_ERR(p)) { > + printk(KERN_ERR "softlockup watchdog for %i failed\n", cpu); > + return -1; > + } > + kthread_bind(p, cpu); > + per_cpu(watchdog_touch_ts, cpu) = 0; > + per_cpu(softlockup_watchdog, cpu) = p; > + wake_up_process(p); > + } > + > + return 0; > +} > + > +static void watchdog_disable(int cpu) > +{ > + struct task_struct *p = per_cpu(softlockup_watchdog, cpu); > + struct hrtimer *hrtimer = &per_cpu(watchdog_hrtimer, cpu); > +#ifdef CONFIG_PERF_EVENTS_NMI > + struct perf_event *event = per_cpu(watchdog_ev, cpu); > +#endif > + > + /* > + * cancel the timer first to stop incrementing the stats > + * and waking up the kthread > + */ > + hrtimer_cancel(hrtimer); > + > +#ifdef CONFIG_PERF_EVENTS_NMI > + if (event) { > + perf_event_disable(event); > + per_cpu(watchdog_ev, cpu) = NULL; > + > + /* should be in cleanup, but blocks oprofile */ > + perf_event_release_kernel(event); > + } > +#endif > + > + if (p) { > + per_cpu(softlockup_watchdog, cpu) = NULL; > + kthread_stop(p); > + } > + > + /* if any cpu succeeds, watchdog is considered enabled for the system */ > + nmi_watchdog_enabled = 1; > +} > + > +static void watchdog_cleanup(int cpu) > +{ > + > +} > + > +static void watchdog_enable_all_cpus(void) > +{ > + int cpu; > + int result; > + > + for_each_online_cpu(cpu) > + result += watchdog_enable(cpu); > + > + if (result) > + printk(KERN_ERR "watchdog: failed to be enabled on some cpus\n"); > + > +} > + > +static void watchdog_disable_all_cpus(void) > +{ > + int cpu; > + > + for_each_online_cpu(cpu) > + watchdog_disable(cpu); > + > + /* if all watchdogs are disabled, then they are disabled for the system */ > + nmi_watchdog_enabled = 0; > +} > + > + > +/* sysctl functions */ > +#ifdef CONFIG_SYSCTL > +/* > + * proc handler for /proc/sys/kernel/nmi_watchdog > + */ > + > +int proc_nmi_enabled(struct ctl_table *table, int write, > + void __user *buffer, size_t *length, loff_t *ppos) > +{ > + touch_all_watchdog(); Why do you need to touch watchdogs here? > + proc_dointvec(table, write, buffer, length, ppos); > + > + if (nmi_watchdog_enabled) > + watchdog_enable_all_cpus(); > + else > + watchdog_disable_all_cpus(); > + return 0; > +} > + > +int proc_dowatchdog_thresh(struct ctl_table *table, int write, > + void __user *buffer, > + size_t *lenp, loff_t *ppos) > +{ > + touch_all_watchdog(); Same here? > + return proc_dointvec_minmax(table, write, buffer, lenp, ppos); > +} > + > +/* stub functions */ > +int proc_dosoftlockup_thresh(struct ctl_table *table, int write, > + void __user *buffer, > + size_t *lenp, loff_t *ppos) > +{ > + return proc_dowatchdog_thresh(table, write, buffer, lenp, ppos); > +} > +/* end of stub functions */ > +#endif /* CONFIG_SYSCTL */ > + > + > +/* > + * Create/destroy watchdog threads as CPUs come and go: > + */ > +static int __cpuinit > +cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu) > +{ > + int hotcpu = (unsigned long)hcpu; > + > + switch (action) { > + case CPU_UP_PREPARE: > + case CPU_UP_PREPARE_FROZEN: > + if (watchdog_prepare_cpu(hotcpu)) > + return NOTIFY_BAD; > + break; > + case CPU_ONLINE: > + case CPU_ONLINE_FROZEN: > + if (watchdog_enable(hotcpu)) > + return NOTIFY_BAD; > + break; > +#ifdef CONFIG_HOTPLUG_CPU > + case CPU_UP_CANCELED: > + case CPU_UP_CANCELED_FROZEN: > + watchdog_disable(hotcpu); > + break; > + case CPU_DEAD: > + case CPU_DEAD_FROZEN: > + watchdog_disable(hotcpu); > + watchdog_cleanup(hotcpu); > + break; > +#endif /* CONFIG_HOTPLUG_CPU */ > + } > + return NOTIFY_OK; > +} > + > +static struct notifier_block __cpuinitdata cpu_nfb = { > + .notifier_call = cpu_callback > +}; > + > +static int __init spawn_watchdog_task(void) > +{ > + void *cpu = (void *)(long)smp_processor_id(); > + int err; > + > + if (no_watchdog) > + return 0; > + > + err = cpu_callback(&cpu_nfb, CPU_UP_PREPARE, cpu); > + if (err == NOTIFY_BAD) { > + BUG(); Please warn instead, there is nothing fatal here. > + return 1; > + } > + cpu_callback(&cpu_nfb, CPU_ONLINE, cpu); > + register_cpu_notifier(&cpu_nfb); > + > + atomic_notifier_chain_register(&panic_notifier_list, &panic_block); > + > + return 0; > +} > +early_initcall(spawn_watchdog_task); > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug > index e2e73cc..280794a 100644 > --- a/lib/Kconfig.debug > +++ b/lib/Kconfig.debug > @@ -171,15 +171,24 @@ config DETECT_SOFTLOCKUP > support it.) > > config NMI_WATCHDOG > - bool "Detect Hard Lockups with an NMI Watchdog" > - depends on DEBUG_KERNEL && PERF_EVENTS && PERF_EVENTS_NMI > + bool "Detect Hard and Soft Lockups" > + depends on DEBUG_KERNEL > + default DETECT_SOFTLOCKUP I'd suggest to drop the NMI prefix, this is not anymore about detecting hard lockups only. May be config WATCHDOG or config LOCKUP_DETECTOR if it's taken already. Also you should half-drop the DETECT_SOFTLOCKUP thing: keep it's definition but drop the ability to choose it from the prompt: config DETECT_SOFTLOCKUP bool depends on DEBUG_KERNEL && !S390 default y This way we keep it for compatibility with def_configs, it will enable the WATCHDOG by default if it is "y", we can schedule its removal later. Thanks.