From: Josh Poimboeuf <jpoimboe@redhat.com>
To: Ingo Molnar, Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Mel Gorman, Matt Fleming, Srikar Dronamraju
Subject: [PATCH 2/2] sched/debug: fix deadlock when enabling sched events
Date: Tue, 7 Jun 2016 14:43:17 -0500
Message-Id: <50a0f7f2a5f8d69e0440c3430f06be7ccbe59903.1465322027.git.jpoimboe@redhat.com>

When enabling sched trace events via:

  echo 1 > /sys/kernel/debug/tracing/events/sched/enable

I see a hang, with the following BUG in the printk buffer:

BUG: spinlock recursion on CPU#1, swapper/1/0
 lock: 0xffff88007d5d8c00, .magic: dead4ead, .owner: swapper/1/0, .owner_cpu: 1
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc2+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
 0000000000000086 3a8e170cf69ad23f ffff88007d403898 ffffffff8143d663
 ffff88007c535480 ffff88007d5d8c00 ffff88007d4038b8 ffffffff81115948
 ffff88007d5d8c00 ffff88007d5d8c00 ffff88007d4038e8 ffffffff81115aea
Call Trace:
 [] dump_stack+0x85/0xc2
 [] spin_dump+0x78/0xc0
 [] do_raw_spin_lock+0x11a/0x150
 [] _raw_spin_lock+0x61/0x80
 [] ? try_to_wake_up+0x256/0x4e0
 [] try_to_wake_up+0x256/0x4e0
 [] ? _raw_spin_unlock_irqrestore+0x4a/0x80
 [] wake_up_process+0x15/0x20
 [] insert_work+0x84/0xc0
 [] __queue_work+0x18f/0x660
 [] queue_work_on+0x46/0x90
 [] drm_fb_helper_dirty.isra.11+0xcb/0xe0 [drm_kms_helper]
 [] drm_fb_helper_sys_imageblit+0x30/0x40 [drm_kms_helper]
 [] soft_cursor+0x1ad/0x230
 [] bit_cursor+0x649/0x680
 [] ? update_attr.isra.2+0x90/0x90
 [] fbcon_cursor+0x14a/0x1c0
 [] hide_cursor+0x28/0x90
 [] vt_console_print+0x3bf/0x3f0
 [] call_console_drivers.constprop.24+0x183/0x200
 [] console_unlock+0x3d4/0x610
 [] vprintk_emit+0x3c5/0x610
 [] vprintk_default+0x29/0x40
 [] printk+0x57/0x73
 [] enqueue_entity+0xc2e/0xc70
 [] enqueue_task_fair+0x59/0xab0
 [] ? kvm_sched_clock_read+0x9/0x20
 [] ? sched_clock+0x9/0x10
 [] activate_task+0x5c/0xa0
 [] ttwu_do_activate+0x54/0xb0
 [] sched_ttwu_pending+0x7a/0xb0
 [] scheduler_ipi+0x61/0x170
 [] smp_trace_reschedule_interrupt+0x4f/0x2a0
 [] trace_reschedule_interrupt+0x96/0xa0
 [] ? native_safe_halt+0x6/0x10
 [] ? trace_hardirqs_on+0xd/0x10
 [] default_idle+0x20/0x1a0
 [] arch_cpu_idle+0xf/0x20
 [] default_idle_call+0x2f/0x50
 [] cpu_startup_entry+0x37e/0x450
 [] start_secondary+0x160/0x1a0

The bug is a deadlock: the task ends up trying to take the same rq lock
twice, because printk() isn't safe to call from scheduler code. Here the
wakeup path already holds the rq lock when enqueue_entity() calls printk(),
and with an fbcon/DRM console the console output path queues work and wakes
a kworker, which tries to take that same rq lock again.
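Distilling the trace, the recursion looks roughly like this (a simplified
call chain, not compilable code; frame names are taken from the trace above):

  enqueue_entity()                  /* rq lock already held */
    pr_warn_once()                  /* via the inlined check_schedstat_required() */
      printk()
        console_unlock()
          vt_console_print()        /* fbcon console driver */
            fbcon_cursor()
              drm_fb_helper_dirty()
                queue_work_on()
                  wake_up_process() /* wake the worker */
                    try_to_wake_up()
                      _raw_spin_lock()  /* same rq lock -> spinlock recursion */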
The offending printk() is the pr_warn_once() in check_schedstat_required(),
which warns the user that their 'trace_sched_stat_*' tracepoints aren't going
to work because schedstats is disabled:

	/* Force schedstat enabled if a dependent tracepoint is active */
	if (trace_sched_stat_wait_enabled()    ||
	    trace_sched_stat_sleep_enabled()   ||
	    trace_sched_stat_iowait_enabled()  ||
	    trace_sched_stat_blocked_enabled() ||
	    trace_sched_stat_runtime_enabled()) {
		pr_warn_once("Scheduler tracepoints stat_sleep, stat_iowait, "
			     "stat_blocked and stat_runtime require the "
			     "kernel parameter schedstats=enabled or "
			     "kernel.sched_schedstats=1\n");
	}

In addition to the deadlock, I think this code has other issues:

1. Instead of just warning and leaving the tracepoints broken, I'd argue it
   would be better to make them work by forcing schedstats enabled and
   warning about *that*, which is what's already done in other similar cases
   (latencytop and profiling). Otherwise things like "perf sched record"
   won't have the intended effect. In fact, the comment in the above code
   snippet seems to agree, so maybe that was the original intent.

2. It's called in the scheduler hot path, from enqueue_entity().

So change the warning to a call to force_schedstat_enabled(), and do it only
when a tracepoint gets enabled. (A quick way to verify the new behavior is
sketched below, after the patch.)

Fixes: cb2517653fcc ("sched/debug: Make schedstats a runtime tunable that is disabled by default")
Cc: stable@vger.kernel.org
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 include/trace/events/sched.h | 30 +++++++++++++++---------------
 kernel/sched/fair.c          | 29 +++++++----------------------
 2 files changed, 22 insertions(+), 37 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 9b90c57..ce12e93 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -341,37 +341,36 @@ DECLARE_EVENT_CLASS(sched_stat_template,
 	(unsigned long long)__entry->delay)
 );
 
+extern void trace_sched_stat_register(void);
+
+#define DEFINE_SCHED_STAT_EVENT(name)					\
+	DEFINE_EVENT_FN(sched_stat_template, name,			\
+		TP_PROTO(struct task_struct *tsk, u64 delay),		\
+		TP_ARGS(tsk, delay),					\
+		trace_sched_stat_register, NULL)
+
 /*
  * Tracepoint for accounting wait time (time the task is runnable
  * but not actually running due to scheduler contention).
  */
-DEFINE_EVENT(sched_stat_template, sched_stat_wait,
-	     TP_PROTO(struct task_struct *tsk, u64 delay),
-	     TP_ARGS(tsk, delay));
+DEFINE_SCHED_STAT_EVENT(sched_stat_wait);
 
 /*
  * Tracepoint for accounting sleep time (time the task is not runnable,
  * including iowait, see below).
  */
-DEFINE_EVENT(sched_stat_template, sched_stat_sleep,
-	     TP_PROTO(struct task_struct *tsk, u64 delay),
-	     TP_ARGS(tsk, delay));
+DEFINE_SCHED_STAT_EVENT(sched_stat_sleep);
 
 /*
  * Tracepoint for accounting iowait time (time the task is not runnable
  * due to waiting on IO to complete).
  */
-DEFINE_EVENT(sched_stat_template, sched_stat_iowait,
-	     TP_PROTO(struct task_struct *tsk, u64 delay),
-	     TP_ARGS(tsk, delay));
+DEFINE_SCHED_STAT_EVENT(sched_stat_iowait);
 
 /*
  * Tracepoint for accounting blocked time (time the task is in uninterruptible).
  */
-DEFINE_EVENT(sched_stat_template, sched_stat_blocked,
-	     TP_PROTO(struct task_struct *tsk, u64 delay),
-	     TP_ARGS(tsk, delay));
+DEFINE_SCHED_STAT_EVENT(sched_stat_blocked);
 
 /*
  * Tracepoint for accounting runtime (time the task is executing
@@ -403,9 +402,10 @@ DECLARE_EVENT_CLASS(sched_stat_runtime,
 	(unsigned long long)__entry->vruntime)
 );
 
-DEFINE_EVENT(sched_stat_runtime, sched_stat_runtime,
-	     TP_PROTO(struct task_struct *tsk, u64 runtime, u64 vruntime),
-	     TP_ARGS(tsk, runtime, vruntime));
+DEFINE_EVENT_FN(sched_stat_runtime, sched_stat_runtime,
+	     TP_PROTO(struct task_struct *tsk, u64 runtime, u64 vruntime),
+	     TP_ARGS(tsk, runtime, vruntime),
+	     trace_sched_stat_register, NULL);
 
 /*
  * Tracepoint for showing priority inheritance modifying a tasks
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c6dd8ba..b0cbcc0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -789,6 +789,13 @@ static void update_curr_fair(struct rq *rq)
 	update_curr(cfs_rq_of(&rq->curr->se));
 }
 
+void trace_sched_stat_register(void)
+{
+#ifdef CONFIG_SCHEDSTATS
+	force_schedstat_enabled();
+#endif
+}
+
 #ifdef CONFIG_SCHEDSTATS
 static inline void
 update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -3208,27 +3215,6 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
 
-static inline void check_schedstat_required(void)
-{
-#ifdef CONFIG_SCHEDSTATS
-	if (schedstat_enabled())
-		return;
-
-	/* Force schedstat enabled if a dependent tracepoint is active */
-	if (trace_sched_stat_wait_enabled() ||
-	    trace_sched_stat_sleep_enabled() ||
-	    trace_sched_stat_iowait_enabled() ||
-	    trace_sched_stat_blocked_enabled() ||
-	    trace_sched_stat_runtime_enabled()) {
-		pr_warn_once("Scheduler tracepoints stat_sleep, stat_iowait, "
-			     "stat_blocked and stat_runtime require the "
-			     "kernel parameter schedstats=enabled or "
-			     "kernel.sched_schedstats=1\n");
-	}
-#endif
-}
-
-
 /*
  * MIGRATION
  *
@@ -3293,7 +3279,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		enqueue_sleeper(cfs_rq, se);
 	}
 
-	check_schedstat_required();
 	if (schedstat_enabled()) {
 		update_stats_enqueue(cfs_rq, se);
 		check_spread(cfs_rq, se);
-- 
2.4.11
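A quick way to check the new behavior (a sketch, assuming a kernel with this
patch applied, CONFIG_SCHEDSTATS=y, debugfs mounted at /sys/kernel/debug, and
schedstats left at its default-off setting; the sysctl path comes from the
kernel.sched_schedstats name in the old warning text):

  # schedstats is disabled by default
  cat /proc/sys/kernel/sched_schedstats    # prints 0

  # enabling the sched_stat_* events should now force schedstats on
  # instead of just warning (and should no longer deadlock)
  echo 1 > /sys/kernel/debug/tracing/events/sched/enable
  cat /proc/sys/kernel/sched_schedstats    # now prints 1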