From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1757939AbcJHKYy (ORCPT ); Sat, 8 Oct 2016 06:24:54 -0400
Received: from mail-pa0-f68.google.com ([209.85.220.68]:36211 "EHLO
	mail-pa0-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1754693AbcJHKYq (ORCPT );
	Sat, 8 Oct 2016 06:24:46 -0400
From: Wanpeng Li
X-Google-Original-From: Wanpeng Li
To: linux-kernel@vger.kernel.org
Cc: Wanpeng Li, Ingo Molnar, Mike Galbraith, Peter Zijlstra, Thomas Gleixner
Subject: [PATCH] sched/fair: Fix dereference NULL sched domain during select_idle_sibling
Date: Sat, 8 Oct 2016 18:24:38 +0800
Message-Id: <1475922278-3306-1-git-send-email-wanpeng.li@hotmail.com>
X-Mailer: git-send-email 1.9.1
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

From: Wanpeng Li

Commit:

  10e2f1acd01 ("sched/core: Rewrite and improve select_idle_siblings()")

... improved select_idle_sibling() but also triggered a regression:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
  IP: [] select_idle_sibling+0x1c2/0x4f0
  PGD 0
  Oops: 0000 [#1] SMP
  CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.8.0+ #16
  RIP: 0010:[] [] select_idle_sibling+0x1c2/0x4f0
  Call Trace:
   select_task_rq_fair+0x749/0x930
   ? select_task_rq_fair+0xb4/0x930
   ? __lock_is_held+0x54/0x70
   try_to_wake_up+0x19a/0x5b0
   default_wake_function+0x12/0x20
   autoremove_wake_function+0x12/0x40
   __wake_up_common+0x55/0x90
   __wake_up+0x39/0x50
   wake_up_klogd_work_func+0x40/0x60
   irq_work_run_list+0x57/0x80
   irq_work_run+0x2c/0x30
   smp_irq_work_interrupt+0x2e/0x40
   irq_work_interrupt+0x96/0xa0
   ? _raw_spin_unlock_irqrestore+0x45/0x80
   try_to_wake_up+0x4a/0x5b0
   wake_up_state+0x10/0x20
   __kthread_unpark+0x67/0x70
   kthread_unpark+0x22/0x30
   cpuhp_online_idle+0x3e/0x70
   cpu_startup_entry+0x6a/0x450
   start_secondary+0x154/0x180

This can be reproduced by running the ftrace test case of kselftest: the
test case hot-unplugs a CPU, and that CPU is attached to the NULL
sched-domain during scheduler teardown. Step 2 of the rewritten
select_idle_siblings():

 | Step 2) tracks the average cost of the scan and compares this to the
 | average idle time guestimate for the CPU doing the wakeup.

If the CPU doing the wakeup is the one being hot-unplugged, the NULL
sched domain is dereferenced to read the average cost of the scan.

This patch fixes it by failing the search for an idle CPU in the LLC if
this sched domain is NULL.

Cc: Ingo Molnar
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Signed-off-by: Wanpeng Li
---
 kernel/sched/fair.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 543b2f2..03a6620 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5472,19 +5472,29 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
  */
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
-	struct sched_domain *this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
+	struct sched_domain *this_sd;
 	u64 avg_idle = this_rq()->avg_idle;
-	u64 avg_cost = this_sd->avg_scan_cost;
+	u64 avg_cost;
 	u64 time, cost;
 	s64 delta;
 	int cpu, wrap;
 
+	rcu_read_lock();
+	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
+	if (!this_sd) {
+		cpu = -1;
+		goto unlock;
+	}
+	avg_cost = this_sd->avg_scan_cost;
+
 	/*
 	 * Due to large variance we need a large fuzz factor; hackbench in
 	 * particularly is sensitive here.
 	 */
-	if ((avg_idle / 512) < avg_cost)
-		return -1;
+	if ((avg_idle / 512) < avg_cost) {
+		cpu = -1;
+		goto unlock;
+	}
 
 	time = local_clock();
 
@@ -5500,6 +5510,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	delta = (s64)(time - cost) / 8;
 	this_sd->avg_scan_cost += delta;
 
+unlock:
+	rcu_read_unlock();
 	return cpu;
 }
 
-- 
1.9.1