From mboxrd@z Thu Jan 1 00:00:00 1970
From: Cristian Marussi
To: linux-kernel@vger.kernel.org
Cc: linux-arch@vger.kernel.org, mark.rutland@arm.com, peterz@infradead.org,
	catalin.marinas@arm.com, takahiro.akashi@linaro.org,
	james.morse@arm.com, hidehiro.kawai.ez@hitachi.com,
	tglx@linutronix.de, will@kernel.org,
	linux-arm-kernel@lists.infradead.org, mingo@redhat.com,
	x86@kernel.org, dzickus@redhat.com, ehabkost@redhat.com,
	linux@armlinux.org.uk, davem@davemloft.net,
	sparclinux@vger.kernel.org, hch@infradead.org
Subject: [RFC PATCH v3 03/12] smp: coordinate concurrent crash/smp stop calls
Date: Thu, 19 Dec 2019 12:18:56 +0000
Message-Id: <20191219121905.26905-4-cristian.marussi@arm.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20191219121905.26905-1-cristian.marussi@arm.com>
References: <20191219121905.26905-1-cristian.marussi@arm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Once a stop request is in progress on one CPU, care is needed in deciding
what to do if another CPU issues a stop request concurrently.

Given that the panic and crash dump code flows are mutually exclusive, the
only other scenario that can lead to concurrent stop requests is a
simultaneous halt/reboot/shutdown.

In such a case, prioritize the panic/crash procedure and, most importantly,
immediately park the offending CPU that was attempting the concurrent stop
request: force it to idle quietly, waiting for the ongoing stop/dump
requests to arrive.

Failing to do so would result in the offending CPU being effectively lost
on the next reboot triggered by the crash dump. [1]

Another scenario in which the crash/stop code path was known to be executed
twice is a normal panic/crash with crash_kexec_post_notifiers=1; since the
issue is similar, fold the handling of this case into the new logic as
well.

[1]
<<<<<---------- TRIGGERED PANIC
[ 225.676014] ------------[ cut here ]------------
[ 225.676656] kernel BUG at arch/arm64/kernel/cpufeature.c:852!
[ 225.677253] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 225.677660] Modules linked in:
[ 225.678458] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Not tainted 5.3.0-rc5-00004-gb8a210cd3c32-dirty #2
[ 225.678621] Hardware name: Foundation-v8A (DT)
[ 225.678868] pstate: 000001c5 (nzcv dAIF -PAN -UAO)
[ 225.679523] pc : has_cpuid_feature+0x35c/0x360
[ 225.679649] lr : verify_local_elf_hwcaps+0x6c/0xf0
[ 225.679759] sp : ffff0000118cbf60
[ 225.679855] x29: ffff0000118cbf60 x28: 0000000000000000
[ 225.680026] x27: 0000000000000000 x26: 0000000000000000
[ 225.680115] x25: ffff00001167a010 x24: ffff0000112f59f8
[ 225.680207] x23: 0000000000000000 x22: 0000000000000000
[ 225.680290] x21: ffff0000112ea018 x20: ffff000010fe5538
[ 225.680372] x19: ffff000010ba3f30 x18: 000000000000001e
[ 225.680465] x17: 0000000000000000 x16: 0000000000000000
[ 225.680546] x15: 0000000000000000 x14: 0000000000000008
[ 225.680629] x13: 0209018b7a9404f4 x12: 0000000000000001
[ 225.680719] x11: 0000000000000080 x10: 00400032b5503510
[ 225.680803] x9 : 0000000000000000 x8 : ffff000010b93204
[ 225.680884] x7 : 00000000800001d8 x6 : 0000000000000005
[ 225.680975] x5 : 0000000000000000 x4 : 0000000000000000
[ 225.681056] x3 : 0000000000000000 x2 : 0000000000008000
[ 225.681139] x1 : 0000000000180480 x0 : 0000000000180480
[ 225.681423] Call trace:
[ 225.681669] has_cpuid_feature+0x35c/0x360
[ 225.681762] verify_local_elf_hwcaps+0x6c/0xf0
[ 225.681836] check_local_cpu_capabilities+0x88/0x118
[ 225.681939] secondary_start_kernel+0xc4/0x168
[ 225.682432] Code: d53801e0 17ffff58 d5380600 17ffff56 (d4210000)
[ 225.683998] smp: stopping secondary CPUs
[ 225.684130] Delaying stop.... <<<<<------ INSTRUMENTED DEBUG_DELAY
Rebooting. <<<<<------ MANUAL SIMULTANEOUS REBOOT
[ 232.647472] reboot: Restarting system
[ 232.648363] Reboot failed -- System halted
[ 239.552413] smp: failed to stop secondary CPUs 0
[ 239.554730] Starting crashdump kernel...
[ 239.555194] ------------[ cut here ]------------
[ 239.555406] Some CPUs may be stale, kdump will be unreliable.
[ 239.555802] WARNING: CPU: 3 PID: 0 at arch/arm64/kernel/machine_kexec.c:156 machine_kexec+0x3c/0x2b0
[ 239.556044] Modules linked in:
[ 239.556244] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Not tainted 5.3.0-rc5-00004-gb8a210cd3c32-dirty #2
[ 239.556340] Hardware name: Foundation-v8A (DT)
[ 239.556459] pstate: 600003c5 (nZCv DAIF -PAN -UAO)
[ 239.556587] pc : machine_kexec+0x3c/0x2b0
[ 239.556700] lr : machine_kexec+0x3c/0x2b0
[ 239.556775] sp : ffff0000118cbad0
[ 239.556876] x29: ffff0000118cbad0 x28: ffff80087a8c3700
[ 239.557012] x27: 0000000000000000 x26: 0000000000000000
[ 239.557278] x25: ffff000010fe3ef0 x24: 00000000000003c0
....
[ 239.561568] Bye!
[ 0.000000] Booting Linux on physical CPU 0x0000000003 [0x410fd0f0]
[ 0.000000] Linux version 5.2.0-rc4-00001-g93bd4bc234d5-dirty
[ 0.000000] Machine model: Foundation-v8A
...
[ 0.197991] smp: Bringing up secondary CPUs ...
[ 0.232643] psci: failed to boot CPU1 (-22) <<<<--- LOST CPU ON REBOOT
[ 0.232861] CPU1: failed to boot: -22
[ 0.306291] Detected PIPT I-cache on CPU2
[ 0.310524] GICv3: CPU2: found redistributor 1 region 0:0x000000002f120000
[ 0.315618] CPU2: Booted secondary processor 0x0000000001 [0x410fd0f0]
[ 0.395576] Detected PIPT I-cache on CPU3
[ 0.400431] GICv3: CPU3: found redistributor 2 region 0:0x000000002f140000
[ 0.407252] CPU3: Booted secondary processor 0x0000000002 [0x410fd0f0]
[ 0.431811] smp: Brought up 1 node, 3 CPUs
[ 0.439898] SMP: Total of 3 processors activated.

Signed-off-by: Cristian Marussi
---
v2 --> v3
- local var renamed
v1 --> v2
- using new CONFIG_USE_COMMON_SMP_STOP
---
 include/linux/smp.h |  3 +++
 kernel/smp.c        | 48 ++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 46 insertions(+), 5 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 1886e49a65bb..42be03ac1c0c 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -136,6 +136,9 @@ extern void arch_smp_crash_call(cpumask_t *cpus);
 
 /* Helper to query the outcome of an ongoing crash_stop operation */
 bool smp_crash_stop_failed(void);
+
+/* A generic cpu parking helper, possibly overridden by architecture code */
+void arch_smp_cpu_park(void) __noreturn;
 #endif
 
 /*
diff --git a/kernel/smp.c b/kernel/smp.c
index 6224b0b1208b..29eb6eff2063 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -844,6 +844,12 @@ bool smp_stop_get_wait_settings(unsigned long *timeout)
 	return atomic_read(&smp_stop_wait_forever);
 }
 
+void __weak arch_smp_cpu_park(void)
+{
+	while (1)
+		cpu_relax();
+}
+
 void __weak arch_smp_cpus_stop_complete(void) { }
 
 void __weak arch_smp_cpus_crash_complete(void)
@@ -866,6 +872,9 @@ void __weak arch_smp_crash_call(cpumask_t *cpus)
 	arch_smp_stop_call(cpus);
 }
 
+#define REASON_STOP 1
+#define REASON_CRASH 2
+
 /*
  * This centralizes the common logic to:
  *
@@ -881,8 +890,38 @@ static void __smp_send_stop_all(bool crash)
 {
 	unsigned int this_cpu_id;
 	cpumask_t mask;
+	static atomic_t stopping;
+	int was_stopping;
 
 	this_cpu_id = smp_processor_id();
+	/* make sure that concurrent stop requests are handled properly */
+	was_stopping = atomic_cmpxchg(&stopping, 0,
+				      crash ? REASON_CRASH : REASON_STOP);
+	if (was_stopping) {
+		/*
+		 * This function can be called twice in panic path if
+		 * crash_kexec is called with crash_kexec_post_notifiers=1,
+		 * but obviously we execute this only once.
+		 */
+		if (crash && was_stopping == REASON_CRASH)
+			return;
+		/*
+		 * In case of other concurrent STOP calls (like in a REBOOT
+		 * started simultaneously as an ongoing PANIC/CRASH/REBOOT)
+		 * we want to prioritize the ongoing PANIC/KEXEC call and
+		 * force here the offending CPU that was attempting the
+		 * concurrent STOP to just sit idle waiting to die.
+		 * Failing to do so would result in a lost CPU on the next
+		 * possible reboot triggered by crash_kexec().
+		 */
+		if (!crash) {
+			pr_crit("CPU%d - kernel already stopping, parking myself.\n",
+				this_cpu_id);
+			local_irq_enable();
+			/* does not return */
+			arch_smp_cpu_park();
+		}
+	}
 	if (any_other_cpus_online(&mask, this_cpu_id)) {
 		bool wait_forever;
 		unsigned long timeout;
@@ -950,6 +989,9 @@ bool __weak smp_crash_stop_failed(void)
  */
 void __weak crash_smp_send_stop(void)
 {
+#ifdef CONFIG_USE_COMMON_SMP_STOP
+	__smp_send_stop_all(true);
+#else
 	static int cpus_stopped;
 
 	/*
@@ -959,11 +1001,7 @@ void __weak crash_smp_send_stop(void)
 	if (cpus_stopped)
 		return;
 
-#ifdef CONFIG_USE_COMMON_SMP_STOP
-	__smp_send_stop_all(true);
-#else
 	smp_send_stop();
-#endif
-
 	cpus_stopped = 1;
+#endif
 }
-- 
2.17.1
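
For readers who want to experiment with the arbitration outside the kernel, the following is a minimal userspace sketch of the decision taken on the `stopping` word in `__smp_send_stop_all()`. It is not the patch code: it assumes C11 `<stdatomic.h>` in place of the kernel's `atomic_cmpxchg()`, and `stop_arbitrate()`, `enum stop_action` and `REASON_NONE` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; only REASON_STOP/REASON_CRASH come from the patch. */
#define REASON_NONE  0
#define REASON_STOP  1
#define REASON_CRASH 2

enum stop_action {
	STOP_PROCEED,	/* go on and stop the other CPUs */
	STOP_RETURN,	/* repeated crash call: nothing left to do */
	STOP_PARK,	/* late plain stop: the caller should park itself */
};

/*
 * Decide what a caller should do given the shared 'stopping' word.
 * Only the first caller manages to swap REASON_NONE for its own reason.
 */
static enum stop_action stop_arbitrate(atomic_int *stopping, bool crash)
{
	int expected = REASON_NONE;
	int reason = crash ? REASON_CRASH : REASON_STOP;

	assert(stopping != NULL);

	if (atomic_compare_exchange_strong(stopping, &expected, reason))
		return STOP_PROCEED;

	/* On failure 'expected' holds the reason recorded by the earlier caller. */
	if (crash && expected == REASON_CRASH)
		return STOP_RETURN;
	if (!crash)
		return STOP_PARK;

	/* A crash request arriving after a plain stop falls through and proceeds. */
	return STOP_PROCEED;
}
```

In this model a caller that receives `STOP_PARK` would be the one that ends up in `arch_smp_cpu_park()`, and a crash request that arrives after a plain stop still proceeds, which matches the fall-through in the hunk above.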