From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-9.8 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7D491C4321A for ; Fri, 28 Jun 2019 12:29:06 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 525F72083B for ; Fri, 28 Jun 2019 12:29:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726711AbfF1M3E (ORCPT ); Fri, 28 Jun 2019 08:29:04 -0400 Received: from mail-wr1-f67.google.com ([209.85.221.67]:33199 "EHLO mail-wr1-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726587AbfF1M3E (ORCPT ); Fri, 28 Jun 2019 08:29:04 -0400 Received: by mail-wr1-f67.google.com with SMTP id n9so6140989wru.0 for ; Fri, 28 Jun 2019 05:29:02 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:mime-version :content-transfer-encoding; bh=aG/OfFQqUs+39UCm0pAGeKeTGmu26+WZWtFUm3Tny98=; b=QavqLRiMffrc+JxePleJYOmtM0q5At4sLJ+FmrHuGv4mfagdpzMBw0I0XSHX3IlBxR qiqp6ynZ/nV9svxlFgNbmpYukpVtQdTrD+frIBEcClFghfhfrr8a6J5oKs6mA/+TM2nQ DIBEOph2O1v4MpIeWVIVRGvKJeuVYnl+qcxRy40zmh1E7GnEq6gjwGxfS1LHbZJh7s4I vSRh9eivcNNSRPUiwSL1UYou6x14NX+0V9XGg4DHAVzuu9ZKaVEyWgjm//K53ObgEUQO xD0iF6WLG3s737arsc0kIX03STEJARYpP7RiRG8y06W0CQ4W6clK7FlbAqL3GI4BNKBh r3sg== X-Gm-Message-State: APjAAAVDRKKkijrJ7NgaON5bxE9zUwVPRuNb6h6XW+R/QoRz5B7Bxjpd THnhbgqpY8t1Z/4sr0NeFiu8jJEtbOtRtg== X-Google-Smtp-Source: APXvYqwYNe0Yo5cDbRJ3pYz7dPo/PjBpERriYEki/0ce1JPSpBxRwrUrfr0hunfz/YfcG8Ttg9LdAA== X-Received: by 2002:a05:6000:112:: with SMTP id o18mr8247045wrx.153.1561724941516; Fri, 28 Jun 2019 05:29:01 -0700 (PDT) Received: from dhcp-1-158.brq.redhat.com (nat-pool-brq-t.redhat.com. [213.175.37.10]) by smtp.googlemail.com with ESMTPSA id t1sm3241473wra.74.2019.06.28.05.29.00 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Fri, 28 Jun 2019 05:29:01 -0700 (PDT) From: Grzegorz Halat To: linux-kernel@vger.kernel.org Cc: Thomas Gleixner , Ingo Molnar , Borislav Petkov , "H . Peter Anvin" , x86@kernel.org, Don Zickus , Oleksandr Natalenko , Grzegorz Halat Subject: [PATCH RFC] x86/reboot: Don't wait infinitely for IPI completion during reboot Date: Fri, 28 Jun 2019 14:28:13 +0200 Message-Id: <20190628122813.15500-1-ghalat@redhat.com> X-Mailer: git-send-email 2.21.0 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, This patch is trying to address a deadlock that may happen during a reboot. CPU0 can stop a CPU which is holding a lock and there may be another CPU with disabled interrupts waiting for the same lock. In this situation, the CPU waiting for the lock can't be stopped and CPU0 will be indefinitely waiting for IPI completion. To avoid indefinitely looping kernel should send NMI after timeout. It seems to me that this was the original intention of commit 7d007d21e539 ("x86/reboot: Use NMI to assist in shutting down if IRQ fails") but the removal of 'wait' in the first loop has been forgotten. I'm sending the patch as RFC, maybe there is a better way to handle this situation. Another thing which wonders me is pr_emerg() after IPI. Is this safe? I think that IPI can shut down a CPU that is holding a lock used by pr_emerg(), so there may be a deadlock. CPU0 is stuck in native_stop_other_cpus() waiting for shutdown of CPU2: #6 [ffffb1a7826b7dd0] delay_tsc at ffffffffb640f0b4 #7 [ffffb1a7826b7dd0] native_stop_other_cpus at ffffffffb5c47ffa #8 [ffffb1a7826b7df0] native_machine_shutdown at ffffffffb5c47132 #9 [ffffb1a7826b7df8] native_machine_restart at ffffffffb5c47649 #10 [ffffb1a7826b7e00] __do_sys_reboot at ffffffffb5ccff62 #11 [ffffb1a7826b7f38] do_syscall_64 at ffffffffb5c0424b The loop, in case of reboot() syscall wait==1: File: linux/arch/x86/kernel/smp.c 218: timeout = USEC_PER_SEC; 219: while (num_online_cpus() > 1 && (wait || timeout--)) 220: udelay(1); CPU1 has been shutdown while holding tasklist_lock: #0 [ffff984418083fb8] stop_this_cpu at ffffffffb5c28495 << CPU stopped #1 [ffff984418083fc0] smp_reboot_interrupt at ffffffffb5c48169 #2 [ffff984418083ff0] reboot_interrupt at ffffffffb6600bbf --- --- #3 [ffffb1a782417be8] reboot_interrupt at ffffffffb6600bbf [exception RIP: delay_tsc+32] #4 [ffffb1a782417c98] probe_12098 at ffffffffc0813fab #5 [ffffb1a782417cb0] do_wait at ffffffffb5cae5b4 #6 [ffffb1a782417d08] optimized_callback at ffffffffb5c56be3 #7 [ffffb1a782417d98] do_wait at ffffffffb5cae5b3 << read_lock() here #8 [ffffb1a782417df0] kernel_wait4 at ffffffffb5cafa4e #9 [ffffb1a782417e78] __do_sys_wait4 at ffffffffb5cafb73 #10 [ffffb1a782417f38] do_syscall_64 at ffffffffb5c0424b Note: probe_12098 is SystemTap probe which injects delay. I'm using SystemTap to reliably reproduce the issue CPU2 is trying to grab tasklist_lock with disabled interrupts: RFLAGS: 00000006 #5 [ffffb1a782397e78] queued_write_lock_slowpath at ffffffffb5d0052e #6 [ffffb1a782397e88] do_exit at ffffffffb5caf08e #7 [ffffb1a782397f08] do_group_exit at ffffffffb5caf8da #8 [ffffb1a782397f30] __x64_sys_exit_group at ffffffffb5caf954 #9 [ffffb1a782397f38] do_syscall_64 at ffffffffb5c0424b Signed-off-by: Grzegorz Halat --- arch/x86/kernel/smp.c | 28 ++++++++++------------------ 1 file changed, 10 insertions(+), 18 deletions(-) diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c index 4693e2f3a03e..9186e432b8d6 100644 --- a/arch/x86/kernel/smp.c +++ b/arch/x86/kernel/smp.c @@ -211,30 +211,23 @@ static void native_stop_other_cpus(int wait) apic->send_IPI_allbutself(REBOOT_VECTOR); - /* - * Don't wait longer than a second if the caller - * didn't ask us to wait. - */ + /* Don't wait longer than a second for IPI completion */ timeout = USEC_PER_SEC; - while (num_online_cpus() > 1 && (wait || timeout--)) + while (num_online_cpus() > 1 && timeout--) udelay(1); } /* if the REBOOT_VECTOR didn't work, try with the NMI */ - if ((num_online_cpus() > 1) && (!smp_no_nmi_ipi)) { - if (register_nmi_handler(NMI_LOCAL, smp_stop_nmi_callback, - NMI_FLAG_FIRST, "smp_stop")) - /* Note: we ignore failures here */ - /* Hope the REBOOT_IRQ is good enough */ - goto finish; - - /* sync above data before sending IRQ */ - wmb(); - - pr_emerg("Shutting down cpus with NMI\n"); + if (num_online_cpus() > 1) { + if (!smp_no_nmi_ipi && !register_nmi_handler(NMI_LOCAL, + smp_stop_nmi_callback, NMI_FLAG_FIRST, "smp_stop")){ + /* sync above data before sending IRQ */ + wmb(); - apic->send_IPI_allbutself(NMI_VECTOR); + pr_emerg("Shutting down cpus with NMI\n"); + apic->send_IPI_allbutself(NMI_VECTOR); + } /* * Don't wait longer than a 10 ms if the caller * didn't ask us to wait. @@ -244,7 +237,6 @@ static void native_stop_other_cpus(int wait) udelay(1); } -finish: local_irq_save(flags); disable_local_APIC(); mcheck_cpu_clear(this_cpu_ptr(&cpu_info)); -- 2.18.1