Date: Thu, 12 Oct 2017 12:16:58 -0400
From: Steven Rostedt
To: LKML
Cc: Petr Mladek, Sergey Senozhatsky, Peter Zijlstra, Andrew Morton, Linus Torvalds, Thomas Gleixner, Ingo Molnar
Subject: NMI watchdog dump does not print on hard lockup
Message-ID: <20171012121658.187c5af6@gandalf.local.home>

While preparing my presentation for ELC and OSS in Prague in a couple of
weeks, I noticed an issue with the printk_safe logic. I then wrote code
to see if my fears were justified.

I noticed that the NMI printks now depend on an irq_work triggering to
flush out the data that was written by printks during the NMI. But if
the irq_work can't trigger, nothing will get out to the screen.

To test this, I first added this:

	raw_spin_lock(&global_trace.start_lock);
	raw_spin_lock(&global_trace.start_lock);
	raw_spin_unlock(&global_trace.start_lock);
	raw_spin_unlock(&global_trace.start_lock);

to the write function of /sys/kernel/debug/tracing/free_buffer

That way I could trigger a lockup (in this case a soft lockup) when I
wanted to:

 # echo 1 > /sys/kernel/debug/tracing/free_buffer

Sure enough, within a minute of doing this, the soft lockup warning
triggered.

Then I changed it to:

	raw_spin_lock_irq(&global_trace.start_lock);
	raw_spin_lock(&global_trace.start_lock);
	raw_spin_unlock(&global_trace.start_lock);
	raw_spin_unlock_irq(&global_trace.start_lock);

And to my surprise, the hard lockup warning triggered. But then I
noticed that the lockup was detected from another CPU. So I changed this
to:

static void lock_up_cpu(void *data)
{
	unsigned long flags;
	raw_spin_lock_irqsave(&global_trace.start_lock, flags);
	raw_spin_lock(&global_trace.start_lock);
	raw_spin_unlock(&global_trace.start_lock);
	raw_spin_unlock_irqrestore(&global_trace.start_lock, flags);
}

[..]

	on_each_cpu(lock_up_cpu, NULL, 1);

This too triggered the warning. But I noticed that the calling function
didn't hard lock up. (Not all CPUs were hard locked.) Finally I did:

	on_each_cpu(lock_up_cpu, NULL, 0);
	lock_up_cpu(tr);

And boom! It locked up (lockdep was enabled, so I could see it showing
the deadlock), but then it stopped there. No output. The NMI watchdog
will only detect hard lockups if there is at least one CPU that is still
active. This could be an issue on non-SMP boxes.

We need a way to have NMI flush to consoles when a lockup is detected,
and not depend on an irq_work to do so.

I'll update my presentation to discuss this flaw ;-)

-- Steve
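
The dependency described above (NMI output sits in a buffer until an
irq_work flushes it) can be illustrated with a small userspace model.
This is only a sketch under stated assumptions, not kernel code: the
struct, the model_* function names and the message text are all
hypothetical and exist purely for illustration. It models one CPU whose
deferred flush runs only if that CPU can still take interrupts.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Userspace model only; none of these names are kernel APIs. */
struct model_cpu {
	bool irqs_enabled;	/* false models a CPU spinning with IRQs off */
	bool flush_pending;	/* models the queued irq_work */
	char nmi_buf[128];	/* models the buffer filled from NMI context */
	size_t nmi_len;
	char console[128];	/* models what actually reaches the console */
	size_t console_len;
};

/* NMI context: store the message and request a deferred flush. */
static void model_nmi_printk(struct model_cpu *cpu, const char *msg)
{
	size_t len = strlen(msg);
	size_t room = sizeof(cpu->nmi_buf) - cpu->nmi_len;

	if (len > room)
		len = room;	/* truncate instead of overflowing */
	memcpy(cpu->nmi_buf + cpu->nmi_len, msg, len);
	cpu->nmi_len += len;
	cpu->flush_pending = true;
}

/* Deferred flush: does nothing unless the CPU can take interrupts. */
static void model_run_irq_work(struct model_cpu *cpu)
{
	size_t len = cpu->nmi_len;
	size_t room = sizeof(cpu->console) - cpu->console_len;

	if (!cpu->irqs_enabled || !cpu->flush_pending)
		return;
	if (len > room)
		len = room;
	memcpy(cpu->console + cpu->console_len, cpu->nmi_buf, len);
	cpu->console_len += len;
	cpu->nmi_len = 0;
	cpu->flush_pending = false;
}

/*
 * Model one watchdog report on a single CPU and return the number of
 * bytes that reached the console.
 */
static size_t model_watchdog_report(bool irqs_enabled)
{
	struct model_cpu cpu;

	memset(&cpu, 0, sizeof(cpu));
	cpu.irqs_enabled = irqs_enabled;
	model_nmi_printk(&cpu, "hard lockup detected\n");
	model_run_irq_work(&cpu);
	return cpu.console_len;
}
```

In this model, model_watchdog_report(true) returns the full length of
the message, while model_watchdog_report(false) returns 0: the report is
recorded in the buffer but never reaches the console, which matches the
"no output" behavior seen when every CPU is stuck with interrupts
disabled.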