* [PATCH] lockdep: Avoid triggering hardlockup from debug_show_all_locks() @ 2018-01-22 22:00 Tejun Heo 2018-01-23 19:03 ` Rik van Riel 2018-01-24 10:38 ` [tip:locking/urgent] locking/lockdep: " tip-bot for Tejun Heo 0 siblings, 2 replies; 10+ messages in thread From: Tejun Heo @ 2018-01-22 22:00 UTC (permalink / raw) To: Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel, kernel-team debug_show_all_locks() iterates all tasks and print held locks whole holding tasklist_lock. This can take a while on a slow console device and may end up triggering NMI hardlockup detector if someone else ends up waiting for tasklist_lock. Touch the NMI watchdog while printing the held locks to avoid spuriously triggering the hardlockup detector. Signed-off-by: Tejun Heo <tj@kernel.org> --- kernel/locking/lockdep.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 5fa1324..5216590 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -49,6 +49,7 @@ #include <linux/gfp.h> #include <linux/random.h> #include <linux/jhash.h> +#include <linux/nmi.h> #include <asm/sections.h> @@ -4490,6 +4491,7 @@ void debug_show_all_locks(void) if (!unlock) if (read_trylock(&tasklist_lock)) unlock = 1; + touch_nmi_watchdog(); } while_each_thread(g, p); pr_warn("\n"); ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] lockdep: Avoid triggering hardlockup from debug_show_all_locks() 2018-01-22 22:00 [PATCH] lockdep: Avoid triggering hardlockup from debug_show_all_locks() Tejun Heo @ 2018-01-23 19:03 ` Rik van Riel 2018-01-23 20:57 ` Tejun Heo 2018-01-24 10:38 ` [tip:locking/urgent] locking/lockdep: " tip-bot for Tejun Heo 1 sibling, 1 reply; 10+ messages in thread From: Rik van Riel @ 2018-01-23 19:03 UTC (permalink / raw) To: Tejun Heo, Peter Zijlstra, Ingo Molnar; +Cc: linux-kernel, kernel-team On Mon, 2018-01-22 at 14:00 -0800, Tejun Heo wrote: > debug_show_all_locks() iterates all tasks and print held locks whole > holding tasklist_lock. This can take a while on a slow console > device > and may end up triggering NMI hardlockup detector if someone else > ends > up waiting for tasklist_lock. > > Touch the NMI watchdog while printing the held locks to avoid > spuriously triggering the hardlockup detector. > > Signed-off-by: Tejun Heo <tj@kernel.org> On this patch: Acked-by: Rik van Riel <riel@surriel.com> However, it seems like we run into things like this on a fairly regular (though not very frequent) basis. Would it make sense to go through the code and add sprinkle around a few more touch_nmi_watchdog() calls? After all, there are maybe a few dozen places where we print out a lot of debugging information. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] lockdep: Avoid triggering hardlockup from debug_show_all_locks() 2018-01-23 19:03 ` Rik van Riel @ 2018-01-23 20:57 ` Tejun Heo 2018-01-23 21:00 ` Steven Rostedt 0 siblings, 1 reply; 10+ messages in thread From: Tejun Heo @ 2018-01-23 20:57 UTC (permalink / raw) To: Rik van Riel Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, kernel-team, Steven Rostedt, Sergey Senozhatsky, Petr Mladek Hello, (cc'ing Steven, Sergey and Petr who are working on printk) On Tue, Jan 23, 2018 at 02:03:57PM -0500, Rik van Riel wrote: > On Mon, 2018-01-22 at 14:00 -0800, Tejun Heo wrote: > > debug_show_all_locks() iterates all tasks and print held locks whole > > holding tasklist_lock. This can take a while on a slow console > > device > > and may end up triggering NMI hardlockup detector if someone else > > ends > > up waiting for tasklist_lock. > > > > Touch the NMI watchdog while printing the held locks to avoid > > spuriously triggering the hardlockup detector. > > > > Signed-off-by: Tejun Heo <tj@kernel.org> > > On this patch: > > Acked-by: Rik van Riel <riel@surriel.com> > > > However, it seems like we run into things like > this on a fairly regular (though not very frequent) > basis. Would it make sense to go through the code > and add sprinkle around a few more touch_nmi_watchdog() > calls? > > After all, there are maybe a few dozen places where > we print out a lot of debugging information. Yeah, it's ridiculous how often printk ends up escalating otherwise recoverable situations into system crashes. I don't know what the right answer is. For spurious NMI hardlockups, maybe auditing debug paths and adding touch_nmi_watchdog() would be enough but that also is a pretty leaky approach. Thanks. -- tejun ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] lockdep: Avoid triggering hardlockup from debug_show_all_locks() 2018-01-23 20:57 ` Tejun Heo @ 2018-01-23 21:00 ` Steven Rostedt 2018-01-23 21:11 ` Tejun Heo 0 siblings, 1 reply; 10+ messages in thread From: Steven Rostedt @ 2018-01-23 21:00 UTC (permalink / raw) To: Tejun Heo Cc: Rik van Riel, Peter Zijlstra, Ingo Molnar, linux-kernel, kernel-team, Sergey Senozhatsky, Petr Mladek On Tue, 23 Jan 2018 12:57:06 -0800 Tejun Heo <tj@kernel.org> wrote: > Yeah, it's ridiculous how often printk ends up escalating otherwise > recoverable situations into system crashes. I don't know what the > right answer is. For spurious NMI hardlockups, maybe auditing debug > paths and adding touch_nmi_watchdog() would be enough but that also is > a pretty leaky approach. What about if every printk were to touch NMI watchdog? NMI watchdog is really there for when the system locks up. If the system is locked up doing printk, at least we see what is happening, and not a total freeze. -- Steve ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] lockdep: Avoid triggering hardlockup from debug_show_all_locks() 2018-01-23 21:00 ` Steven Rostedt @ 2018-01-23 21:11 ` Tejun Heo 2018-01-24 2:49 ` Sergey Senozhatsky 0 siblings, 1 reply; 10+ messages in thread From: Tejun Heo @ 2018-01-23 21:11 UTC (permalink / raw) To: Steven Rostedt Cc: Rik van Riel, Peter Zijlstra, Ingo Molnar, linux-kernel, kernel-team, Sergey Senozhatsky, Petr Mladek Hello, Steven. On Tue, Jan 23, 2018 at 04:00:54PM -0500, Steven Rostedt wrote: > On Tue, 23 Jan 2018 12:57:06 -0800 > Tejun Heo <tj@kernel.org> wrote: > > > Yeah, it's ridiculous how often printk ends up escalating otherwise > > recoverable situations into system crashes. I don't know what the > > right answer is. For spurious NMI hardlockups, maybe auditing debug > > paths and adding touch_nmi_watchdog() would be enough but that also is > > a pretty leaky approach. > > What about if every printk were to touch NMI watchdog? > > NMI watchdog is really there for when the system locks up. If the > system is locked up doing printk, at least we see what is happening, > and not a total freeze. Yeah, that would definitely be a solution. The downside is that when the system completely locks up from printk storm while holding critical locks (say, tasklist_lock), the watchdog won't be able to reset the system. I guess the judgement would depend on what one expects of the NMI watchdog, but I personally would be happier with printk touching NMI automatically. Thanks. -- tejun ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] lockdep: Avoid triggering hardlockup from debug_show_all_locks() 2018-01-23 21:11 ` Tejun Heo @ 2018-01-24 2:49 ` Sergey Senozhatsky 2018-01-24 2:54 ` Steven Rostedt 0 siblings, 1 reply; 10+ messages in thread From: Sergey Senozhatsky @ 2018-01-24 2:49 UTC (permalink / raw) To: Tejun Heo Cc: Steven Rostedt, Rik van Riel, Peter Zijlstra, Ingo Molnar, linux-kernel, kernel-team, Sergey Senozhatsky, Petr Mladek Hello, On (01/23/18 13:11), Tejun Heo wrote: [..] > > What about if every printk were to touch NMI watchdog? > > > > NMI watchdog is really there for when the system locks up. If the > > system is locked up doing printk, at least we see what is happening, > > and not a total freeze. > > Yeah, that would definitely be a solution. The downside is that when > the system completely locks up from printk storm while holding > critical locks (say, tasklist_lock), the watchdog won't be able to > reset the system. Agreed. It's not only NMI watchdog. RCU also might get stalled by printk. > I guess the judgement would depend on what one expects of the NMI watchdog, > but I personally would be happier with printk touching NMI automatically. In the long term I think I'd rather move printk to a batched mode: printk for X seconds (depending on watchdog threshold) tops and offload, don't stay in the same context. It seems, sometimes, that "offloading will ruin printk" thing might be a bit exaggerated. IMHO. -ss P.S. Another problem, and I mentioned it somewhere in another email, is that upstream printk people don't receive enough [if any at all] feedback from guys who face printk issues. That's why every time printk_kthread re-surfaces the reaction is "this is not a real problem, no one is seeing printk issues like these, you idiot!". It'd be great to have more "we need ABC, because of XYZ, but printk crashes the system. Here is the backtrace, fix it" reports. As of now, those things mostly are not reported, that's why people are not convinced. Just my 5 cents. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] lockdep: Avoid triggering hardlockup from debug_show_all_locks() 2018-01-24 2:49 ` Sergey Senozhatsky @ 2018-01-24 2:54 ` Steven Rostedt 2018-01-24 5:00 ` Sergey Senozhatsky 0 siblings, 1 reply; 10+ messages in thread From: Steven Rostedt @ 2018-01-24 2:54 UTC (permalink / raw) To: Sergey Senozhatsky Cc: Tejun Heo, Rik van Riel, Peter Zijlstra, Ingo Molnar, linux-kernel, kernel-team, Sergey Senozhatsky, Petr Mladek On Wed, 24 Jan 2018 11:49:55 +0900 Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> wrote: > Another problem, and I mentioned it somewhere in another email, is that > upstream printk people don't receive enough [if any at all] feedback from > guys who face printk issues. That's why every time printk_kthread re-surfaces > the reaction is "this is not a real problem, no one is seeing printk issues > like these, you idiot!". It'd be great to have more "we need ABC, because of > XYZ, but printk crashes the system. Here is the backtrace, fix it" reports. > As of now, those things mostly are not reported, that's why people are not > convinced. Just my 5 cents. If you are seeing these issues, have whoever is reporting them to Cc LKML, and those of us that would care to listen. -- Steve ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] lockdep: Avoid triggering hardlockup from debug_show_all_locks() 2018-01-24 2:54 ` Steven Rostedt @ 2018-01-24 5:00 ` Sergey Senozhatsky 2018-01-24 19:10 ` Tejun Heo 0 siblings, 1 reply; 10+ messages in thread From: Sergey Senozhatsky @ 2018-01-24 5:00 UTC (permalink / raw) To: Steven Rostedt Cc: Sergey Senozhatsky, Tejun Heo, Rik van Riel, Peter Zijlstra, Ingo Molnar, linux-kernel, kernel-team, Sergey Senozhatsky, Petr Mladek On (01/23/18 21:54), Steven Rostedt wrote: > > > Another problem, and I mentioned it somewhere in another email, is that > > upstream printk people don't receive enough [if any at all] feedback from > > guys who face printk issues. That's why every time printk_kthread re-surfaces > > the reaction is "this is not a real problem, no one is seeing printk issues > > like these, you idiot!". It'd be great to have more "we need ABC, because of > > XYZ, but printk crashes the system. Here is the backtrace, fix it" reports. > > As of now, those things mostly are not reported, that's why people are not > > convinced. Just my 5 cents. > > If you are seeing these issues, have whoever is reporting them to Cc > LKML, and those of us that would care to listen. OK. The lack of reports can also signify that we need to change the way we handle those reports. If we are going to reply "yes, your system crashed while doing completely innocent printout, but if we fix it then we can increase by 0.0001% chances of not getting any printouts at all in that corner case when your system is in recursive double panic over the latest bitcoin price and your keyboard is on fire" then I don't think people will care to report anything. Maybe Tejun will be kind enough to shed some light on how often FB fleet suffer from printk related issues, or what are the most common scenarios, etc. [sensitive information can be reported "off list"] -ss ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] lockdep: Avoid triggering hardlockup from debug_show_all_locks() 2018-01-24 5:00 ` Sergey Senozhatsky @ 2018-01-24 19:10 ` Tejun Heo 0 siblings, 0 replies; 10+ messages in thread From: Tejun Heo @ 2018-01-24 19:10 UTC (permalink / raw) To: Sergey Senozhatsky Cc: Steven Rostedt, Rik van Riel, Peter Zijlstra, Ingo Molnar, linux-kernel, kernel-team, Sergey Senozhatsky, Petr Mladek Hello, Sergey, Steven. On Wed, Jan 24, 2018 at 02:00:35PM +0900, Sergey Senozhatsky wrote: > On (01/23/18 21:54), Steven Rostedt wrote: > > > > > Another problem, and I mentioned it somewhere in another email, is that > > > upstream printk people don't receive enough [if any at all] feedback from > > > guys who face printk issues. That's why every time printk_kthread re-surfaces > > > the reaction is "this is not a real problem, no one is seeing printk issues > > > like these, you idiot!". It'd be great to have more "we need ABC, because of > > > XYZ, but printk crashes the system. Here is the backtrace, fix it" reports. > > > As of now, those things mostly are not reported, that's why people are not > > > convinced. Just my 5 cents. > > > > If you are seeing these issues, have whoever is reporting them to Cc > > LKML, and those of us that would care to listen. > > OK. The lack of reports can also signify that we need to change the way we > handle those reports. If we are going to reply "yes, your system crashed > while doing completely innocent printout, but if we fix it then we can > increase by 0.0001% chances of not getting any printouts at all in that > corner case when your system is in recursive double panic over the latest > bitcoin price and your keyboard is on fire" then I don't think people will > care to report anything. > > Maybe Tejun will be kind enough to shed some light on how often FB fleet > suffer from printk related issues, or what are the most common scenarios, > etc. [sensitive information can be reported "off list"] There are efforts to automatically scrub and share kernel splats publicly, so hopefully we'd be able to provide a better visibility into the problems we encounter in the future. For now, there are some security implications and I'm not very sure how liberal I can share. In terms of frequency, it isn't catastrophic. I think Chris Mason described it pretty eloquently - "more often than hardware failures, not so often that we turn off serial console". One painful part is that it adds noise to signal by escalating an unrelated problem to RCU stalls or hard lockups just by occupying an unlucky context for too long. The relevance of these messages fall pretty rapidly as the number of consecutive lines increases, so it is frustrating to pay for them this way. After 15s of flushing along with a number of "X printk messages dropped" warnings, we aren't really doing anyone any service by trying to uphold "transmit the message as close to the printking context as possible". Thanks. -- tejun ^ permalink raw reply [flat|nested] 10+ messages in thread
* [tip:locking/urgent] locking/lockdep: Avoid triggering hardlockup from debug_show_all_locks() 2018-01-22 22:00 [PATCH] lockdep: Avoid triggering hardlockup from debug_show_all_locks() Tejun Heo 2018-01-23 19:03 ` Rik van Riel @ 2018-01-24 10:38 ` tip-bot for Tejun Heo 1 sibling, 0 replies; 10+ messages in thread From: tip-bot for Tejun Heo @ 2018-01-24 10:38 UTC (permalink / raw) To: linux-tip-commits; +Cc: linux-kernel, hpa, peterz, mingo, tj, torvalds, tglx Commit-ID: 88f1c87de11a86d839f4ce5313e552d96709b990 Gitweb: https://git.kernel.org/tip/88f1c87de11a86d839f4ce5313e552d96709b990 Author: Tejun Heo <tj@kernel.org> AuthorDate: Mon, 22 Jan 2018 14:00:55 -0800 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Wed, 24 Jan 2018 10:00:09 +0100 locking/lockdep: Avoid triggering hardlockup from debug_show_all_locks() debug_show_all_locks() iterates all tasks and print held locks whole holding tasklist_lock. This can take a while on a slow console device and may end up triggering NMI hardlockup detector if someone else ends up waiting for tasklist_lock. Touch the NMI watchdog while printing the held locks to avoid spuriously triggering the hardlockup detector. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-team@fb.com Link: http://lkml.kernel.org/r/20180122220055.GB1771050@devbig577.frc2.facebook.com Signed-off-by: Ingo Molnar <mingo@kernel.org> --- kernel/locking/lockdep.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 5fa1324..5216590 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -49,6 +49,7 @@ #include <linux/gfp.h> #include <linux/random.h> #include <linux/jhash.h> +#include <linux/nmi.h> #include <asm/sections.h> @@ -4490,6 +4491,7 @@ retry: if (!unlock) if (read_trylock(&tasklist_lock)) unlock = 1; + touch_nmi_watchdog(); } while_each_thread(g, p); pr_warn("\n"); ^ permalink raw reply related [flat|nested] 10+ messages in thread
end of thread, other threads:[~2018-01-24 19:10 UTC | newest] Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-01-22 22:00 [PATCH] lockdep: Avoid triggering hardlockup from debug_show_all_locks() Tejun Heo 2018-01-23 19:03 ` Rik van Riel 2018-01-23 20:57 ` Tejun Heo 2018-01-23 21:00 ` Steven Rostedt 2018-01-23 21:11 ` Tejun Heo 2018-01-24 2:49 ` Sergey Senozhatsky 2018-01-24 2:54 ` Steven Rostedt 2018-01-24 5:00 ` Sergey Senozhatsky 2018-01-24 19:10 ` Tejun Heo 2018-01-24 10:38 ` [tip:locking/urgent] locking/lockdep: " tip-bot for Tejun Heo
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).