From: Thomas Gleixner <tglx@linutronix.de> To: Steven Rostedt <rostedt@goodmis.org> Cc: LKML <linux-kernel@vger.kernel.org>, Peter Zijlstra <peterz@infradead.org>, Masami Hiramatsu <mhiramat@kernel.org>, Alexei Starovoitov <ast@kernel.org>, Mathieu Desnoyers <mathieu.desnoyers@efficios.com>, "Paul E. McKenney" <paulmck@kernel.org>, Joel Fernandes <joel@joelfernandes.org>, Frederic Weisbecker <frederic@kernel.org> Subject: Re: Instrumentation and RCU Date: Mon, 09 Mar 2020 19:59:18 +0100 Message-ID: <87fteh73sp.fsf@nanos.tec.linutronix.de> (raw) In-Reply-To: <20200309141546.5b574908@gandalf.local.home> Steven Rostedt <rostedt@goodmis.org> writes: > On Mon, 09 Mar 2020 18:02:32 +0100 > Thomas Gleixner <tglx@linutronix.de> wrote: >> #1 Fragile low level entry code >> >> While I understand the desire of instrumentation to observe >> everything we really have to ask the question whether it is worth the >> trouble especially with entry trainwrecks like x86, PTI and other >> horrors in that area. >> >> I don't think so and we really should just bite the bullet and forbid >> any instrumentation in that code unless it is explicitly designed >> for that case, makes sense and has a real value from an observation >> perspective. >> >> This is very much related to #3.. > > Now for just entry portions (entering into interrupt context or > coming from userspace into normal kernel code) I agree, and I'm fine if we > can move things around and not trace these entry points if possible. I think we can. I spent quite some effort on splitting up the syscall entry/exit stuff so only the low level entry up to the point where RCU is watching and the low level exit where RCU is evtl. going back to idle plus the related ASM maze. It's quite some tedious work, that's why I came up with the section approach. >> #2) Breakpoint utilization >> >> As recent findings have shown, breakpoint utilization needs to be >> extremly careful about not creating infinite breakpoint recursions. >> >> I think that's pretty much obvious, but falls into the overall >> question of how to protect callchains. > > This is rather unique, and I agree that its best to at least get to a point > where we limit the tracing within breakpoint code. I'm fine with making > rcu_nmi_exit() nokprobe too. Yes, the break point stuff is unique, but it has nicely demonstrated how much of the code is affected by it. >> #3) RCU idle >> >> Being able to trace code inside RCU idle sections is very similar to >> the question raised in #1. >> >> Assume all of the instrumentation would be doing conditional RCU >> schemes, i.e.: >> >> if (rcuidle) >> .... >> else >> rcu_read_lock_sched() >> >> before invoking the actual instrumentation functions and of course >> undoing that right after it, that really begs the question whether >> it's worth it. >> >> Especially constructs like: >> >> trace_hardirqs_off() >> idx = srcu_read_lock() >> rcu_irq_enter_irqson(); >> ... >> rcu_irq_exit_irqson(); >> srcu_read_unlock(idx); >> >> if (user_mode) >> user_exit_irqsoff(); >> else >> rcu_irq_enter(); >> >> are really more than questionable. For 99.9999% of instrumentation >> users it's absolutely irrelevant whether this traces the interrupt >> disabled time of user_exit_irqsoff() or rcu_irq_enter() or not. >> >> But what's relevant is the tracer overhead which is e.g. inflicted >> with todays trace_hardirqs_off/on() implementation because that >> unconditionally uses the rcuidle variant with the scru/rcu_irq dance >> around every tracepoint. >> >> Even if the tracepoint sits in the ASM code it just covers about ~20 >> low level ASM instructions more. The tracer invocation, which is >> even done twice when coming from user space on x86 (the second call >> is optimized in the tracer C-code), costs definitely way more >> cycles. When you take the scru/rcu_irq dance into account it's a >> complete disaster performance wise. > > Is this specifically to do with the kernel/trace/trace_preemptirqs.c code > that was added by Joel? Don't know who it added, but yes, the above is what it does when it's expanded. I did some profiling with a trivial invalid syscall and trace_hardirqs_off() shows very prominent there, while with the approach I have in my entry series which does: lockdep_irq_off(); enter_from_user_mode(); __trace_hardirqs_off(); The tracing overhead goes down by ~ factor 2. That's a lot. >> #4 Protecting call chains >> >> Our current approach of annotating functions with notrace/noprobe is >> pretty much broken. >> >> Functions which are marked NOPROBE or notrace call out into functions >> which are not marked and while this might be ok, there are enough >> places where it is not. But we have no way to verify that. > > Note, if notrace is an issue it shows up pretty quickly, as just enabling > function tracing will enable all non notrace locations, and if something > shouldn't be traced, it will crash immediately. Steven, you're not really serious about this, right? This is tinkering at best. We have code pathes which are not necessarily covered in regular testing, depend on config options etc. Have you ever looked at code coverage maps? There are quite some spots which we don't reach in testing. So how do you explain the user that the blind spot he hit in the weird situation on his server which he wanted to analyze crashed his machine? Having 'off limit' sections allows us to do proper tool based analysis with full coverage. That's really the only sensible approach. > I have a RCU option for ftrace ops to set, if it requires RCU to be > watching, and in that case, it wont call the callback if RCU is not > watching. That's nice but does not solve the problem. >> That's just a recipe for disaster. We really cannot request from >> sysadmins who want to use instrumentation to stare at the code first >> whether they can place/enable an instrumentation point somewhere. >> That'd be just a bad joke. >> >> I really think we need to have proper text sections which are off >> limit for any form of instrumentation and have tooling to analyze the >> calls into other sections. These calls need to be annotated as safe >> and intentional. >> >> Thoughts? > > This can expand quite a bit. At least when I did something similar with > NMIs, as there was a time I wanted to flag all places that could be called > from NMI, and found that there's a lot of code that can be. > > I can imagine the same for marking nokprobes as well. And I really don't > want to make all notrace stop tracing the entire function and all that it > can call, as that will go back to removing all callers from NMIs as > do_nmi() itself is notrace. The point is that you have something like this: section "text.offlimit" nmi() { do_fragile_stuff_on_enter(); offlimit_safecall(do_some_instrumentable_stuff()); do_fragile_stuff_on_exit(); } section "text" do_some_instrumentable_stuff() { } So if someone adds another call section "text.offlimit" nmi() { do_fragile_stuff_on_enter(); offlimit_safecall(do_some_instrumentable_stuff()); do_some_other_instrumentable_stuff(); do_fragile_stuff_on_exit(); } which is also in section "text" then the analysis tool will find the missing offlimit_safecall() - or what ever method we chose to annotate that stuff. Surely not an annotation on the called function itself because that might be safe to call in one context but not in another. These annotations are halfways easy to monitor for abuse and they should be prominent enough in the code that at least for the people dealing with that kind of code they act as a warning flag. Thanks, tglx
next prev parent reply index Thread overview: 44+ messages / expand[flat|nested] mbox.gz Atom feed top 2020-03-09 17:02 Thomas Gleixner 2020-03-09 18:15 ` Steven Rostedt 2020-03-09 18:42 ` Joel Fernandes 2020-03-09 19:07 ` Steven Rostedt 2020-03-09 19:20 ` Mathieu Desnoyers 2020-03-16 15:02 ` Joel Fernandes 2020-03-09 18:59 ` Thomas Gleixner [this message] 2020-03-10 8:09 ` Masami Hiramatsu 2020-03-10 11:43 ` Thomas Gleixner 2020-03-10 15:31 ` Mathieu Desnoyers 2020-03-10 15:46 ` Steven Rostedt 2020-03-10 16:21 ` Mathieu Desnoyers 2020-03-11 0:18 ` Masami Hiramatsu 2020-03-11 0:37 ` Mathieu Desnoyers 2020-03-11 7:48 ` Masami Hiramatsu 2020-03-10 16:06 ` Masami Hiramatsu 2020-03-12 13:53 ` Peter Zijlstra 2020-03-10 15:24 ` Mathieu Desnoyers 2020-03-10 17:05 ` Daniel Thompson 2020-03-09 18:37 ` Mathieu Desnoyers 2020-03-09 18:44 ` Steven Rostedt 2020-03-09 18:52 ` Mathieu Desnoyers 2020-03-09 19:09 ` Steven Rostedt 2020-03-09 19:25 ` Mathieu Desnoyers 2020-03-09 19:52 ` Thomas Gleixner 2020-03-10 15:03 ` Mathieu Desnoyers 2020-03-10 16:48 ` Thomas Gleixner 2020-03-10 17:40 ` Mathieu Desnoyers 2020-03-10 18:31 ` Thomas Gleixner 2020-03-10 18:37 ` Mathieu Desnoyers 2020-03-10 1:40 ` Alexei Starovoitov 2020-03-10 8:02 ` Thomas Gleixner 2020-03-10 16:54 ` Paul E. McKenney 2020-03-17 17:56 ` Joel Fernandes 2020-03-09 20:18 ` Peter Zijlstra 2020-03-09 20:47 ` Paul E. McKenney 2020-03-09 20:58 ` Steven Rostedt 2020-03-09 21:25 ` Paul E. McKenney 2020-03-09 23:52 ` Frederic Weisbecker 2020-03-10 2:26 ` Paul E. McKenney 2020-03-10 15:13 ` Mathieu Desnoyers 2020-03-10 16:49 ` Paul E. McKenney 2020-03-10 17:22 ` Mathieu Desnoyers 2020-03-10 17:26 ` Paul E. McKenney
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=87fteh73sp.fsf@nanos.tec.linutronix.de \ --to=tglx@linutronix.de \ --cc=ast@kernel.org \ --cc=frederic@kernel.org \ --cc=joel@joelfernandes.org \ --cc=linux-kernel@vger.kernel.org \ --cc=mathieu.desnoyers@efficios.com \ --cc=mhiramat@kernel.org \ --cc=paulmck@kernel.org \ --cc=peterz@infradead.org \ --cc=rostedt@goodmis.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
LKML Archive on lore.kernel.org Archives are clonable: git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git git clone --mirror https://lore.kernel.org/lkml/8 lkml/git/8.git git clone --mirror https://lore.kernel.org/lkml/9 lkml/git/9.git # If you have public-inbox 1.1+ installed, you may # initialize and index your mirror using the following commands: public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \ linux-kernel@vger.kernel.org public-inbox-index lkml Example config snippet for mirrors Newsgroup available over NNTP: nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel AGPL code for this site: git clone https://public-inbox.org/public-inbox.git