linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Instrumentation and RCU
@ 2020-03-09 17:02 Thomas Gleixner
  2020-03-09 18:15 ` Steven Rostedt
                   ` (3 more replies)
  0 siblings, 4 replies; 44+ messages in thread
From: Thomas Gleixner @ 2020-03-09 17:02 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, Steven Rostedt, Masami Hiramatsu,
	Alexei Starovoitov, Mathieu Desnoyers, Paul E. McKenney,
	Joel Fernandes, Frederic Weisbecker

Folks,

I'm starting a new conversation because there are about 20 different
threads which look at that problem in various ways and the information
is so scattered that creating a coherent picture is pretty much
impossible.

There are several problems to solve:

   1) Fragile low level entry code

   2) Breakpoint utilization

   3) RCU idle

   4) Callchain protection

#1 Fragile low level entry code

   While I understand the desire of instrumentation to observe
   everything we really have to ask the question whether it is worth the
   trouble especially with entry trainwrecks like x86, PTI and other
   horrors in that area.

   I don't think so and we really should just bite the bullet and forbid
   any instrumentation in that code unless it is explicitly designed
   for that case, makes sense and has a real value from an observation
   perspective.

   This is very much related to #3..

#2) Breakpoint utilization

    As recent findings have shown, breakpoint utilization needs to be
    extremly careful about not creating infinite breakpoint recursions.

    I think that's pretty much obvious, but falls into the overall
    question of how to protect callchains.

#3) RCU idle

    Being able to trace code inside RCU idle sections is very similar to
    the question raised in #1.

    Assume all of the instrumentation would be doing conditional RCU
    schemes, i.e.:

    if (rcuidle)
    	....
    else
        rcu_read_lock_sched()

    before invoking the actual instrumentation functions and of course
    undoing that right after it, that really begs the question whether
    it's worth it.

    Especially constructs like:

    trace_hardirqs_off()
       idx = srcu_read_lock()
       rcu_irq_enter_irqson();
       ...
       rcu_irq_exit_irqson();
       srcu_read_unlock(idx);

    if (user_mode)
       user_exit_irqsoff();
    else
       rcu_irq_enter();

    are really more than questionable. For 99.9999% of instrumentation
    users it's absolutely irrelevant whether this traces the interrupt
    disabled time of user_exit_irqsoff() or rcu_irq_enter() or not.

    But what's relevant is the tracer overhead which is e.g. inflicted
    with todays trace_hardirqs_off/on() implementation because that
    unconditionally uses the rcuidle variant with the scru/rcu_irq dance
    around every tracepoint.

    Even if the tracepoint sits in the ASM code it just covers about ~20
    low level ASM instructions more. The tracer invocation, which is
    even done twice when coming from user space on x86 (the second call
    is optimized in the tracer C-code), costs definitely way more
    cycles. When you take the scru/rcu_irq dance into account it's a
    complete disaster performance wise.

#4 Protecting call chains

   Our current approach of annotating functions with notrace/noprobe is
   pretty much broken.

   Functions which are marked NOPROBE or notrace call out into functions
   which are not marked and while this might be ok, there are enough
   places where it is not. But we have no way to verify that.

   That's just a recipe for disaster. We really cannot request from
   sysadmins who want to use instrumentation to stare at the code first
   whether they can place/enable an instrumentation point somewhere.
   That'd be just a bad joke.

   I really think we need to have proper text sections which are off
   limit for any form of instrumentation and have tooling to analyze the
   calls into other sections. These calls need to be annotated as safe
   and intentional.

Thoughts?

Thanks,

        tglx






   


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 17:02 Instrumentation and RCU Thomas Gleixner
@ 2020-03-09 18:15 ` Steven Rostedt
  2020-03-09 18:42   ` Joel Fernandes
  2020-03-09 18:59   ` Thomas Gleixner
  2020-03-09 18:37 ` Mathieu Desnoyers
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 44+ messages in thread
From: Steven Rostedt @ 2020-03-09 18:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, Masami Hiramatsu, Alexei Starovoitov,
	Mathieu Desnoyers, Paul E. McKenney, Joel Fernandes,
	Frederic Weisbecker

On Mon, 09 Mar 2020 18:02:32 +0100
Thomas Gleixner <tglx@linutronix.de> wrote:

> Folks,
> 
> I'm starting a new conversation because there are about 20 different
> threads which look at that problem in various ways and the information
> is so scattered that creating a coherent picture is pretty much
> impossible.
> 
> There are several problems to solve:
> 
>    1) Fragile low level entry code
> 
>    2) Breakpoint utilization
> 
>    3) RCU idle
> 
>    4) Callchain protection
> 
> #1 Fragile low level entry code
> 
>    While I understand the desire of instrumentation to observe
>    everything we really have to ask the question whether it is worth the
>    trouble especially with entry trainwrecks like x86, PTI and other
>    horrors in that area.
> 
>    I don't think so and we really should just bite the bullet and forbid
>    any instrumentation in that code unless it is explicitly designed
>    for that case, makes sense and has a real value from an observation
>    perspective.
> 
>    This is very much related to #3..

Now for just entry portions (entering into interrupt context or
coming from userspace into normal kernel code) I agree, and I'm fine if we
can move things around and not trace these entry points if possible.

> 
> #2) Breakpoint utilization
> 
>     As recent findings have shown, breakpoint utilization needs to be
>     extremly careful about not creating infinite breakpoint recursions.
> 
>     I think that's pretty much obvious, but falls into the overall
>     question of how to protect callchains.

This is rather unique, and I agree that its best to at least get to a point
where we limit the tracing within breakpoint code. I'm fine with making
rcu_nmi_exit() nokprobe too.


> 
> #3) RCU idle
> 
>     Being able to trace code inside RCU idle sections is very similar to
>     the question raised in #1.
> 
>     Assume all of the instrumentation would be doing conditional RCU
>     schemes, i.e.:
> 
>     if (rcuidle)
>     	....
>     else
>         rcu_read_lock_sched()
> 
>     before invoking the actual instrumentation functions and of course
>     undoing that right after it, that really begs the question whether
>     it's worth it.
> 
>     Especially constructs like:
> 
>     trace_hardirqs_off()
>        idx = srcu_read_lock()
>        rcu_irq_enter_irqson();
>        ...
>        rcu_irq_exit_irqson();
>        srcu_read_unlock(idx);
> 
>     if (user_mode)
>        user_exit_irqsoff();
>     else
>        rcu_irq_enter();
> 
>     are really more than questionable. For 99.9999% of instrumentation
>     users it's absolutely irrelevant whether this traces the interrupt
>     disabled time of user_exit_irqsoff() or rcu_irq_enter() or not.
> 
>     But what's relevant is the tracer overhead which is e.g. inflicted
>     with todays trace_hardirqs_off/on() implementation because that
>     unconditionally uses the rcuidle variant with the scru/rcu_irq dance
>     around every tracepoint.
> 
>     Even if the tracepoint sits in the ASM code it just covers about ~20
>     low level ASM instructions more. The tracer invocation, which is
>     even done twice when coming from user space on x86 (the second call
>     is optimized in the tracer C-code), costs definitely way more
>     cycles. When you take the scru/rcu_irq dance into account it's a
>     complete disaster performance wise.

Is this specifically to do with the kernel/trace/trace_preemptirqs.c code
that was added by Joel?

> 
> #4 Protecting call chains
> 
>    Our current approach of annotating functions with notrace/noprobe is
>    pretty much broken.
> 
>    Functions which are marked NOPROBE or notrace call out into functions
>    which are not marked and while this might be ok, there are enough
>    places where it is not. But we have no way to verify that.

Note, if notrace is an issue it shows up pretty quickly, as just enabling
function tracing will enable all non notrace locations, and if something
shouldn't be traced, it will crash immediately.

I have a RCU option for ftrace ops to set, if it requires RCU to be
watching, and in that case, it wont call the callback if RCU is not
watching.


> 
>    That's just a recipe for disaster. We really cannot request from
>    sysadmins who want to use instrumentation to stare at the code first
>    whether they can place/enable an instrumentation point somewhere.
>    That'd be just a bad joke.
> 
>    I really think we need to have proper text sections which are off
>    limit for any form of instrumentation and have tooling to analyze the
>    calls into other sections. These calls need to be annotated as safe
>    and intentional.
> 
> Thoughts?

This can expand quite a bit. At least when I did something similar with
NMIs, as there was a time I wanted to flag all places that could be called
from NMI, and found that there's a lot of code that can be.

I can imagine the same for marking nokprobes as well. And I really don't
want to make all notrace stop tracing the entire function and all that it
can call, as that will go back to removing all callers from NMIs as
do_nmi() itself is notrace.


-- Steve

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 17:02 Instrumentation and RCU Thomas Gleixner
  2020-03-09 18:15 ` Steven Rostedt
@ 2020-03-09 18:37 ` Mathieu Desnoyers
  2020-03-09 18:44   ` Steven Rostedt
                     ` (2 more replies)
  2020-03-09 20:18 ` Peter Zijlstra
  2020-03-09 20:47 ` Paul E. McKenney
  3 siblings, 3 replies; 44+ messages in thread
From: Mathieu Desnoyers @ 2020-03-09 18:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Peter Zijlstra, rostedt, Masami Hiramatsu,
	Alexei Starovoitov, paulmck, Joel Fernandes, Google,
	Frederic Weisbecker

----- On Mar 9, 2020, at 1:02 PM, Thomas Gleixner tglx@linutronix.de wrote:

> Folks,
> 
> I'm starting a new conversation because there are about 20 different
> threads which look at that problem in various ways and the information
> is so scattered that creating a coherent picture is pretty much
> impossible.
> 
> There are several problems to solve:
> 
>   1) Fragile low level entry code
> 
>   2) Breakpoint utilization
> 
>   3) RCU idle
> 
>   4) Callchain protection
> 
> #1 Fragile low level entry code
> 
>   While I understand the desire of instrumentation to observe
>   everything we really have to ask the question whether it is worth the
>   trouble especially with entry trainwrecks like x86, PTI and other
>   horrors in that area.
> 
>   I don't think so and we really should just bite the bullet and forbid
>   any instrumentation in that code unless it is explicitly designed
>   for that case, makes sense and has a real value from an observation
>   perspective.
> 
>   This is very much related to #3..

Do I understand correctly that you intend on moving all kernel low level
entry/exit code into sections which cannot be instrumented by kprobes nor
the function tracer, and require explicit whitelisting, either through
annotations or use of explicit tracepoints ?

> 
> #2) Breakpoint utilization
> 
>    As recent findings have shown, breakpoint utilization needs to be
>    extremly careful about not creating infinite breakpoint recursions.
> 
>    I think that's pretty much obvious, but falls into the overall
>    question of how to protect callchains.

I think there is another question that arises here: the lack of automated
continuous testing of the kprobes coverage. We have performed some testing of
various random permutations of kprobes instrumentation, and have succeeded in
crashing the kernel in various ways. Unfortunately, that testing is not done
on a continuous basis, and maintainers understandably have little spare time
to play the whack-a-mole game of adding missing nokprobes annotations as
the kernel code evolves.

> 
> #3) RCU idle
> 
>    Being able to trace code inside RCU idle sections is very similar to
>    the question raised in #1.
> 
>    Assume all of the instrumentation would be doing conditional RCU
>    schemes, i.e.:
> 
>    if (rcuidle)
>    	....
>    else
>        rcu_read_lock_sched()
> 
>    before invoking the actual instrumentation functions and of course
>    undoing that right after it, that really begs the question whether
>    it's worth it.
> 
>    Especially constructs like:
> 
>    trace_hardirqs_off()
>       idx = srcu_read_lock()
>       rcu_irq_enter_irqson();
>       ...
>       rcu_irq_exit_irqson();
>       srcu_read_unlock(idx);
> 
>    if (user_mode)
>       user_exit_irqsoff();
>    else
>       rcu_irq_enter();
> 
>    are really more than questionable. For 99.9999% of instrumentation
>    users it's absolutely irrelevant whether this traces the interrupt
>    disabled time of user_exit_irqsoff() or rcu_irq_enter() or not.
> 
>    But what's relevant is the tracer overhead which is e.g. inflicted
>    with todays trace_hardirqs_off/on() implementation because that
>    unconditionally uses the rcuidle variant with the scru/rcu_irq dance
>    around every tracepoint.

I think one of the big issues here is that most of the uses of
trace_hardirqs_off() are from sites which already have RCU watching,
so we are doing heavy-weight operations for nothing.

I strongly suspect that most of the overhead we've been trying to avoid when
introducing use of SRCU in rcuidle tracepoints was actually caused by callsites
which use rcuidle tracepoints while having RCU watching, just because there is a
handful of callsites which don't have RCU watching. This is confirmed
by the commit message of commit e6753f23d9 "tracepoint: Make rcuidle
tracepoint callers use SRCU":

   "In recent tests with IRQ on/off tracepoints, a large performance
    overhead ~10% is noticed when running hackbench. This is root caused to
    calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
    tracepoint code. Following a long discussion on the list [1] about this,
    we concluded that srcu is a better alternative for use during rcu idle.
    Although it does involve extra barriers, its lighter than the sched-rcu
    version which has to do additional RCU calls to notify RCU idle about
    entry into RCU sections.
[...]
    Test: Tested idle and preempt/irq tracepoints."

So I think we could go back to plain RCU for rcuidle tracepoints if we do
the cheaper "rcu_is_watching()" check rather than invoking
rcu_irq_{enter,exit}_irqson() unconditionally.

>    Even if the tracepoint sits in the ASM code it just covers about ~20
>    low level ASM instructions more. The tracer invocation, which is
>    even done twice when coming from user space on x86 (the second call
>    is optimized in the tracer C-code), costs definitely way more
>    cycles. When you take the scru/rcu_irq dance into account it's a
>    complete disaster performance wise.

Part of the issue here is the current overhead of SRCU read-side lock,
which contains memory barriers. The other part of the issue is the fact that
rcu_irq_{enter,exit}_irqson() contains an atomic_add_return atomic instruction.

We could use the approach proposed by Peterz's and Steven's patches to basically
do a lightweight "is_rcu_watching()" check for rcuidle tracepoint, and only enable
RCU for those cases. We could then simply go back on using regular RCU like so:

#define __DO_TRACE(tp, proto, args, cond, rcuidle)                      \
        do {                                                            \
                struct tracepoint_func *it_func_ptr;                    \
                void *it_func;                                          \
                void *__data;                                           \
                bool exit_rcu = false;                                  \
                                                                        \
                if (!(cond))                                            \
                        return;                                         \
                                                                        \
                if (rcuidle && !rcu_is_watching()) {                    \
                        rcu_irq_enter_irqson();                         \
                        exit_rcu = true;                                \
                }                                                       \
                preempt_disable_notrace();                              \
                it_func_ptr = rcu_dereference_raw((tp)->funcs);         \
                if (it_func_ptr) {                                      \
                        do {                                            \
                                it_func = (it_func_ptr)->func;          \
                                __data = (it_func_ptr)->data;           \
                                ((void(*)(proto))(it_func))(args);      \
                        } while ((++it_func_ptr)->func);                \
                }                                                       \
                preempt_enable_notrace();                               \
                if (exit_rcu)                                           \
                        rcu_irq_exit_irqson();                          \
        } while (0)



> #4 Protecting call chains
> 
>   Our current approach of annotating functions with notrace/noprobe is
>   pretty much broken.
> 
>   Functions which are marked NOPROBE or notrace call out into functions
>   which are not marked and while this might be ok, there are enough
>   places where it is not. But we have no way to verify that.
> 
>   That's just a recipe for disaster. We really cannot request from
>   sysadmins who want to use instrumentation to stare at the code first
>   whether they can place/enable an instrumentation point somewhere.
>   That'd be just a bad joke.
> 
>   I really think we need to have proper text sections which are off
>   limit for any form of instrumentation and have tooling to analyze the
>   calls into other sections. These calls need to be annotated as safe
>   and intentional.

If we go all the way into that direction, I suspect it might even make sense
to duplicate some kernel functions so they can still be part of the code which
can be instrumented, but provide tracing-specific copy which would be hidden
from instrumentation. I wonder what would be the size cost of this duplication.

In addition to splitting tracing code into a separate section, which I think
makes sense, I can think of another alternative way to provide call chains
protection: adding a "in_tracing" flag somewhere alongside each kernel stack.
Each thread and interrupt stack would have its own flag. However, trap handlers
should share the "in_tracing" flag with the context which triggers the trap.

If a tracer recurses, or if a tracer attempts to trace another tracer, the
instrumentation would break the recursion chain by preventing instrumentation
from firing. If we end up caring about tracers tracing other tracers, we could
have one distinct flag per tracer and let each tracer break the recursion chain.

Having this flag per kernel stack rather than per CPU or per thread would
allow tracing of nested interrupt handlers (and NMIs), but would break
call chains both within the same stack or going through a trap. I think
it could be a nice complementary safety net to handle mishaps in a non-fatal
way.

Thanks,

Mathieu

> 
> Thoughts?
> 
> Thanks,
> 
>         tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 18:15 ` Steven Rostedt
@ 2020-03-09 18:42   ` Joel Fernandes
  2020-03-09 19:07     ` Steven Rostedt
  2020-03-09 18:59   ` Thomas Gleixner
  1 sibling, 1 reply; 44+ messages in thread
From: Joel Fernandes @ 2020-03-09 18:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, Mathieu Desnoyers, Paul E. McKenney,
	Frederic Weisbecker, Daniel Bristot de Oliveira,
	Valentin Schneider

On Mon, Mar 9, 2020 at 11:15 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 09 Mar 2020 18:02:32 +0100
> Thomas Gleixner <tglx@linutronix.de> wrote:
>
[...]
> > #3) RCU idle
> >
> >     Being able to trace code inside RCU idle sections is very similar to
> >     the question raised in #1.
> >
> >     Assume all of the instrumentation would be doing conditional RCU
> >     schemes, i.e.:
> >
> >     if (rcuidle)
> >       ....
> >     else
> >         rcu_read_lock_sched()
> >
> >     before invoking the actual instrumentation functions and of course
> >     undoing that right after it, that really begs the question whether
> >     it's worth it.
> >
> >     Especially constructs like:
> >
> >     trace_hardirqs_off()
> >        idx = srcu_read_lock()
> >        rcu_irq_enter_irqson();
> >        ...
> >        rcu_irq_exit_irqson();
> >        srcu_read_unlock(idx);
> >
> >     if (user_mode)
> >        user_exit_irqsoff();
> >     else
> >        rcu_irq_enter();
> >
> >     are really more than questionable. For 99.9999% of instrumentation
> >     users it's absolutely irrelevant whether this traces the interrupt
> >     disabled time of user_exit_irqsoff() or rcu_irq_enter() or not.
> >
> >     But what's relevant is the tracer overhead which is e.g. inflicted
> >     with todays trace_hardirqs_off/on() implementation because that
> >     unconditionally uses the rcuidle variant with the scru/rcu_irq dance
> >     around every tracepoint.
> >
> >     Even if the tracepoint sits in the ASM code it just covers about ~20
> >     low level ASM instructions more. The tracer invocation, which is
> >     even done twice when coming from user space on x86 (the second call
> >     is optimized in the tracer C-code), costs definitely way more
> >     cycles. When you take the scru/rcu_irq dance into account it's a
> >     complete disaster performance wise.
>
> Is this specifically to do with the kernel/trace/trace_preemptirqs.c code
> that was added by Joel?

Just started a vacation here and will be back on January 12th. Will
take a detailed look at Thomas's email at that time.

Adding some more folks (Daniel, Valentin) who have used the
preempt/irq tracepoints.

I agree we should reorder things and avoid these circular
dependencies, it bothers me too. I am happy to help with any clean ups
related to it. Let us definitely discuss more and fix it. Thanks.

- Joel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 18:37 ` Mathieu Desnoyers
@ 2020-03-09 18:44   ` Steven Rostedt
  2020-03-09 18:52     ` Mathieu Desnoyers
  2020-03-09 19:52   ` Thomas Gleixner
  2020-03-10  1:40   ` Alexei Starovoitov
  2 siblings, 1 reply; 44+ messages in thread
From: Steven Rostedt @ 2020-03-09 18:44 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, paulmck, Joel Fernandes, Google,
	Frederic Weisbecker

On Mon, 9 Mar 2020 14:37:40 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> So I think we could go back to plain RCU for rcuidle tracepoints if we do
> the cheaper "rcu_is_watching()" check rather than invoking
> rcu_irq_{enter,exit}_irqson() unconditionally.

You mean something like this?

-- Steve

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 5f4de82ffa0f..1904dbb3a921 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -164,7 +164,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		struct tracepoint_func *it_func_ptr;			\
 		void *it_func;						\
 		void *__data;						\
-		int __maybe_unused __idx = 0;				\
+		int __maybe_unused __idx = -1;				\
 									\
 		if (!(cond))						\
 			return;						\
@@ -179,8 +179,8 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 		 * For rcuidle callers, use srcu since sched-rcu	\
 		 * doesn't work from the idle path.			\
 		 */							\
-		if (rcuidle)						\
-			__idx = srcu_read_lock_notrace(&tracepoint_srcu);\
+		if (rcuidle && !rcu_is_watching())			\
+			__idx = srcu_read_lock_notrace(&tracepoint_srcu); \
 									\
 		it_func_ptr = rcu_dereference_raw((tp)->funcs);		\
 									\
@@ -199,7 +199,7 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 			} while ((++it_func_ptr)->func);		\
 		}							\
 									\
-		if (rcuidle)						\
+		if (rcuidle && __idx != -1)				\
 			rcu_irq_exit_irqson();				\
 									\
 		preempt_enable_notrace();				\

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 18:44   ` Steven Rostedt
@ 2020-03-09 18:52     ` Mathieu Desnoyers
  2020-03-09 19:09       ` Steven Rostedt
  0 siblings, 1 reply; 44+ messages in thread
From: Mathieu Desnoyers @ 2020-03-09 18:52 UTC (permalink / raw)
  To: rostedt
  Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, paulmck, Joel Fernandes, Google,
	Frederic Weisbecker

----- On Mar 9, 2020, at 2:44 PM, rostedt rostedt@goodmis.org wrote:

> On Mon, 9 Mar 2020 14:37:40 -0400 (EDT)
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> So I think we could go back to plain RCU for rcuidle tracepoints if we do
>> the cheaper "rcu_is_watching()" check rather than invoking
>> rcu_irq_{enter,exit}_irqson() unconditionally.
> 
> You mean something like this?

I have a hard time applying this patch to any tree I can fetch, so
I will use caution and say: probably not. ;-)

And when I say "go back to plain RCU", I really mean removing use of SRCU
from the tracepoints until we have other purposes for it (e.g. taking
faults within specific tracepoint probes such as syscall enter/exit).

Thanks,

Mathieu

> 
> -- Steve
> 
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index 5f4de82ffa0f..1904dbb3a921 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -164,7 +164,7 @@ static inline struct tracepoint
> *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> 		struct tracepoint_func *it_func_ptr;			\
> 		void *it_func;						\
> 		void *__data;						\
> -		int __maybe_unused __idx = 0;				\
> +		int __maybe_unused __idx = -1;				\
> 									\
> 		if (!(cond))						\
> 			return;						\
> @@ -179,8 +179,8 @@ static inline struct tracepoint
> *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> 		 * For rcuidle callers, use srcu since sched-rcu	\
> 		 * doesn't work from the idle path.			\
> 		 */							\
> -		if (rcuidle)						\
> -			__idx = srcu_read_lock_notrace(&tracepoint_srcu);\
> +		if (rcuidle && !rcu_is_watching())			\
> +			__idx = srcu_read_lock_notrace(&tracepoint_srcu); \
> 									\
> 		it_func_ptr = rcu_dereference_raw((tp)->funcs);		\
> 									\
> @@ -199,7 +199,7 @@ static inline struct tracepoint
> *tracepoint_ptr_deref(tracepoint_ptr_t *p)
> 			} while ((++it_func_ptr)->func);		\
> 		}							\
> 									\
> -		if (rcuidle)						\
> +		if (rcuidle && __idx != -1)				\
> 			rcu_irq_exit_irqson();				\
> 									\
>  		preempt_enable_notrace();				\

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 18:15 ` Steven Rostedt
  2020-03-09 18:42   ` Joel Fernandes
@ 2020-03-09 18:59   ` Thomas Gleixner
  2020-03-10  8:09     ` Masami Hiramatsu
  1 sibling, 1 reply; 44+ messages in thread
From: Thomas Gleixner @ 2020-03-09 18:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Peter Zijlstra, Masami Hiramatsu, Alexei Starovoitov,
	Mathieu Desnoyers, Paul E. McKenney, Joel Fernandes,
	Frederic Weisbecker

Steven Rostedt <rostedt@goodmis.org> writes:
> On Mon, 09 Mar 2020 18:02:32 +0100
> Thomas Gleixner <tglx@linutronix.de> wrote:
>> #1 Fragile low level entry code
>> 
>>    While I understand the desire of instrumentation to observe
>>    everything we really have to ask the question whether it is worth the
>>    trouble especially with entry trainwrecks like x86, PTI and other
>>    horrors in that area.
>> 
>>    I don't think so and we really should just bite the bullet and forbid
>>    any instrumentation in that code unless it is explicitly designed
>>    for that case, makes sense and has a real value from an observation
>>    perspective.
>> 
>>    This is very much related to #3..
>
> Now for just entry portions (entering into interrupt context or
> coming from userspace into normal kernel code) I agree, and I'm fine if we
> can move things around and not trace these entry points if possible.

I think we can. I spent quite some effort on splitting up the syscall
entry/exit stuff so only the low level entry up to the point where RCU
is watching and the low level exit where RCU is evtl. going back to idle
plus the related ASM maze. It's quite some tedious work, that's why I
came up with the section approach.
 
>> #2) Breakpoint utilization
>> 
>>     As recent findings have shown, breakpoint utilization needs to be
>>     extremly careful about not creating infinite breakpoint recursions.
>> 
>>     I think that's pretty much obvious, but falls into the overall
>>     question of how to protect callchains.
>
> This is rather unique, and I agree that its best to at least get to a point
> where we limit the tracing within breakpoint code. I'm fine with making
> rcu_nmi_exit() nokprobe too.

Yes, the break point stuff is unique, but it has nicely demonstrated how
much of the code is affected by it.

>> #3) RCU idle
>> 
>>     Being able to trace code inside RCU idle sections is very similar to
>>     the question raised in #1.
>> 
>>     Assume all of the instrumentation would be doing conditional RCU
>>     schemes, i.e.:
>> 
>>     if (rcuidle)
>>     	....
>>     else
>>         rcu_read_lock_sched()
>> 
>>     before invoking the actual instrumentation functions and of course
>>     undoing that right after it, that really begs the question whether
>>     it's worth it.
>> 
>>     Especially constructs like:
>> 
>>     trace_hardirqs_off()
>>        idx = srcu_read_lock()
>>        rcu_irq_enter_irqson();
>>        ...
>>        rcu_irq_exit_irqson();
>>        srcu_read_unlock(idx);
>> 
>>     if (user_mode)
>>        user_exit_irqsoff();
>>     else
>>        rcu_irq_enter();
>> 
>>     are really more than questionable. For 99.9999% of instrumentation
>>     users it's absolutely irrelevant whether this traces the interrupt
>>     disabled time of user_exit_irqsoff() or rcu_irq_enter() or not.
>> 
>>     But what's relevant is the tracer overhead which is e.g. inflicted
>>     with todays trace_hardirqs_off/on() implementation because that
>>     unconditionally uses the rcuidle variant with the scru/rcu_irq dance
>>     around every tracepoint.
>> 
>>     Even if the tracepoint sits in the ASM code it just covers about ~20
>>     low level ASM instructions more. The tracer invocation, which is
>>     even done twice when coming from user space on x86 (the second call
>>     is optimized in the tracer C-code), costs definitely way more
>>     cycles. When you take the scru/rcu_irq dance into account it's a
>>     complete disaster performance wise.
>
> Is this specifically to do with the kernel/trace/trace_preemptirqs.c code
> that was added by Joel?

Don't know who it added, but yes, the above is what it does when it's
expanded. I did some profiling with a trivial invalid syscall and
trace_hardirqs_off() shows very prominent there, while with the approach
I have in my entry series which does:

       lockdep_irq_off();
       enter_from_user_mode();
       __trace_hardirqs_off();

The tracing overhead goes down by ~ factor 2. That's a lot.
 
>> #4 Protecting call chains
>> 
>>    Our current approach of annotating functions with notrace/noprobe is
>>    pretty much broken.
>> 
>>    Functions which are marked NOPROBE or notrace call out into functions
>>    which are not marked and while this might be ok, there are enough
>>    places where it is not. But we have no way to verify that.
>
> Note, if notrace is an issue it shows up pretty quickly, as just enabling
> function tracing will enable all non notrace locations, and if something
> shouldn't be traced, it will crash immediately.

Steven, you're not really serious about this, right? This is tinkering
at best.

We have code pathes which are not necessarily covered in regular
testing, depend on config options etc.

Have you ever looked at code coverage maps? There are quite some spots
which we don't reach in testing.

So how do you explain the user that the blind spot he hit in the weird
situation on his server which he wanted to analyze crashed his machine?

Having 'off limit' sections allows us to do proper tool based analysis
with full coverage. That's really the only sensible approach.

> I have a RCU option for ftrace ops to set, if it requires RCU to be
> watching, and in that case, it wont call the callback if RCU is not
> watching.

That's nice but does not solve the problem.

>>    That's just a recipe for disaster. We really cannot request from
>>    sysadmins who want to use instrumentation to stare at the code first
>>    whether they can place/enable an instrumentation point somewhere.
>>    That'd be just a bad joke.
>> 
>>    I really think we need to have proper text sections which are off
>>    limit for any form of instrumentation and have tooling to analyze the
>>    calls into other sections. These calls need to be annotated as safe
>>    and intentional.
>> 
>> Thoughts?
>
> This can expand quite a bit. At least when I did something similar with
> NMIs, as there was a time I wanted to flag all places that could be called
> from NMI, and found that there's a lot of code that can be.
>
> I can imagine the same for marking nokprobes as well. And I really don't
> want to make all notrace stop tracing the entire function and all that it
> can call, as that will go back to removing all callers from NMIs as
> do_nmi() itself is notrace.

The point is that you have something like this:

section "text.offlimit"

nmi()
{
        do_fragile_stuff_on_enter();

        offlimit_safecall(do_some_instrumentable_stuff());

        do_fragile_stuff_on_exit();
}

section "text"

do_some_instrumentable_stuff()
{
}

So if someone adds another call

section "text.offlimit"

nmi()
{
        do_fragile_stuff_on_enter();

        offlimit_safecall(do_some_instrumentable_stuff());

        do_some_other_instrumentable_stuff();

        do_fragile_stuff_on_exit();
}

which is also in section "text" then the analysis tool will find the
missing offlimit_safecall() - or what ever method we chose to annotate
that stuff. Surely not an annotation on the called function itself
because that might be safe to call in one context but not in another.

These annotations are halfways easy to monitor for abuse and they should
be prominent enough in the code that at least for the people dealing
with that kind of code they act as a warning flag.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 18:42   ` Joel Fernandes
@ 2020-03-09 19:07     ` Steven Rostedt
  2020-03-09 19:20       ` Mathieu Desnoyers
  2020-03-16 15:02       ` Joel Fernandes
  0 siblings, 2 replies; 44+ messages in thread
From: Steven Rostedt @ 2020-03-09 19:07 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, Mathieu Desnoyers, Paul E. McKenney,
	Frederic Weisbecker, Daniel Bristot de Oliveira,
	Valentin Schneider

On Mon, 9 Mar 2020 11:42:28 -0700
Joel Fernandes <joel@joelfernandes.org> wrote:

> Just started a vacation here and will be back on January 12th. Will
> take a detailed look at Thomas's email at that time.

January 12th! Wow! Enjoy your sabbatical and see you next year! ;-)

-- Steve

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 18:52     ` Mathieu Desnoyers
@ 2020-03-09 19:09       ` Steven Rostedt
  2020-03-09 19:25         ` Mathieu Desnoyers
  0 siblings, 1 reply; 44+ messages in thread
From: Steven Rostedt @ 2020-03-09 19:09 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, paulmck, Joel Fernandes, Google,
	Frederic Weisbecker

On Mon, 9 Mar 2020 14:52:40 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> And when I say "go back to plain RCU", I really mean removing use of SRCU
> from the tracepoints until we have other purposes for it (e.g. taking
> faults within specific tracepoint probes such as syscall enter/exit).

Actually, with both you and Alexei talking about having a sleeping
tracepoint callback, where we can add a can sleep check (but not in the
DO_TRACE macro, I would think that registered sleeping callbacks would be
its own callback), I would think we do not want to remove the SRCU usage.

-- Steve

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 19:07     ` Steven Rostedt
@ 2020-03-09 19:20       ` Mathieu Desnoyers
  2020-03-16 15:02       ` Joel Fernandes
  1 sibling, 0 replies; 44+ messages in thread
From: Mathieu Desnoyers @ 2020-03-09 19:20 UTC (permalink / raw)
  To: rostedt
  Cc: Joel Fernandes, Google, Thomas Gleixner, linux-kernel,
	Peter Zijlstra, Masami Hiramatsu, Alexei Starovoitov, paulmck,
	Frederic Weisbecker, bristot, Valentin Schneider

----- On Mar 9, 2020, at 3:07 PM, rostedt rostedt@goodmis.org wrote:

> On Mon, 9 Mar 2020 11:42:28 -0700
> Joel Fernandes <joel@joelfernandes.org> wrote:
> 
>> Just started a vacation here and will be back on January 12th. Will
>> take a detailed look at Thomas's email at that time.
> 
> January 12th! Wow! Enjoy your sabbatical and see you next year! ;-)

If Joel can go through 10 months worth of email backlog, he is my hero. ;-)

On a side-note, I confirmed with Joel that he intends to come back on March 13th.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 19:09       ` Steven Rostedt
@ 2020-03-09 19:25         ` Mathieu Desnoyers
  0 siblings, 0 replies; 44+ messages in thread
From: Mathieu Desnoyers @ 2020-03-09 19:25 UTC (permalink / raw)
  To: rostedt
  Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, paulmck, Joel Fernandes, Google,
	Frederic Weisbecker

----- On Mar 9, 2020, at 3:09 PM, rostedt rostedt@goodmis.org wrote:

> On Mon, 9 Mar 2020 14:52:40 -0400 (EDT)
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> And when I say "go back to plain RCU", I really mean removing use of SRCU
>> from the tracepoints until we have other purposes for it (e.g. taking
>> faults within specific tracepoint probes such as syscall enter/exit).
> 
> Actually, with both you and Alexei talking about having a sleeping
> tracepoint callback, where we can add a can sleep check (but not in the
> DO_TRACE macro, I would think that registered sleeping callbacks would be
> its own callback), I would think we do not want to remove the SRCU usage.

Whether we keep it or add it back later when needed should not make much
difference.

In any case, considering that overhead which motivated use of SRCU for the rcuidle
case could instead be handled by using is_rcu_watching() and normal RCU, I would
prefer removing it from the rcuidle tracepoints for now, and add it back when we
add a new kind of "sleepable" tracepoints later on.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 18:37 ` Mathieu Desnoyers
  2020-03-09 18:44   ` Steven Rostedt
@ 2020-03-09 19:52   ` Thomas Gleixner
  2020-03-10 15:03     ` Mathieu Desnoyers
  2020-03-10  1:40   ` Alexei Starovoitov
  2 siblings, 1 reply; 44+ messages in thread
From: Thomas Gleixner @ 2020-03-09 19:52 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Peter Zijlstra, rostedt, Masami Hiramatsu,
	Alexei Starovoitov, paulmck, Joel Fernandes, Google,
	Frederic Weisbecker

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> ----- On Mar 9, 2020, at 1:02 PM, Thomas Gleixner tglx@linutronix.de wrote:
>> #1 Fragile low level entry code
>> 
>>   While I understand the desire of instrumentation to observe
>>   everything we really have to ask the question whether it is worth the
>>   trouble especially with entry trainwrecks like x86, PTI and other
>>   horrors in that area.
>> 
>>   I don't think so and we really should just bite the bullet and forbid
>>   any instrumentation in that code unless it is explicitly designed
>>   for that case, makes sense and has a real value from an observation
>>   perspective.
>> 
>>   This is very much related to #3..
>
> Do I understand correctly that you intend on moving all kernel low level
> entry/exit code into sections which cannot be instrumented by kprobes nor
> the function tracer, and require explicit whitelisting, either through
> annotations or use of explicit tracepoints ?

Pretty much so.

>> #2) Breakpoint utilization
>> 
>>    As recent findings have shown, breakpoint utilization needs to be
>>    extremly careful about not creating infinite breakpoint recursions.
>> 
>>    I think that's pretty much obvious, but falls into the overall
>>    question of how to protect callchains.
>
> I think there is another question that arises here: the lack of automated
> continuous testing of the kprobes coverage. We have performed some testing of
> various random permutations of kprobes instrumentation, and have succeeded in
> crashing the kernel in various ways. Unfortunately, that testing is not done
> on a continuous basis, and maintainers understandably have little spare time
> to play the whack-a-mole game of adding missing nokprobes annotations as
> the kernel code evolves.

That's why I think a section approach is more practicable.
 
>> #3) RCU idle
>>    are really more than questionable. For 99.9999% of instrumentation
>>    users it's absolutely irrelevant whether this traces the interrupt
>>    disabled time of user_exit_irqsoff() or rcu_irq_enter() or not.
>> 
>>    But what's relevant is the tracer overhead which is e.g. inflicted
>>    with todays trace_hardirqs_off/on() implementation because that
>>    unconditionally uses the rcuidle variant with the scru/rcu_irq dance
>>    around every tracepoint.
>
> I think one of the big issues here is that most of the uses of
> trace_hardirqs_off() are from sites which already have RCU watching,
> so we are doing heavy-weight operations for nothing.

That and in some places in the entry code we do the heavy weight
operations to cover 100 instructions which turn on RCU anyway. That does
not make any sense at all.

> I strongly suspect that most of the overhead we've been trying to avoid when
> introducing use of SRCU in rcuidle tracepoints was actually caused by callsites
> which use rcuidle tracepoints while having RCU watching, just because there is a
> handful of callsites which don't have RCU watching. This is confirmed
> by the commit message of commit e6753f23d9 "tracepoint: Make rcuidle
> tracepoint callers use SRCU":
>
>    "In recent tests with IRQ on/off tracepoints, a large performance
>     overhead ~10% is noticed when running hackbench. This is root caused to
>     calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
>     tracepoint code. Following a long discussion on the list [1] about this,
>     we concluded that srcu is a better alternative for use during rcu idle.
>     Although it does involve extra barriers, its lighter than the sched-rcu
>     version which has to do additional RCU calls to notify RCU idle about
>     entry into RCU sections.
> [...]
>     Test: Tested idle and preempt/irq tracepoints."

In a quick test I did with a invalid syscall number with profiling the
trace_hardirqs_off() is pretty prominent and goes down by roughly a
factor of 2 when I move it past enter_from_user_mode() and use just the
non RCU idle variant.

> So I think we could go back to plain RCU for rcuidle tracepoints if we do
> the cheaper "rcu_is_watching()" check rather than invoking
> rcu_irq_{enter,exit}_irqson() unconditionally.

Haven't tried that yet for the overall usage of trace_hardirqs_off(),
but yes, it's going to be a measurable difference.

>>    Even if the tracepoint sits in the ASM code it just covers about ~20
>>    low level ASM instructions more. The tracer invocation, which is
>>    even done twice when coming from user space on x86 (the second call
>>    is optimized in the tracer C-code), costs definitely way more
>>    cycles. When you take the scru/rcu_irq dance into account it's a
>>    complete disaster performance wise.
>
> Part of the issue here is the current overhead of SRCU read-side lock,
> which contains memory barriers. The other part of the issue is the fact that
> rcu_irq_{enter,exit}_irqson() contains an atomic_add_return atomic instruction.
>
> We could use the approach proposed by Peterz's and Steven's patches to basically
> do a lightweight "is_rcu_watching()" check for rcuidle tracepoint, and only enable
> RCU for those cases. We could then simply go back on using regular RCU
> like so:

Right, but that still does the whole rcu_irq dance especially in the
entry code just to trace 50 or 100 instructions which are turning on RCU
anyway.

>> #4 Protecting call chains
>> 
>>   Our current approach of annotating functions with notrace/noprobe is
>>   pretty much broken.
>> 
>>   Functions which are marked NOPROBE or notrace call out into functions
>>   which are not marked and while this might be ok, there are enough
>>   places where it is not. But we have no way to verify that.
>> 
>>   That's just a recipe for disaster. We really cannot request from
>>   sysadmins who want to use instrumentation to stare at the code first
>>   whether they can place/enable an instrumentation point somewhere.
>>   That'd be just a bad joke.
>> 
>>   I really think we need to have proper text sections which are off
>>   limit for any form of instrumentation and have tooling to analyze the
>>   calls into other sections. These calls need to be annotated as safe
>>   and intentional.
>
> If we go all the way into that direction, I suspect it might even make sense
> to duplicate some kernel functions so they can still be part of the code which
> can be instrumented, but provide tracing-specific copy which would be hidden
> from instrumentation. I wonder what would be the size cost of this
> duplication.

That might happen, but we'll see how much of that is truly
requried. It's not horrible large. For the int3 isolation most of it was
fixing up code which should have been protected in the first place and
the only copied code was bsearch which is tiny.

> In addition to splitting tracing code into a separate section, which I think
> makes sense, I can think of another alternative way to provide call chains
> protection: adding a "in_tracing" flag somewhere alongside each kernel stack.
> Each thread and interrupt stack would have its own flag. However, trap handlers
> should share the "in_tracing" flag with the context which triggers the trap.
>
> If a tracer recurses, or if a tracer attempts to trace another tracer, the
> instrumentation would break the recursion chain by preventing instrumentation
> from firing. If we end up caring about tracers tracing other tracers, we could
> have one distinct flag per tracer and let each tracer break the recursion chain.
>
> Having this flag per kernel stack rather than per CPU or per thread would
> allow tracing of nested interrupt handlers (and NMIs), but would break
> call chains both within the same stack or going through a trap. I think
> it could be a nice complementary safety net to handle mishaps in a non-fatal
> way.

That works as long as none of this uses breakpoint based patching to
dynamically disable/enable stuff.

Thanks,

        tglx





^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 17:02 Instrumentation and RCU Thomas Gleixner
  2020-03-09 18:15 ` Steven Rostedt
  2020-03-09 18:37 ` Mathieu Desnoyers
@ 2020-03-09 20:18 ` Peter Zijlstra
  2020-03-09 20:47 ` Paul E. McKenney
  3 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2020-03-09 20:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Steven Rostedt, Masami Hiramatsu, Alexei Starovoitov,
	Mathieu Desnoyers, Paul E. McKenney, Joel Fernandes,
	Frederic Weisbecker, Dan Carpenter, Josh Poimboeuf

On Mon, Mar 09, 2020 at 06:02:32PM +0100, Thomas Gleixner wrote:
> #4 Protecting call chains
> 
>    Our current approach of annotating functions with notrace/noprobe is
>    pretty much broken.
> 
>    Functions which are marked NOPROBE or notrace call out into functions
>    which are not marked and while this might be ok, there are enough
>    places where it is not. But we have no way to verify that.
> 
>    That's just a recipe for disaster. We really cannot request from
>    sysadmins who want to use instrumentation to stare at the code first
>    whether they can place/enable an instrumentation point somewhere.
>    That'd be just a bad joke.
> 
>    I really think we need to have proper text sections which are off
>    limit for any form of instrumentation and have tooling to analyze the
>    calls into other sections. These calls need to be annotated as safe
>    and intentional.

So the only tool I know of that does full callchains is smatch. And in
one of my series I did prod Dan about this.

The alternative is that we bite the bullet and add a vmlinux objtool
pass. This keeps getting mentioned, so maybe it is time :/ I'd hate it,
because it will increase build time at the slowest point, but it'd get
us the coverage we need here.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 17:02 Instrumentation and RCU Thomas Gleixner
                   ` (2 preceding siblings ...)
  2020-03-09 20:18 ` Peter Zijlstra
@ 2020-03-09 20:47 ` Paul E. McKenney
  2020-03-09 20:58   ` Steven Rostedt
                     ` (2 more replies)
  3 siblings, 3 replies; 44+ messages in thread
From: Paul E. McKenney @ 2020-03-09 20:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Peter Zijlstra, Steven Rostedt, Masami Hiramatsu,
	Alexei Starovoitov, Mathieu Desnoyers, Joel Fernandes,
	Frederic Weisbecker

On Mon, Mar 09, 2020 at 06:02:32PM +0100, Thomas Gleixner wrote:
> Folks,
> 
> I'm starting a new conversation because there are about 20 different
> threads which look at that problem in various ways and the information
> is so scattered that creating a coherent picture is pretty much
> impossible.
> 
> There are several problems to solve:
> 
>    1) Fragile low level entry code
> 
>    2) Breakpoint utilization
> 
>    3) RCU idle
> 
>    4) Callchain protection
> 
> #1 Fragile low level entry code
> 
>    While I understand the desire of instrumentation to observe
>    everything we really have to ask the question whether it is worth the
>    trouble especially with entry trainwrecks like x86, PTI and other
>    horrors in that area.
> 
>    I don't think so and we really should just bite the bullet and forbid
>    any instrumentation in that code unless it is explicitly designed
>    for that case, makes sense and has a real value from an observation
>    perspective.
> 
>    This is very much related to #3..
> 
> #2) Breakpoint utilization
> 
>     As recent findings have shown, breakpoint utilization needs to be
>     extremly careful about not creating infinite breakpoint recursions.
> 
>     I think that's pretty much obvious, but falls into the overall
>     question of how to protect callchains.
> 
> #3) RCU idle
> 
>     Being able to trace code inside RCU idle sections is very similar to
>     the question raised in #1.
> 
>     Assume all of the instrumentation would be doing conditional RCU
>     schemes, i.e.:
> 
>     if (rcuidle)
>     	....
>     else
>         rcu_read_lock_sched()
> 
>     before invoking the actual instrumentation functions and of course
>     undoing that right after it, that really begs the question whether
>     it's worth it.
> 
>     Especially constructs like:
> 
>     trace_hardirqs_off()
>        idx = srcu_read_lock()
>        rcu_irq_enter_irqson();
>        ...
>        rcu_irq_exit_irqson();
>        srcu_read_unlock(idx);
> 
>     if (user_mode)
>        user_exit_irqsoff();
>     else
>        rcu_irq_enter();
> 
>     are really more than questionable. For 99.9999% of instrumentation
>     users it's absolutely irrelevant whether this traces the interrupt
>     disabled time of user_exit_irqsoff() or rcu_irq_enter() or not.
> 
>     But what's relevant is the tracer overhead which is e.g. inflicted
>     with todays trace_hardirqs_off/on() implementation because that
>     unconditionally uses the rcuidle variant with the scru/rcu_irq dance
>     around every tracepoint.
> 
>     Even if the tracepoint sits in the ASM code it just covers about ~20
>     low level ASM instructions more. The tracer invocation, which is
>     even done twice when coming from user space on x86 (the second call
>     is optimized in the tracer C-code), costs definitely way more
>     cycles. When you take the scru/rcu_irq dance into account it's a
>     complete disaster performance wise.

Suppose that we had a variant of RCU that had about the same read-side
overhead as Preempt-RCU, but which could be used from idle as well as
from CPUs in the process of coming online or going offline?  I have not
thought through the irq/NMI/exception entry/exit cases, but I don't see
why that would be problem.

This would have explicit critical-section entry/exit code, so it would
not be any help for trampolines.

Would such a variant of RCU help?

Yeah, I know.  Just what the kernel doesn't need, yet another variant
of RCU...

							Thanx, Paul

> #4 Protecting call chains
> 
>    Our current approach of annotating functions with notrace/noprobe is
>    pretty much broken.
> 
>    Functions which are marked NOPROBE or notrace call out into functions
>    which are not marked and while this might be ok, there are enough
>    places where it is not. But we have no way to verify that.
> 
>    That's just a recipe for disaster. We really cannot request from
>    sysadmins who want to use instrumentation to stare at the code first
>    whether they can place/enable an instrumentation point somewhere.
>    That'd be just a bad joke.
> 
>    I really think we need to have proper text sections which are off
>    limit for any form of instrumentation and have tooling to analyze the
>    calls into other sections. These calls need to be annotated as safe
>    and intentional.
> 
> Thoughts?
> 
> Thanks,
> 
>         tglx
> 
> 
> 
> 
> 
> 
>    
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 20:47 ` Paul E. McKenney
@ 2020-03-09 20:58   ` Steven Rostedt
  2020-03-09 21:25     ` Paul E. McKenney
  2020-03-09 23:52   ` Frederic Weisbecker
  2020-03-10 15:13   ` Mathieu Desnoyers
  2 siblings, 1 reply; 44+ messages in thread
From: Steven Rostedt @ 2020-03-09 20:58 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, Mathieu Desnoyers, Joel Fernandes,
	Frederic Weisbecker

On Mon, 9 Mar 2020 13:47:10 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> Yeah, I know.  Just what the kernel doesn't need, yet another variant
> of RCU...

And call this variant: yarcu (Yet Another RCU) (pronounced yar-ku ?)

-- Steve

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 20:58   ` Steven Rostedt
@ 2020-03-09 21:25     ` Paul E. McKenney
  0 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2020-03-09 21:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, Mathieu Desnoyers, Joel Fernandes,
	Frederic Weisbecker

On Mon, Mar 09, 2020 at 04:58:11PM -0400, Steven Rostedt wrote:
> On Mon, 9 Mar 2020 13:47:10 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > Yeah, I know.  Just what the kernel doesn't need, yet another variant
> > of RCU...
> 
> And call this variant: yarcu (Yet Another RCU) (pronounced yar-ku ?)

I was thinking in terms of rcu_read_lock_trace(), but hey!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 20:47 ` Paul E. McKenney
  2020-03-09 20:58   ` Steven Rostedt
@ 2020-03-09 23:52   ` Frederic Weisbecker
  2020-03-10  2:26     ` Paul E. McKenney
  2020-03-10 15:13   ` Mathieu Desnoyers
  2 siblings, 1 reply; 44+ messages in thread
From: Frederic Weisbecker @ 2020-03-09 23:52 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Steven Rostedt,
	Masami Hiramatsu, Alexei Starovoitov, Mathieu Desnoyers,
	Joel Fernandes

On Mon, Mar 09, 2020 at 01:47:10PM -0700, Paul E. McKenney wrote:
> On Mon, Mar 09, 2020 at 06:02:32PM +0100, Thomas Gleixner wrote:
> > #3) RCU idle
> > 
> >     Being able to trace code inside RCU idle sections is very similar to
> >     the question raised in #1.
> > 
> >     Assume all of the instrumentation would be doing conditional RCU
> >     schemes, i.e.:
> > 
> >     if (rcuidle)
> >     	....
> >     else
> >         rcu_read_lock_sched()
> > 
> >     before invoking the actual instrumentation functions and of course
> >     undoing that right after it, that really begs the question whether
> >     it's worth it.
> > 
> >     Especially constructs like:
> > 
> >     trace_hardirqs_off()
> >        idx = srcu_read_lock()
> >        rcu_irq_enter_irqson();
> >        ...
> >        rcu_irq_exit_irqson();
> >        srcu_read_unlock(idx);
> > 
> >     if (user_mode)
> >        user_exit_irqsoff();
> >     else
> >        rcu_irq_enter();
> > 
> >     are really more than questionable. For 99.9999% of instrumentation
> >     users it's absolutely irrelevant whether this traces the interrupt
> >     disabled time of user_exit_irqsoff() or rcu_irq_enter() or not.
> > 
> >     But what's relevant is the tracer overhead which is e.g. inflicted
> >     with todays trace_hardirqs_off/on() implementation because that
> >     unconditionally uses the rcuidle variant with the scru/rcu_irq dance
> >     around every tracepoint.
> > 
> >     Even if the tracepoint sits in the ASM code it just covers about ~20
> >     low level ASM instructions more. The tracer invocation, which is
> >     even done twice when coming from user space on x86 (the second call
> >     is optimized in the tracer C-code), costs definitely way more
> >     cycles. When you take the scru/rcu_irq dance into account it's a
> >     complete disaster performance wise.
> 
> Suppose that we had a variant of RCU that had about the same read-side
> overhead as Preempt-RCU, but which could be used from idle as well as
> from CPUs in the process of coming online or going offline?  I have not
> thought through the irq/NMI/exception entry/exit cases, but I don't see
> why that would be problem.
> 
> This would have explicit critical-section entry/exit code, so it would
> not be any help for trampolines.
> 
> Would such a variant of RCU help?
> 
> Yeah, I know.  Just what the kernel doesn't need, yet another variant
> of RCU...
> 

I was thinking about having a tracing-specific implementation of RCU.
Last week Steve told me that the tracing ring buffer has its own ad-hoc
RCU implementation which schedule a thread on each CPU to complete a grace
period (did I understand it right?). Of course such a flavour of RCU wouldn't
be nice to nohz_full but surely we can arrange some tweaks for those who
require strong isolation. I'm sure you're having a much better idea though.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 18:37 ` Mathieu Desnoyers
  2020-03-09 18:44   ` Steven Rostedt
  2020-03-09 19:52   ` Thomas Gleixner
@ 2020-03-10  1:40   ` Alexei Starovoitov
  2020-03-10  8:02     ` Thomas Gleixner
                       ` (2 more replies)
  2 siblings, 3 replies; 44+ messages in thread
From: Alexei Starovoitov @ 2020-03-10  1:40 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, rostedt,
	Masami Hiramatsu, Alexei Starovoitov, paulmck, Joel Fernandes,
	Google, Frederic Weisbecker, bpf

On Mon, Mar 09, 2020 at 02:37:40PM -0400, Mathieu Desnoyers wrote:
> > 
> >    But what's relevant is the tracer overhead which is e.g. inflicted
> >    with todays trace_hardirqs_off/on() implementation because that
> >    unconditionally uses the rcuidle variant with the scru/rcu_irq dance
> >    around every tracepoint.
> 
> I think one of the big issues here is that most of the uses of
> trace_hardirqs_off() are from sites which already have RCU watching,
> so we are doing heavy-weight operations for nothing.

I think kernel/trace/trace_preemptirq.c created too many problems for the
kernel without providing tangible benefits. My understanding no one is using it
in production. It's a tool to understand how kernel works. And such debugging
tool can and should be removed.

One of Thomas's patches mentioned that bpf can be invoked from hardirq and
preempt tracers. This connection doesn't exist in a direct way, but
theoretically it's possible. There is no practical use though and I would be
happy to blacklist such bpf usage at a minimum.

> We could use the approach proposed by Peterz's and Steven's patches to basically
> do a lightweight "is_rcu_watching()" check for rcuidle tracepoint, and only enable
> RCU for those cases. We could then simply go back on using regular RCU like so:
> 
> #define __DO_TRACE(tp, proto, args, cond, rcuidle)                      \
>         do {                                                            \
>                 struct tracepoint_func *it_func_ptr;                    \
>                 void *it_func;                                          \
>                 void *__data;                                           \
>                 bool exit_rcu = false;                                  \
>                                                                         \
>                 if (!(cond))                                            \
>                         return;                                         \
>                                                                         \
>                 if (rcuidle && !rcu_is_watching()) {                    \
>                         rcu_irq_enter_irqson();                         \
>                         exit_rcu = true;                                \
>                 }                                                       \
>                 preempt_disable_notrace();                              \
>                 it_func_ptr = rcu_dereference_raw((tp)->funcs);         \
>                 if (it_func_ptr) {                                      \
>                         do {                                            \
>                                 it_func = (it_func_ptr)->func;          \
>                                 __data = (it_func_ptr)->data;           \
>                                 ((void(*)(proto))(it_func))(args);      \
>                         } while ((++it_func_ptr)->func);                \
>                 }                                                       \
>                 preempt_enable_notrace();                               \
>                 if (exit_rcu)                                           \
>                         rcu_irq_exit_irqson();                          \
>         } while (0)

I think it's a fine approach interim.

Long term sounds like Paul is going to provide sleepable and low overhead
rcu_read_lock_for_tracers() that will include bpf.
My understanding that this new rcu flavor won't have "idle" issues,
so rcu_is_watching() checks will not be necessary.
And if we remove trace_preemptirq.c the only thing left will be Thomas's points
1 (low level entry) and 2 (breakpoints) that can be addressed without
creating fancy .text annotations and teach objtool about it.

In the mean time I've benchmarked srcu for sleepable bpf and it's quite heavy.
srcu_read_lock+unlock roughly adds 10x execution cost to trivial bpf prog.
I'm proceeding with it anyway, but really hoping that
rcu_read_lock_for_tracers() will materialize soon.

In general I'm sceptical that .text annotations will work. Let's say all of
idle is a red zone. But a ton of normal functions are called when idle. So
objtool will go and mark them as red zone too. This way large percent of the
kernel will be off limits for tracers. Which is imo not a good trade off. I
think addressing 1 and 2 with explicit notrace/nokprobe annotations will cover
all practical cases where people can shot themselves in a foot with a tracer. I
realize that there will be forever whack-a-mole game and these annotations will
never reach 100%. I think it's a fine trade off. Security is never 100% either.
Tracing is never going to be 100% safe too.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 23:52   ` Frederic Weisbecker
@ 2020-03-10  2:26     ` Paul E. McKenney
  0 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2020-03-10  2:26 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Steven Rostedt,
	Masami Hiramatsu, Alexei Starovoitov, Mathieu Desnoyers,
	Joel Fernandes

On Tue, Mar 10, 2020 at 12:52:11AM +0100, Frederic Weisbecker wrote:
> On Mon, Mar 09, 2020 at 01:47:10PM -0700, Paul E. McKenney wrote:
> > On Mon, Mar 09, 2020 at 06:02:32PM +0100, Thomas Gleixner wrote:
> > > #3) RCU idle
> > > 
> > >     Being able to trace code inside RCU idle sections is very similar to
> > >     the question raised in #1.
> > > 
> > >     Assume all of the instrumentation would be doing conditional RCU
> > >     schemes, i.e.:
> > > 
> > >     if (rcuidle)
> > >     	....
> > >     else
> > >         rcu_read_lock_sched()
> > > 
> > >     before invoking the actual instrumentation functions and of course
> > >     undoing that right after it, that really begs the question whether
> > >     it's worth it.
> > > 
> > >     Especially constructs like:
> > > 
> > >     trace_hardirqs_off()
> > >        idx = srcu_read_lock()
> > >        rcu_irq_enter_irqson();
> > >        ...
> > >        rcu_irq_exit_irqson();
> > >        srcu_read_unlock(idx);
> > > 
> > >     if (user_mode)
> > >        user_exit_irqsoff();
> > >     else
> > >        rcu_irq_enter();
> > > 
> > >     are really more than questionable. For 99.9999% of instrumentation
> > >     users it's absolutely irrelevant whether this traces the interrupt
> > >     disabled time of user_exit_irqsoff() or rcu_irq_enter() or not.
> > > 
> > >     But what's relevant is the tracer overhead which is e.g. inflicted
> > >     with todays trace_hardirqs_off/on() implementation because that
> > >     unconditionally uses the rcuidle variant with the scru/rcu_irq dance
> > >     around every tracepoint.
> > > 
> > >     Even if the tracepoint sits in the ASM code it just covers about ~20
> > >     low level ASM instructions more. The tracer invocation, which is
> > >     even done twice when coming from user space on x86 (the second call
> > >     is optimized in the tracer C-code), costs definitely way more
> > >     cycles. When you take the scru/rcu_irq dance into account it's a
> > >     complete disaster performance wise.
> > 
> > Suppose that we had a variant of RCU that had about the same read-side
> > overhead as Preempt-RCU, but which could be used from idle as well as
> > from CPUs in the process of coming online or going offline?  I have not
> > thought through the irq/NMI/exception entry/exit cases, but I don't see
> > why that would be problem.
> > 
> > This would have explicit critical-section entry/exit code, so it would
> > not be any help for trampolines.
> > 
> > Would such a variant of RCU help?
> > 
> > Yeah, I know.  Just what the kernel doesn't need, yet another variant
> > of RCU...
> 
> I was thinking about having a tracing-specific implementation of RCU.
> Last week Steve told me that the tracing ring buffer has its own ad-hoc
> RCU implementation which schedule a thread on each CPU to complete a grace
> period (did I understand it right?). Of course such a flavour of RCU wouldn't
> be nice to nohz_full but surely we can arrange some tweaks for those who
> require strong isolation. I'm sure you're having a much better idea though.

Well, that too.  Please see CONFIG_TASKS_RCU_RUDE in current
"dev" on -rcu.  But yes, another is on its way...

Hey, it compiled, so it much be perfect, right?  :-/

							Thanx, Paul

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-10  1:40   ` Alexei Starovoitov
@ 2020-03-10  8:02     ` Thomas Gleixner
  2020-03-10 16:54     ` Paul E. McKenney
  2020-03-17 17:56     ` Joel Fernandes
  2 siblings, 0 replies; 44+ messages in thread
From: Thomas Gleixner @ 2020-03-10  8:02 UTC (permalink / raw)
  To: Alexei Starovoitov, Mathieu Desnoyers
  Cc: linux-kernel, Peter Zijlstra, rostedt, Masami Hiramatsu,
	Alexei Starovoitov, paulmck, Joel Fernandes, Google,
	Frederic Weisbecker, bpf

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> In general I'm sceptical that .text annotations will work. Let's say all of
> idle is a red zone. But a ton of normal functions are called when idle. So
> objtool will go and mark them as red zone too.

No. If you carefully read what I proposed its:

noinst foo()
{
        protected_work();
        
        instr_begin();

        invoke_otherstuff();

        instr_end();

        moar_protected_work();

}

objtool will not mark anything. It will check that invocations out of
the protected section are marked as safe, i.e. inside a
instr_begin/end() pair.

So if you fail to mark protected_work() as noinstr then it will
complain. If you forget to put instr_begin/end() around the safe area it
will complain about invoke_otherstuff().

So it's a very targeted approach. objtool is there to verify that it's
consistent nothing else.

> This way large percent of the
> kernel will be off limits for tracers. Which is imo not a good trade off. I
> think addressing 1 and 2 with explicit notrace/nokprobe annotations will cover
> all practical cases where people can shot themselves in a foot with a
> tracer.

That's simply wishful thinking. The discussions in the last weeks have
clearly demonstrated that this is not the case. People were truly
convinced that e.g. probing rcu_idle_exit() is safe, but it was
not. Read the thread how long this went on.

> I realize that there will be forever whack-a-mole game and these
> annotations will never reach 100%. I think it's a fine trade
> off. Security is never 100% either.  Tracing is never going to be 100%
> safe too.

I disagree. Whack a mole games are horrible and have a guaranteed
high failure rate. Otherwise we would not discuss this at all.

And no, it's not a fine trade off.

If we can have technical means to prevent the wreckage, then not using
them for handwaving reasons is just violating the only sane engineering
principle:

        Correctness first

I spent the last 20 years mopping up the violations of this principle.

We have to stop the "features first, performance first" and "good
enough" mentality if we want to master the ever increasing complexity of
hardware and software in the long run.

From my experience of cleaning up stuff, I can tell you, that
correctness first neither hurts performance nor does it prevent
features, except those which are wrong to begin with.

As quite some people do not care about or even willfully ignore
"correctness first", we have to force them to adhere by technical means,
which spares us to mop up the mess they'd create otherwise.

And even for those who deeply care tooling support is a great help to
prevent the accidental slip up. I wish I could have spared chasing call
chains manually and then figure out two days later that I missed
something.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 18:59   ` Thomas Gleixner
@ 2020-03-10  8:09     ` Masami Hiramatsu
  2020-03-10 11:43       ` Thomas Gleixner
                         ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Masami Hiramatsu @ 2020-03-10  8:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt, LKML, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, Mathieu Desnoyers, Paul E. McKenney,
	Joel Fernandes, Frederic Weisbecker, Jason Wessel

Hi,

On Mon, 09 Mar 2020 19:59:18 +0100
Thomas Gleixner <tglx@linutronix.de> wrote:

> >> #2) Breakpoint utilization
> >> 
> >>     As recent findings have shown, breakpoint utilization needs to be
> >>     extremly careful about not creating infinite breakpoint recursions.
> >> 
> >>     I think that's pretty much obvious, but falls into the overall
> >>     question of how to protect callchains.
> >
> > This is rather unique, and I agree that its best to at least get to a point
> > where we limit the tracing within breakpoint code. I'm fine with making
> > rcu_nmi_exit() nokprobe too.
> 
> Yes, the break point stuff is unique, but it has nicely demonstrated how
> much of the code is affected by it.

I see. I had followed the callchain several times, and always found new function.
So I agree with the off-limit section idea. That is a kind of entry code section
but more generic one. It is natural to split such sensitive code in different
place.

BTW, what about kdb stuffs? (+Cc Jason)

> >> #4 Protecting call chains
> >> 
> >>    Our current approach of annotating functions with notrace/noprobe is
> >>    pretty much broken.
> >> 
> >>    Functions which are marked NOPROBE or notrace call out into functions
> >>    which are not marked and while this might be ok, there are enough
> >>    places where it is not. But we have no way to verify that.

Agreed. That's the reason why I haven't add kprobe-fuzzer yet.
It is easy to make a fuzzer for kprobes by ftrace (note that we need
to enable CONFIG_KPROBE_EVENTS_ON_NOTRACE=y to check notrace functions),
but there is no way to kick the target code. In the result, most of the
kprobed functions are just not hit. I'm not sure such test code is
reasonable or not.

> > Note, if notrace is an issue it shows up pretty quickly, as just enabling
> > function tracing will enable all non notrace locations, and if something
> > shouldn't be traced, it will crash immediately.
> 
> Steven, you're not really serious about this, right? This is tinkering
> at best.
> 
> We have code pathes which are not necessarily covered in regular
> testing, depend on config options etc.
> 
> Have you ever looked at code coverage maps? There are quite some spots
> which we don't reach in testing.
> 
> So how do you explain the user that the blind spot he hit in the weird
> situation on his server which he wanted to analyze crashed his machine?
> 
> Having 'off limit' sections allows us to do proper tool based analysis
> with full coverage. That's really the only sensible approach.
> 
> > I have a RCU option for ftrace ops to set, if it requires RCU to be
> > watching, and in that case, it wont call the callback if RCU is not
> > watching.
> 
> That's nice but does not solve the problem.
> 
> >>    That's just a recipe for disaster. We really cannot request from
> >>    sysadmins who want to use instrumentation to stare at the code first
> >>    whether they can place/enable an instrumentation point somewhere.
> >>    That'd be just a bad joke.
> >> 
> >>    I really think we need to have proper text sections which are off
> >>    limit for any form of instrumentation and have tooling to analyze the
> >>    calls into other sections. These calls need to be annotated as safe
> >>    and intentional.
> >> 
> >> Thoughts?
> >
> > This can expand quite a bit. At least when I did something similar with
> > NMIs, as there was a time I wanted to flag all places that could be called
> > from NMI, and found that there's a lot of code that can be.
> >
> > I can imagine the same for marking nokprobes as well. And I really don't
> > want to make all notrace stop tracing the entire function and all that it
> > can call, as that will go back to removing all callers from NMIs as
> > do_nmi() itself is notrace.
> 
> The point is that you have something like this:
> 
> section "text.offlimit"
> 
> nmi()
> {
>         do_fragile_stuff_on_enter();
> 
>         offlimit_safecall(do_some_instrumentable_stuff());
> 
>         do_fragile_stuff_on_exit();
> }
> 
> section "text"
> 
> do_some_instrumentable_stuff()
> {
> }
> 
> So if someone adds another call
> 
> section "text.offlimit"
> 
> nmi()
> {
>         do_fragile_stuff_on_enter();
> 
>         offlimit_safecall(do_some_instrumentable_stuff());
> 
>         do_some_other_instrumentable_stuff();
> 
>         do_fragile_stuff_on_exit();
> }
> 
> which is also in section "text" then the analysis tool will find the
> missing offlimit_safecall() - or what ever method we chose to annotate
> that stuff. Surely not an annotation on the called function itself
> because that might be safe to call in one context but not in another.

Hmm, what the offlimit_safecall() does? and what happen if the 
do_fragile_stuff_on_enter() invokes a library code? I think we also need
to tweak kbuild to duplicate some library code to the off-limit text area.

> These annotations are halfways easy to monitor for abuse and they should
> be prominent enough in the code that at least for the people dealing
> with that kind of code they act as a warning flag.

This off-limit text will be good for entries, but I think we still not
able to remove all NOKPROBE_SYMBOLS with this.

For example __die() is marked a NOKPROBE because if we hit a recursive
int3, it calls BUG() to dump stacks etc for debug. So that function
must NOT probed. (I think we also should mark all backtrace functions
in this case, but not yet) Would we move those backtrace related
functions (including printk, and console drivers?) into the offlimit
text too?

Hmm, if there is a bust_kprobes(), that can be easy to fix this issue.


Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-10  8:09     ` Masami Hiramatsu
@ 2020-03-10 11:43       ` Thomas Gleixner
  2020-03-10 15:31         ` Mathieu Desnoyers
                           ` (2 more replies)
  2020-03-10 15:24       ` Mathieu Desnoyers
  2020-03-10 17:05       ` Daniel Thompson
  2 siblings, 3 replies; 44+ messages in thread
From: Thomas Gleixner @ 2020-03-10 11:43 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Steven Rostedt, LKML, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, Mathieu Desnoyers, Paul E. McKenney,
	Joel Fernandes, Frederic Weisbecker, Jason Wessel

Masami,

Masami Hiramatsu <mhiramat@kernel.org> writes:
> On Mon, 09 Mar 2020 19:59:18 +0100
> Thomas Gleixner <tglx@linutronix.de> wrote:
>
>> >> #2) Breakpoint utilization
>> >> 
>> >>     As recent findings have shown, breakpoint utilization needs to be
>> >>     extremly careful about not creating infinite breakpoint recursions.
>> >> 
>> >>     I think that's pretty much obvious, but falls into the overall
>> >>     question of how to protect callchains.
>> >
>> > This is rather unique, and I agree that its best to at least get to a point
>> > where we limit the tracing within breakpoint code. I'm fine with making
>> > rcu_nmi_exit() nokprobe too.
>> 
>> Yes, the break point stuff is unique, but it has nicely demonstrated how
>> much of the code is affected by it.
>
> I see. I had followed the callchain several times, and always found new function.
> So I agree with the off-limit section idea. That is a kind of entry code section
> but more generic one. It is natural to split such sensitive code in different
> place.
>
> BTW, what about kdb stuffs? (+Cc Jason)

That's yet another area of wreckage which nobody every looked at.

>> >> #4 Protecting call chains
>> >> 
>> >>    Our current approach of annotating functions with notrace/noprobe is
>> >>    pretty much broken.
>> >> 
>> >>    Functions which are marked NOPROBE or notrace call out into functions
>> >>    which are not marked and while this might be ok, there are enough
>> >>    places where it is not. But we have no way to verify that.
>
> Agreed. That's the reason why I haven't add kprobe-fuzzer yet.
> It is easy to make a fuzzer for kprobes by ftrace (note that we need
> to enable CONFIG_KPROBE_EVENTS_ON_NOTRACE=y to check notrace functions),
> but there is no way to kick the target code. In the result, most of the
> kprobed functions are just not hit. I'm not sure such test code is
> reasonable or not.

Well, test code is always reasonable, but you have to be aware that code
coverage is a really hard to solve problem with a code base as complex
as the kernel.

>> which is also in section "text" then the analysis tool will find the
>> missing offlimit_safecall() - or what ever method we chose to annotate
>> that stuff. Surely not an annotation on the called function itself
>> because that might be safe to call in one context but not in another.
>
> Hmm, what the offlimit_safecall() does? and what happen if the 
> do_fragile_stuff_on_enter() invokes a library code? I think we also need
> to tweak kbuild to duplicate some library code to the off-limit text
> area.

That's why we want the sections and the annotation. If something calls
out of a noinstr section into a regular text section and the call is not
annotated at the call site, then objtool can complain and tell you. What
Peter and I came up with looks like this:

noinstr foo()
	do_protected(); <- Safe because in the noinstr section

	instr_begin();	<- Marks the begin of a safe region, ignored
        		   by objtool

        do_stuff();     <- All good   

        instr_end();    <- End of the safe region. objtool starts
			   looking again

        do_other_stuff();  <- Unsafe because do_other_stuff() is
        		      not protected
and:

noinstr do_protected()
        bar();		<- objtool will complain here

See?

>> These annotations are halfways easy to monitor for abuse and they should
>> be prominent enough in the code that at least for the people dealing
>> with that kind of code they act as a warning flag.
>
> This off-limit text will be good for entries, but I think we still not
> able to remove all NOKPROBE_SYMBOLS with this.
>
> For example __die() is marked a NOKPROBE because if we hit a recursive
> int3, it calls BUG() to dump stacks etc for debug. So that function
> must NOT probed. (I think we also should mark all backtrace functions
> in this case, but not yet) Would we move those backtrace related
> functions (including printk, and console drivers?) into the offlimit
> text too?

That's something we need to figure out and decide on. Some of this stuff
sureley wants to be in the noinstr section. Other things might end up
still being explicitely annotated, but that should be the exception not
the rule.

> Hmm, if there is a bust_kprobes(), that can be easy to fix this issue.

That might help, but is obviously racy as hell.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 19:52   ` Thomas Gleixner
@ 2020-03-10 15:03     ` Mathieu Desnoyers
  2020-03-10 16:48       ` Thomas Gleixner
  0 siblings, 1 reply; 44+ messages in thread
From: Mathieu Desnoyers @ 2020-03-10 15:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Peter Zijlstra, rostedt, Masami Hiramatsu,
	Alexei Starovoitov, paulmck, Joel Fernandes, Google,
	Frederic Weisbecker

----- On Mar 9, 2020, at 3:52 PM, Thomas Gleixner tglx@linutronix.de wrote:

> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
>> ----- On Mar 9, 2020, at 1:02 PM, Thomas Gleixner tglx@linutronix.de wrote:
[...]
> 
>>> #3) RCU idle
>>>    are really more than questionable. For 99.9999% of instrumentation
>>>    users it's absolutely irrelevant whether this traces the interrupt
>>>    disabled time of user_exit_irqsoff() or rcu_irq_enter() or not.
>>> 
>>>    But what's relevant is the tracer overhead which is e.g. inflicted
>>>    with todays trace_hardirqs_off/on() implementation because that
>>>    unconditionally uses the rcuidle variant with the scru/rcu_irq dance
>>>    around every tracepoint.
>>
>> I think one of the big issues here is that most of the uses of
>> trace_hardirqs_off() are from sites which already have RCU watching,
>> so we are doing heavy-weight operations for nothing.
> 
> That and in some places in the entry code we do the heavy weight
> operations to cover 100 instructions which turn on RCU anyway. That does
> not make any sense at all.
> 
>> I strongly suspect that most of the overhead we've been trying to avoid when
>> introducing use of SRCU in rcuidle tracepoints was actually caused by callsites
>> which use rcuidle tracepoints while having RCU watching, just because there is a
>> handful of callsites which don't have RCU watching. This is confirmed
>> by the commit message of commit e6753f23d9 "tracepoint: Make rcuidle
>> tracepoint callers use SRCU":
>>
>>    "In recent tests with IRQ on/off tracepoints, a large performance
>>     overhead ~10% is noticed when running hackbench. This is root caused to
>>     calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
>>     tracepoint code. Following a long discussion on the list [1] about this,
>>     we concluded that srcu is a better alternative for use during rcu idle.
>>     Although it does involve extra barriers, its lighter than the sched-rcu
>>     version which has to do additional RCU calls to notify RCU idle about
>>     entry into RCU sections.
>> [...]
>>     Test: Tested idle and preempt/irq tracepoints."
> 
> In a quick test I did with a invalid syscall number with profiling the
> trace_hardirqs_off() is pretty prominent and goes down by roughly a
> factor of 2 when I move it past enter_from_user_mode() and use just the
> non RCU idle variant.

I think one issue here is that trace_hardirqs_off() is now shared between
lockdep and tracing. For lockdep, we have the following comment:

        /*
         * IRQ from user mode.
         *
         * We need to tell lockdep that IRQs are off.  We can't do this until
         * we fix gsbase, and we should do it before enter_from_user_mode
         * (which can take locks).  Since TRACE_IRQS_OFF is idempotent,
         * the simplest way to handle it is to just call it twice if
         * we enter from user mode.  There's no reason to optimize this since
         * TRACE_IRQS_OFF is a no-op if lockdep is off.
         */
        TRACE_IRQS_OFF

        CALL_enter_from_user_mode

1:
        ENTER_IRQ_STACK old_rsp=%rdi save_ret=1
        /* We entered an interrupt context - irqs are off: */
        TRACE_IRQS_OFF

which seems to imply that lockdep requires TRACE_IRQS_OFF to be performed
_before_ entering from usermode. I don't expect this to be useful at all for
other tracers though. I think this should be replaced by a new e.g.
LOCKDEP_ENTER_FROM_USER_MODE or such which would call into lockdep without
calling other tracers.

> 
>> So I think we could go back to plain RCU for rcuidle tracepoints if we do
>> the cheaper "rcu_is_watching()" check rather than invoking
>> rcu_irq_{enter,exit}_irqson() unconditionally.
> 
> Haven't tried that yet for the overall usage of trace_hardirqs_off(),
> but yes, it's going to be a measurable difference.

Ideally we should do both: change the TRACE_IRQS_OFF prior entering from usermode
for a lockdep-specific call, which would remove the frequent case which requires
temporarily enabling RCU, *and* change tracepoints to use rcu_is_watching(),
temporarily enable RCU, and use standard RCU for rcuidle cases.

> 
>>>    Even if the tracepoint sits in the ASM code it just covers about ~20
>>>    low level ASM instructions more. The tracer invocation, which is
>>>    even done twice when coming from user space on x86 (the second call
>>>    is optimized in the tracer C-code), costs definitely way more
>>>    cycles. When you take the scru/rcu_irq dance into account it's a
>>>    complete disaster performance wise.
>>
>> Part of the issue here is the current overhead of SRCU read-side lock,
>> which contains memory barriers. The other part of the issue is the fact that
>> rcu_irq_{enter,exit}_irqson() contains an atomic_add_return atomic instruction.
>>
>> We could use the approach proposed by Peterz's and Steven's patches to basically
>> do a lightweight "is_rcu_watching()" check for rcuidle tracepoint, and only
>> enable
>> RCU for those cases. We could then simply go back on using regular RCU
>> like so:
> 
> Right, but that still does the whole rcu_irq dance especially in the
> entry code just to trace 50 or 100 instructions which are turning on RCU
> anyway.

Agreed. Would changing this to a lockdep-specific call as I suggest above
solve this ?

> 
>>> #4 Protecting call chains
[...]
> 
>> In addition to splitting tracing code into a separate section, which I think
>> makes sense, I can think of another alternative way to provide call chains
>> protection: adding a "in_tracing" flag somewhere alongside each kernel stack.
>> Each thread and interrupt stack would have its own flag. However, trap handlers
>> should share the "in_tracing" flag with the context which triggers the trap.
>>
>> If a tracer recurses, or if a tracer attempts to trace another tracer, the
>> instrumentation would break the recursion chain by preventing instrumentation
>> from firing. If we end up caring about tracers tracing other tracers, we could
>> have one distinct flag per tracer and let each tracer break the recursion chain.
>>
>> Having this flag per kernel stack rather than per CPU or per thread would
>> allow tracing of nested interrupt handlers (and NMIs), but would break
>> call chains both within the same stack or going through a trap. I think
>> it could be a nice complementary safety net to handle mishaps in a non-fatal
>> way.
> 
> That works as long as none of this uses breakpoint based patching to
> dynamically disable/enable stuff.

I'm clearly missing something here. I was expecting the "in_tracing" flag trick
to be able to fix the breakpoint recursion issue. What is the problem I'm missing
here ?

An additional point about splitting core code into separate non-instrumentable
sections: The part I'm concerned about is instrumentation of trap handlers.
_That_ will likely need to remain shared between the kernel and the tracers.
This is where I think the "in_tracing" flag approach would help us solve the
recursion issue without losing that important instrumentation coverage.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 20:47 ` Paul E. McKenney
  2020-03-09 20:58   ` Steven Rostedt
  2020-03-09 23:52   ` Frederic Weisbecker
@ 2020-03-10 15:13   ` Mathieu Desnoyers
  2020-03-10 16:49     ` Paul E. McKenney
  2 siblings, 1 reply; 44+ messages in thread
From: Mathieu Desnoyers @ 2020-03-10 15:13 UTC (permalink / raw)
  To: paulmck
  Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, rostedt,
	Masami Hiramatsu, Alexei Starovoitov, Joel Fernandes, Google,
	Frederic Weisbecker



----- On Mar 9, 2020, at 4:47 PM, paulmck paulmck@kernel.org wrote:
[...]

> 
> Suppose that we had a variant of RCU that had about the same read-side
> overhead as Preempt-RCU, but which could be used from idle as well as
> from CPUs in the process of coming online or going offline?  I have not
> thought through the irq/NMI/exception entry/exit cases, but I don't see
> why that would be problem.
> 
> This would have explicit critical-section entry/exit code, so it would
> not be any help for trampolines.
> 
> Would such a variant of RCU help?
> 
> Yeah, I know.  Just what the kernel doesn't need, yet another variant
> of RCU...

Hi Paul,

I think that before introducing yet another RCU flavor, it's important
to take a step back and look at the tracer requirements first. If those
end up being covered by currently available RCU flavors, then why add
another ?

I can start with a few use-cases I have in mind. Others should feel free
to pitch in:

Tracing callsite context:

1) Thread context

   1.1) Preemption enabled

   One tracepoint in this category is syscall enter/exit. We should introduce
   a variant of tracepoints relying on SRCU for this use-case so we can take
   page faults when fetching userspace data.

   1.2) Preemption disabled

   Tree-RCU works fine.

   1.3) IRQs disabled

   Tree-RCU works fine.

2) IRQ handler context

   Tree-RCU works fine.

3) NMI context

   Tree-RCU works fine.

4) cpuidle context (!rcu_is_watching())

   - By all means, we should not have tracepoints requiring to temporarily enable
     RCU in frequent code-paths. It appears that we should be able to remove the few
     offenders we currently have (e.g. enter from usermode),
   - For tracepoints which are infrequently called from !rcu_is_watching context, checking
     whether RCU is watching and only enabling when needed should be fast enough.

Are there other use-cases am I missing that would justify adding another flavor of RCU ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-10  8:09     ` Masami Hiramatsu
  2020-03-10 11:43       ` Thomas Gleixner
@ 2020-03-10 15:24       ` Mathieu Desnoyers
  2020-03-10 17:05       ` Daniel Thompson
  2 siblings, 0 replies; 44+ messages in thread
From: Mathieu Desnoyers @ 2020-03-10 15:24 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Thomas Gleixner, rostedt, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, paulmck, Joel Fernandes, Google,
	Frederic Weisbecker, Jason Wessel

----- On Mar 10, 2020, at 4:09 AM, Masami Hiramatsu mhiramat@kernel.org wrote:

> Hi,
> 
> On Mon, 09 Mar 2020 19:59:18 +0100
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 

[...]

>> which is also in section "text" then the analysis tool will find the
>> missing offlimit_safecall() - or what ever method we chose to annotate
>> that stuff. Surely not an annotation on the called function itself
>> because that might be safe to call in one context but not in another.
> 
> Hmm, what the offlimit_safecall() does? and what happen if the
> do_fragile_stuff_on_enter() invokes a library code? I think we also need
> to tweak kbuild to duplicate some library code to the off-limit text area.
> 
>> These annotations are halfways easy to monitor for abuse and they should
>> be prominent enough in the code that at least for the people dealing
>> with that kind of code they act as a warning flag.
> 
> This off-limit text will be good for entries, but I think we still not
> able to remove all NOKPROBE_SYMBOLS with this.
> 
> For example __die() is marked a NOKPROBE because if we hit a recursive
> int3, it calls BUG() to dump stacks etc for debug. So that function
> must NOT probed. (I think we also should mark all backtrace functions
> in this case, but not yet) Would we move those backtrace related
> functions (including printk, and console drivers?) into the offlimit
> text too?
> 
> Hmm, if there is a bust_kprobes(), that can be easy to fix this issue.

In order to solve the recursion issue without losing instrumentation
coverage of traps and significant system events, I really think we should
look into adding some kind of "in_tracing" flag near each stack. We could
name this "struct recursion_context" or something similar. We could then
add this structure alongside each thread and ISR stack, but make sure traps
(including breakpoints) still point to the recursion context of whatever
caused the trap.

This should allow us to detect and prevent recursion from tracers directly
at the instrumentation level.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-10 11:43       ` Thomas Gleixner
@ 2020-03-10 15:31         ` Mathieu Desnoyers
  2020-03-10 15:46           ` Steven Rostedt
  2020-03-10 16:06         ` Masami Hiramatsu
  2020-03-12 13:53         ` Peter Zijlstra
  2 siblings, 1 reply; 44+ messages in thread
From: Mathieu Desnoyers @ 2020-03-10 15:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Masami Hiramatsu, rostedt, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, paulmck, Joel Fernandes, Google,
	Frederic Weisbecker, Jason Wessel

----- On Mar 10, 2020, at 7:43 AM, Thomas Gleixner tglx@linutronix.de wrote:

[...]
> 
> That's why we want the sections and the annotation. If something calls
> out of a noinstr section into a regular text section and the call is not
> annotated at the call site, then objtool can complain and tell you. What
> Peter and I came up with looks like this:
> 
> noinstr foo()
>	do_protected(); <- Safe because in the noinstr section
> 
>	instr_begin();	<- Marks the begin of a safe region, ignored
>        		   by objtool
> 
>        do_stuff();     <- All good
> 
>        instr_end();    <- End of the safe region. objtool starts
>			   looking again
> 
>        do_other_stuff();  <- Unsafe because do_other_stuff() is
>        		      not protected
> and:
> 
> noinstr do_protected()
>        bar();		<- objtool will complain here
> 
> See?

I think there are two distinct problems we are trying to solve here,
and it would be good to spell them out to see which pieces of technical
solution apply to which.

Problem #1) Tracer invoked from partially initialized kernel context

  - Moving the early/late entry/exit points into sections invisible from
    instrumentation seems to make tons of sense for this.

Problem #2) Tracer recursion

  - I'm much less convinced that hiding entry points from instrumentation
    works for this. As an example, with the isntr_begin/end() approach you
    propose above, as soon as you have a tracer recursing into itself because
    something below do_stuff() has been instrumented, having hidden the entry
    point did not help at all.

So I would be tempted to use the "hide entry/exit points" with explicit
instr begin/end annotation to solve Problem #1, but I'm still thinking there
is value in the per recursion context "in_tracing" flag to prevent tracer
recursion.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-10 15:31         ` Mathieu Desnoyers
@ 2020-03-10 15:46           ` Steven Rostedt
  2020-03-10 16:21             ` Mathieu Desnoyers
  0 siblings, 1 reply; 44+ messages in thread
From: Steven Rostedt @ 2020-03-10 15:46 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, paulmck, Joel Fernandes, Google,
	Frederic Weisbecker, Jason Wessel

On Tue, 10 Mar 2020 11:31:51 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> I think there are two distinct problems we are trying to solve here,
> and it would be good to spell them out to see which pieces of technical
> solution apply to which.
> 
> Problem #1) Tracer invoked from partially initialized kernel context
> 
>   - Moving the early/late entry/exit points into sections invisible from
>     instrumentation seems to make tons of sense for this.
> 
> Problem #2) Tracer recursion
> 
>   - I'm much less convinced that hiding entry points from instrumentation
>     works for this. As an example, with the isntr_begin/end() approach you
>     propose above, as soon as you have a tracer recursing into itself because
>     something below do_stuff() has been instrumented, having hidden the entry
>     point did not help at all.
> 
> So I would be tempted to use the "hide entry/exit points" with explicit
> instr begin/end annotation to solve Problem #1, but I'm still thinking there
> is value in the per recursion context "in_tracing" flag to prevent tracer
> recursion.

The only recursion issue that I've seen discussed is breakpoints. And
that's outside of the tracer infrastructure. Basically, if someone added a
breakpoint for a kprobe on something that gets called in the int3 code
before kprobes is called we have (let's say rcu_nmi_enter()):


 rcu_nmi_enter();
  <int3>
     do_int3() {
        rcu_nmi_enter();
          <int3>
             do_int3();
                [..]

Where would a "in_tracer" flag help here? Perhaps a "in_breakpoint" could?

-- Steve

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-10 11:43       ` Thomas Gleixner
  2020-03-10 15:31         ` Mathieu Desnoyers
@ 2020-03-10 16:06         ` Masami Hiramatsu
  2020-03-12 13:53         ` Peter Zijlstra
  2 siblings, 0 replies; 44+ messages in thread
From: Masami Hiramatsu @ 2020-03-10 16:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Steven Rostedt, LKML, Peter Zijlstra, Alexei Starovoitov,
	Mathieu Desnoyers, Paul E. McKenney, Joel Fernandes,
	Frederic Weisbecker, Jason Wessel

On Tue, 10 Mar 2020 12:43:27 +0100
Thomas Gleixner <tglx@linutronix.de> wrote:

> Masami,
> 
> Masami Hiramatsu <mhiramat@kernel.org> writes:
> > On Mon, 09 Mar 2020 19:59:18 +0100
> > Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> >> >> #2) Breakpoint utilization
> >> >> 
> >> >>     As recent findings have shown, breakpoint utilization needs to be
> >> >>     extremly careful about not creating infinite breakpoint recursions.
> >> >> 
> >> >>     I think that's pretty much obvious, but falls into the overall
> >> >>     question of how to protect callchains.
> >> >
> >> > This is rather unique, and I agree that its best to at least get to a point
> >> > where we limit the tracing within breakpoint code. I'm fine with making
> >> > rcu_nmi_exit() nokprobe too.
> >> 
> >> Yes, the break point stuff is unique, but it has nicely demonstrated how
> >> much of the code is affected by it.
> >
> > I see. I had followed the callchain several times, and always found new function.
> > So I agree with the off-limit section idea. That is a kind of entry code section
> > but more generic one. It is natural to split such sensitive code in different
> > place.
> >
> > BTW, what about kdb stuffs? (+Cc Jason)
> 
> That's yet another area of wreckage which nobody every looked at.
> 
> >> >> #4 Protecting call chains
> >> >> 
> >> >>    Our current approach of annotating functions with notrace/noprobe is
> >> >>    pretty much broken.
> >> >> 
> >> >>    Functions which are marked NOPROBE or notrace call out into functions
> >> >>    which are not marked and while this might be ok, there are enough
> >> >>    places where it is not. But we have no way to verify that.
> >
> > Agreed. That's the reason why I haven't add kprobe-fuzzer yet.
> > It is easy to make a fuzzer for kprobes by ftrace (note that we need
> > to enable CONFIG_KPROBE_EVENTS_ON_NOTRACE=y to check notrace functions),
> > but there is no way to kick the target code. In the result, most of the
> > kprobed functions are just not hit. I'm not sure such test code is
> > reasonable or not.
> 
> Well, test code is always reasonable, but you have to be aware that code
> coverage is a really hard to solve problem with a code base as complex
> as the kernel.

Yes, especially, the corner case which we need to find is hard to cover.

> >> which is also in section "text" then the analysis tool will find the
> >> missing offlimit_safecall() - or what ever method we chose to annotate
> >> that stuff. Surely not an annotation on the called function itself
> >> because that might be safe to call in one context but not in another.
> >
> > Hmm, what the offlimit_safecall() does? and what happen if the 
> > do_fragile_stuff_on_enter() invokes a library code? I think we also need
> > to tweak kbuild to duplicate some library code to the off-limit text
> > area.
> 
> That's why we want the sections and the annotation. If something calls
> out of a noinstr section into a regular text section and the call is not
> annotated at the call site, then objtool can complain and tell you. What
> Peter and I came up with looks like this:
> 
> noinstr foo()
> 	do_protected(); <- Safe because in the noinstr section
> 
> 	instr_begin();	<- Marks the begin of a safe region, ignored
>         		   by objtool
> 
>         do_stuff();     <- All good   
> 
>         instr_end();    <- End of the safe region. objtool starts
> 			   looking again
> 
>         do_other_stuff();  <- Unsafe because do_other_stuff() is
>         		      not protected
> and:
> 
> noinstr do_protected()
>         bar();		<- objtool will complain here
> 
> See?

OK, so this is for what the instr_begin() and instr_end() ensure the
instrumentation safeness. Would you think this will also applied to
notrace functons? I mean, how can large this section be.

It seems there are several different aspect (or level), RCU idle,
kprobes recursive call(NOKPROBE), and ftrace recursive call(notrace).
Would the offlimit section include all of them or some specific cases?

> >> These annotations are halfways easy to monitor for abuse and they should
> >> be prominent enough in the code that at least for the people dealing
> >> with that kind of code they act as a warning flag.
> >
> > This off-limit text will be good for entries, but I think we still not
> > able to remove all NOKPROBE_SYMBOLS with this.
> >
> > For example __die() is marked a NOKPROBE because if we hit a recursive
> > int3, it calls BUG() to dump stacks etc for debug. So that function
> > must NOT probed. (I think we also should mark all backtrace functions
> > in this case, but not yet) Would we move those backtrace related
> > functions (including printk, and console drivers?) into the offlimit
> > text too?
> 
> That's something we need to figure out and decide on. Some of this stuff
> sureley wants to be in the noinstr section. Other things might end up
> still being explicitely annotated, but that should be the exception not
> the rule.

I think some library code can be duplicated in the offlimit section if
needed. And stacktrace can also be a part of offlimit as like as a
"service" call from normal world, since usually, stacktrace is for
debugging.

> > Hmm, if there is a bust_kprobes(), that can be easy to fix this issue.
> 
> That might help, but is obviously racy as hell.

I tried something like below. It needs a new set_memory_rw()
like function which can be called from int3 context.

diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 4d7022a740ab..7ab5bbc69cc9 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -619,6 +619,9 @@ static void setup_singlestep(struct kprobe *p, struct pt_regs *regs,
 }
 NOKPROBE_SYMBOL(setup_singlestep);
 
+int __read_mostly kprobe_oops_in_progress;
+extern int __set_page_rw_raw(unsigned long addr);
+
 /*
  * We have reentered the kprobe_handler(), since another probe was hit while
  * within the handler. We save the original kprobes variables and just single
@@ -635,6 +638,17 @@ static int reenter_kprobe(struct kprobe *p, struct pt_regs *regs,
 		setup_singlestep(p, regs, kcb, 1);
 		break;
 	case KPROBE_REENTER:
+		if (unlikely(kprobe_oops_in_progress)) {
+			/*
+			 * We do not sync cores but there is no way to send
+			 * IPI from here. If other core hits int3, they can
+			 * recover by themselves.
+			 */
+			__set_page_rw_raw((unsigned long)p->addr);
+			*(p->addr) = p->ainsn.insn[0];
+			return 1;
+		}
+
 		/* A probe has been hit in the codepath leading up to, or just
 		 * after, single-stepping of a probed instruction. This entire
 		 * codepath should strictly reside in .kprobes.text section.
@@ -643,6 +657,8 @@ static int reenter_kprobe(struct kprobe *p, struct pt_regs *regs,
 		 */
 		pr_err("Unrecoverable kprobe detected.\n");
 		dump_kprobe(p);
+		/* Any further non-optimized kprobes are disabled forcibly */
+		kprobe_oops_in_progress++;
 		BUG();
 	default:
 		/* impossible cases */
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index abbdecb75fad..0f550b1aad29 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -33,6 +33,7 @@
 #include <linux/nmi.h>
 #include <linux/gfp.h>
 #include <linux/kcore.h>
+#include <linux/kprobes.h>
 
 #include <asm/processor.h>
 #include <asm/bios_ebda.h>
@@ -1345,6 +1346,60 @@ int kern_addr_valid(unsigned long addr)
 	return pfn_valid(pte_pfn(*pte));
 }
 
+int __set_page_rw_raw(unsigned long addr)
+{
+	unsigned long above = ((long)addr) >> __VIRTUAL_MASK_SHIFT;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	if (above != 0 && above != -1UL)
+		return 0;
+
+	pgd = pgd_offset_k(addr);
+	if (pgd_none(*pgd))
+		return 0;
+
+	p4d = p4d_offset(pgd, addr);
+	if (p4d_none(*p4d))
+		return 0;
+
+	pud = pud_offset(p4d, addr);
+	if (pud_none(*pud))
+		return 0;
+
+	if (pud_large(*pud)) {
+		if (!pfn_valid(pud_pfn(*pud)))
+			return 0;
+		pud_mkwrite(*pud);
+		goto flush;
+	}
+
+	pmd = pmd_offset(pud, addr);
+	if (pmd_none(*pmd))
+		return 0;
+
+	if (pmd_large(*pmd)) {
+		if (!pfn_valid(pmd_pfn(*pmd)))
+			return 0;
+		pmd_mkwrite(*pmd);
+		goto flush;
+	}
+
+	pte = pte_offset_kernel(pmd, addr);
+	if (pte_none(*pte) || !pfn_valid(pte_pfn(*pte)))
+		return 0;
+
+	pte_mkwrite(*pte);
+
+flush:
+	__flush_tlb_one_kernel(addr);
+	return 1;
+}
+NOKPROBE_SYMBOL(__set_page_rw_raw);
+
 /*
  * Block size is the minimum amount of memory which can be hotplugged or
  * hotremoved. It must be power of two and must be equal or larger than


Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-10 15:46           ` Steven Rostedt
@ 2020-03-10 16:21             ` Mathieu Desnoyers
  2020-03-11  0:18               ` Masami Hiramatsu
  0 siblings, 1 reply; 44+ messages in thread
From: Mathieu Desnoyers @ 2020-03-10 16:21 UTC (permalink / raw)
  To: rostedt
  Cc: Thomas Gleixner, Masami Hiramatsu, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, paulmck, Joel Fernandes, Google,
	Frederic Weisbecker, Jason Wessel

----- On Mar 10, 2020, at 11:46 AM, rostedt rostedt@goodmis.org wrote:

> On Tue, 10 Mar 2020 11:31:51 -0400 (EDT)
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> I think there are two distinct problems we are trying to solve here,
>> and it would be good to spell them out to see which pieces of technical
>> solution apply to which.
>> 
>> Problem #1) Tracer invoked from partially initialized kernel context
>> 
>>   - Moving the early/late entry/exit points into sections invisible from
>>     instrumentation seems to make tons of sense for this.
>> 
>> Problem #2) Tracer recursion
>> 
>>   - I'm much less convinced that hiding entry points from instrumentation
>>     works for this. As an example, with the isntr_begin/end() approach you
>>     propose above, as soon as you have a tracer recursing into itself because
>>     something below do_stuff() has been instrumented, having hidden the entry
>>     point did not help at all.
>> 
>> So I would be tempted to use the "hide entry/exit points" with explicit
>> instr begin/end annotation to solve Problem #1, but I'm still thinking there
>> is value in the per recursion context "in_tracing" flag to prevent tracer
>> recursion.
> 
> The only recursion issue that I've seen discussed is breakpoints. And
> that's outside of the tracer infrastructure. Basically, if someone added a
> breakpoint for a kprobe on something that gets called in the int3 code
> before kprobes is called we have (let's say rcu_nmi_enter()):
> 
> 
> rcu_nmi_enter();
>  <int3>
>     do_int3() {
>        rcu_nmi_enter();
>          <int3>
>             do_int3();
>                [..]
> 
> Where would a "in_tracer" flag help here? Perhaps a "in_breakpoint" could?

An approach where the "in_tracer" flag is tested and set by the instrumentation
(function tracer, kprobes, tracepoints) would work here. Let's say the beginning
of the int3 ISR is part of the code which is invisible to instrumentation, and
before we issue rcu_nmi_enter(), we handle the in_tracer flag:

rcu_nmi_enter();
 <int3>
    (recursion_ctx->in_tracer == false)
    set recursion_ctx->in_tracer = true
    do_int3() {
       rcu_nmi_enter();
         <int3>
            if (recursion_ctx->in_tracer == true)
                iret

We can change "in_tracer" for "in_breakpoint", "in_tracepoint" and
"in_function_trace" if we ever want to allow different types of instrumentation
to nest. I'm not sure whether this is useful or not through.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-10 15:03     ` Mathieu Desnoyers
@ 2020-03-10 16:48       ` Thomas Gleixner
  2020-03-10 17:40         ` Mathieu Desnoyers
  0 siblings, 1 reply; 44+ messages in thread
From: Thomas Gleixner @ 2020-03-10 16:48 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Peter Zijlstra, rostedt, Masami Hiramatsu,
	Alexei Starovoitov, paulmck, Joel Fernandes, Google,
	Frederic Weisbecker

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> ----- On Mar 9, 2020, at 3:52 PM, Thomas Gleixner tglx@linutronix.de wrote:
>> In a quick test I did with a invalid syscall number with profiling the
>> trace_hardirqs_off() is pretty prominent and goes down by roughly a
>> factor of 2 when I move it past enter_from_user_mode() and use just the
>> non RCU idle variant.
>
> I think one issue here is that trace_hardirqs_off() is now shared between
> lockdep and tracing. For lockdep, we have the following comment:
>
>         /*
>          * IRQ from user mode.
>          *
>          * We need to tell lockdep that IRQs are off.  We can't do this until
>          * we fix gsbase, and we should do it before enter_from_user_mode
>          * (which can take locks).  Since TRACE_IRQS_OFF is idempotent,
>          * the simplest way to handle it is to just call it twice if
>          * we enter from user mode.  There's no reason to optimize this since
>          * TRACE_IRQS_OFF is a no-op if lockdep is off.
>          */
>         TRACE_IRQS_OFF
>
>         CALL_enter_from_user_mode
>
> 1:
>         ENTER_IRQ_STACK old_rsp=%rdi save_ret=1
>         /* We entered an interrupt context - irqs are off: */
>         TRACE_IRQS_OFF
>
> which seems to imply that lockdep requires TRACE_IRQS_OFF to be performed
> _before_ entering from usermode. I don't expect this to be useful at all for
> other tracers though. I think this should be replaced by a new e.g.
> LOCKDEP_ENTER_FROM_USER_MODE or such which would call into lockdep without
> calling other tracers.

See the entry series I'm working on. Aside of moving all this nonsense
into C-code it splits lockdep and tracing so it looks like this:

            lockdep_hardirqs_off();
            user_exit_irqsoff();
            __trace_hardirqs_off();

The latter uses regular RCU and not the scru/rcu_irq dance.

>> Right, but that still does the whole rcu_irq dance especially in the
>> entry code just to trace 50 or 100 instructions which are turning on RCU
>> anyway.
>
> Agreed. Would changing this to a lockdep-specific call as I suggest above
> solve this ?

That split exist for a few weeks now at least in my patches :)

>>> If a tracer recurses, or if a tracer attempts to trace another tracer, the
>>> instrumentation would break the recursion chain by preventing instrumentation
>>> from firing. If we end up caring about tracers tracing other tracers, we could
>>> have one distinct flag per tracer and let each tracer break the recursion chain.
>>>
>>> Having this flag per kernel stack rather than per CPU or per thread would
>>> allow tracing of nested interrupt handlers (and NMIs), but would break
>>> call chains both within the same stack or going through a trap. I think
>>> it could be a nice complementary safety net to handle mishaps in a non-fatal
>>> way.
>> 
>> That works as long as none of this uses breakpoint based patching to
>> dynamically disable/enable stuff.
>
> I'm clearly missing something here. I was expecting the "in_tracing" flag trick
> to be able to fix the breakpoint recursion issue. What is the problem I'm missing
> here ?

How do you "fix" that when you can't reach the tracepoint because you
trip over a breakpoint and then while trying to fixup that stuff you hit
another one?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-10 15:13   ` Mathieu Desnoyers
@ 2020-03-10 16:49     ` Paul E. McKenney
  2020-03-10 17:22       ` Mathieu Desnoyers
  0 siblings, 1 reply; 44+ messages in thread
From: Paul E. McKenney @ 2020-03-10 16:49 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, rostedt,
	Masami Hiramatsu, Alexei Starovoitov, Joel Fernandes, Google,
	Frederic Weisbecker

On Tue, Mar 10, 2020 at 11:13:27AM -0400, Mathieu Desnoyers wrote:
> 
> 
> ----- On Mar 9, 2020, at 4:47 PM, paulmck paulmck@kernel.org wrote:
> [...]
> 
> > 
> > Suppose that we had a variant of RCU that had about the same read-side
> > overhead as Preempt-RCU, but which could be used from idle as well as
> > from CPUs in the process of coming online or going offline?  I have not
> > thought through the irq/NMI/exception entry/exit cases, but I don't see
> > why that would be problem.
> > 
> > This would have explicit critical-section entry/exit code, so it would
> > not be any help for trampolines.
> > 
> > Would such a variant of RCU help?
> > 
> > Yeah, I know.  Just what the kernel doesn't need, yet another variant
> > of RCU...
> 
> Hi Paul,
> 
> I think that before introducing yet another RCU flavor, it's important
> to take a step back and look at the tracer requirements first. If those
> end up being covered by currently available RCU flavors, then why add
> another ?

Well, we have BPF requirements as well.

> I can start with a few use-cases I have in mind. Others should feel free
> to pitch in:
> 
> Tracing callsite context:
> 
> 1) Thread context
> 
>    1.1) Preemption enabled
> 
>    One tracepoint in this category is syscall enter/exit. We should introduce
>    a variant of tracepoints relying on SRCU for this use-case so we can take
>    page faults when fetching userspace data.

Agreed, SRCU works fine for the page-fault case, as the read-side memory
barriers are in the noise compared to page-fault overhead.  Back in
the day, there were light-weight system calls.  Are all of these now
converted to VDSO or similar?

>    1.2) Preemption disabled
> 
>    Tree-RCU works fine.
> 
>    1.3) IRQs disabled
> 
>    Tree-RCU works fine.
> 
> 2) IRQ handler context
> 
>    Tree-RCU works fine.
> 
> 3) NMI context
> 
>    Tree-RCU works fine.
> 
> 4) cpuidle context (!rcu_is_watching())
> 
>    - By all means, we should not have tracepoints requiring to temporarily enable
>      RCU in frequent code-paths. It appears that we should be able to remove the few
>      offenders we currently have (e.g. enter from usermode),
>    - For tracepoints which are infrequently called from !rcu_is_watching context, checking
>      whether RCU is watching and only enabling when needed should be fast enough.
> 
> Are there other use-cases am I missing that would justify adding another flavor of RCU ?

BPF programs that might sometimes sleep, but are usually lightweight.

I will be double-checking this, of course.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-10  1:40   ` Alexei Starovoitov
  2020-03-10  8:02     ` Thomas Gleixner
@ 2020-03-10 16:54     ` Paul E. McKenney
  2020-03-17 17:56     ` Joel Fernandes
  2 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2020-03-10 16:54 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Mathieu Desnoyers, Thomas Gleixner, linux-kernel, Peter Zijlstra,
	rostedt, Masami Hiramatsu, Alexei Starovoitov, Joel Fernandes,
	Google, Frederic Weisbecker, bpf

On Mon, Mar 09, 2020 at 06:40:45PM -0700, Alexei Starovoitov wrote:
> On Mon, Mar 09, 2020 at 02:37:40PM -0400, Mathieu Desnoyers wrote:
> > > 
> > >    But what's relevant is the tracer overhead which is e.g. inflicted
> > >    with todays trace_hardirqs_off/on() implementation because that
> > >    unconditionally uses the rcuidle variant with the scru/rcu_irq dance
> > >    around every tracepoint.
> > 
> > I think one of the big issues here is that most of the uses of
> > trace_hardirqs_off() are from sites which already have RCU watching,
> > so we are doing heavy-weight operations for nothing.
> 
> I think kernel/trace/trace_preemptirq.c created too many problems for the
> kernel without providing tangible benefits. My understanding no one is using it
> in production. It's a tool to understand how kernel works. And such debugging
> tool can and should be removed.
> 
> One of Thomas's patches mentioned that bpf can be invoked from hardirq and
> preempt tracers. This connection doesn't exist in a direct way, but
> theoretically it's possible. There is no practical use though and I would be
> happy to blacklist such bpf usage at a minimum.
> 
> > We could use the approach proposed by Peterz's and Steven's patches to basically
> > do a lightweight "is_rcu_watching()" check for rcuidle tracepoint, and only enable
> > RCU for those cases. We could then simply go back on using regular RCU like so:
> > 
> > #define __DO_TRACE(tp, proto, args, cond, rcuidle)                      \
> >         do {                                                            \
> >                 struct tracepoint_func *it_func_ptr;                    \
> >                 void *it_func;                                          \
> >                 void *__data;                                           \
> >                 bool exit_rcu = false;                                  \
> >                                                                         \
> >                 if (!(cond))                                            \
> >                         return;                                         \
> >                                                                         \
> >                 if (rcuidle && !rcu_is_watching()) {                    \
> >                         rcu_irq_enter_irqson();                         \
> >                         exit_rcu = true;                                \
> >                 }                                                       \
> >                 preempt_disable_notrace();                              \
> >                 it_func_ptr = rcu_dereference_raw((tp)->funcs);         \
> >                 if (it_func_ptr) {                                      \
> >                         do {                                            \
> >                                 it_func = (it_func_ptr)->func;          \
> >                                 __data = (it_func_ptr)->data;           \
> >                                 ((void(*)(proto))(it_func))(args);      \
> >                         } while ((++it_func_ptr)->func);                \
> >                 }                                                       \
> >                 preempt_enable_notrace();                               \
> >                 if (exit_rcu)                                           \
> >                         rcu_irq_exit_irqson();                          \
> >         } while (0)
> 
> I think it's a fine approach interim.
> 
> Long term sounds like Paul is going to provide sleepable and low overhead
> rcu_read_lock_for_tracers() that will include bpf.

It now builds without errors, so the obvious problems are taken care of...

Working on the less-obvious errors as rcutorture encounters them.

> My understanding that this new rcu flavor won't have "idle" issues,
> so rcu_is_watching() checks will not be necessary.

True.  However, if the from-idle code invokes other code relying on
vanilla RCU, such checks are still required.  But I must let others
weigh in on this.

> And if we remove trace_preemptirq.c the only thing left will be Thomas's points
> 1 (low level entry) and 2 (breakpoints) that can be addressed without
> creating fancy .text annotations and teach objtool about it.

And the intent is to cover these cases as well.  Of course, we all know
which road is paved with good intentions.  ;-)

> In the mean time I've benchmarked srcu for sleepable bpf and it's quite heavy.
> srcu_read_lock+unlock roughly adds 10x execution cost to trivial bpf prog.
> I'm proceeding with it anyway, but really hoping that
> rcu_read_lock_for_tracers() will materialize soon.

OK, 10x is a bit on the painful side!

							Thanx, Paul

> In general I'm sceptical that .text annotations will work. Let's say all of
> idle is a red zone. But a ton of normal functions are called when idle. So
> objtool will go and mark them as red zone too. This way large percent of the
> kernel will be off limits for tracers. Which is imo not a good trade off. I
> think addressing 1 and 2 with explicit notrace/nokprobe annotations will cover
> all practical cases where people can shot themselves in a foot with a tracer. I
> realize that there will be forever whack-a-mole game and these annotations will
> never reach 100%. I think it's a fine trade off. Security is never 100% either.
> Tracing is never going to be 100% safe too.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-10  8:09     ` Masami Hiramatsu
  2020-03-10 11:43       ` Thomas Gleixner
  2020-03-10 15:24       ` Mathieu Desnoyers
@ 2020-03-10 17:05       ` Daniel Thompson
  2 siblings, 0 replies; 44+ messages in thread
From: Daniel Thompson @ 2020-03-10 17:05 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Thomas Gleixner, Steven Rostedt, LKML, Peter Zijlstra,
	Alexei Starovoitov, Mathieu Desnoyers, Paul E. McKenney,
	Joel Fernandes, Frederic Weisbecker, Jason Wessel

On Tue, Mar 10, 2020 at 05:09:51PM +0900, Masami Hiramatsu wrote:
> Hi,
> 
> On Mon, 09 Mar 2020 19:59:18 +0100
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > >> #2) Breakpoint utilization
> > >> 
> > >>     As recent findings have shown, breakpoint utilization needs to be
> > >>     extremly careful about not creating infinite breakpoint recursions.
> > >> 
> > >>     I think that's pretty much obvious, but falls into the overall
> > >>     question of how to protect callchains.
> > >
> > > This is rather unique, and I agree that its best to at least get to a point
> > > where we limit the tracing within breakpoint code. I'm fine with making
> > > rcu_nmi_exit() nokprobe too.
> > 
> > Yes, the break point stuff is unique, but it has nicely demonstrated how
> > much of the code is affected by it.
> 
> I see. I had followed the callchain several times, and always found new function.
> So I agree with the off-limit section idea. That is a kind of entry code section
> but more generic one. It is natural to split such sensitive code in different
> place.
> 
> BTW, what about kdb stuffs? (+Cc Jason)

There is some double breakpoint detection code but IIRC this merely
retrospectively warns the user that they have their hurt their system...
and whether the system would run long enough to reach that logic is
relatively unlikely.

For both kdb and kgdb, the main "defence" is the use case. Neither kdb
or kgdb faces the userspace (except via a SysRq, which can be disabled)
and triggering either already implies the user is not especially
concerned about things like availability guarantees since they are happy
for everything running on the system to be halted indefinitely.

Additionally breakpoints in kgdb/kdb are not wildcarded so there are no
need to worry about a user selecting a bad pattern! Setting a breakpoint
with kgdb/kdb needs a user (who is assumed to have kernel knowledge) to
have explicitly chose to study the dynamic behaviour of that particular
bit of code.

I'm not saying kgdb/kdb would not benefit from additional safety
barriers (it would), simply that the problem is less acute.


Daniel.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-10 16:49     ` Paul E. McKenney
@ 2020-03-10 17:22       ` Mathieu Desnoyers
  2020-03-10 17:26         ` Paul E. McKenney
  0 siblings, 1 reply; 44+ messages in thread
From: Mathieu Desnoyers @ 2020-03-10 17:22 UTC (permalink / raw)
  To: paulmck
  Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, rostedt,
	Masami Hiramatsu, Alexei Starovoitov, Joel Fernandes, Google,
	Frederic Weisbecker

----- On Mar 10, 2020, at 12:49 PM, paulmck paulmck@kernel.org wrote:

> On Tue, Mar 10, 2020 at 11:13:27AM -0400, Mathieu Desnoyers wrote:
>> 
>> 
>> ----- On Mar 9, 2020, at 4:47 PM, paulmck paulmck@kernel.org wrote:
>> [...]
>> 
>> > 
>> > Suppose that we had a variant of RCU that had about the same read-side
>> > overhead as Preempt-RCU, but which could be used from idle as well as
>> > from CPUs in the process of coming online or going offline?  I have not
>> > thought through the irq/NMI/exception entry/exit cases, but I don't see
>> > why that would be problem.
>> > 
>> > This would have explicit critical-section entry/exit code, so it would
>> > not be any help for trampolines.
>> > 
>> > Would such a variant of RCU help?
>> > 
>> > Yeah, I know.  Just what the kernel doesn't need, yet another variant
>> > of RCU...
>> 
>> Hi Paul,
>> 
>> I think that before introducing yet another RCU flavor, it's important
>> to take a step back and look at the tracer requirements first. If those
>> end up being covered by currently available RCU flavors, then why add
>> another ?
> 
> Well, we have BPF requirements as well.
> 
>> I can start with a few use-cases I have in mind. Others should feel free
>> to pitch in:
>> 
>> Tracing callsite context:
>> 
>> 1) Thread context
>> 
>>    1.1) Preemption enabled
>> 
>>    One tracepoint in this category is syscall enter/exit. We should introduce
>>    a variant of tracepoints relying on SRCU for this use-case so we can take
>>    page faults when fetching userspace data.
> 
> Agreed, SRCU works fine for the page-fault case, as the read-side memory
> barriers are in the noise compared to page-fault overhead.  Back in
> the day, there were light-weight system calls.  Are all of these now
> converted to VDSO or similar?

There is a big difference between allowing page faults to happen, and expecting
page faults to happen every time. I suspect many use-cases will end up having
a fast-path which touches user-space data which is in the page cache, but
may end up triggering page faults in rare occasions.

Therefore, this might justify an SRCU which has low-overhead read-side.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-10 17:22       ` Mathieu Desnoyers
@ 2020-03-10 17:26         ` Paul E. McKenney
  0 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2020-03-10 17:26 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, rostedt,
	Masami Hiramatsu, Alexei Starovoitov, Joel Fernandes, Google,
	Frederic Weisbecker

On Tue, Mar 10, 2020 at 01:22:45PM -0400, Mathieu Desnoyers wrote:
> ----- On Mar 10, 2020, at 12:49 PM, paulmck paulmck@kernel.org wrote:
> 
> > On Tue, Mar 10, 2020 at 11:13:27AM -0400, Mathieu Desnoyers wrote:
> >> 
> >> 
> >> ----- On Mar 9, 2020, at 4:47 PM, paulmck paulmck@kernel.org wrote:
> >> [...]
> >> 
> >> > 
> >> > Suppose that we had a variant of RCU that had about the same read-side
> >> > overhead as Preempt-RCU, but which could be used from idle as well as
> >> > from CPUs in the process of coming online or going offline?  I have not
> >> > thought through the irq/NMI/exception entry/exit cases, but I don't see
> >> > why that would be problem.
> >> > 
> >> > This would have explicit critical-section entry/exit code, so it would
> >> > not be any help for trampolines.
> >> > 
> >> > Would such a variant of RCU help?
> >> > 
> >> > Yeah, I know.  Just what the kernel doesn't need, yet another variant
> >> > of RCU...
> >> 
> >> Hi Paul,
> >> 
> >> I think that before introducing yet another RCU flavor, it's important
> >> to take a step back and look at the tracer requirements first. If those
> >> end up being covered by currently available RCU flavors, then why add
> >> another ?
> > 
> > Well, we have BPF requirements as well.
> > 
> >> I can start with a few use-cases I have in mind. Others should feel free
> >> to pitch in:
> >> 
> >> Tracing callsite context:
> >> 
> >> 1) Thread context
> >> 
> >>    1.1) Preemption enabled
> >> 
> >>    One tracepoint in this category is syscall enter/exit. We should introduce
> >>    a variant of tracepoints relying on SRCU for this use-case so we can take
> >>    page faults when fetching userspace data.
> > 
> > Agreed, SRCU works fine for the page-fault case, as the read-side memory
> > barriers are in the noise compared to page-fault overhead.  Back in
> > the day, there were light-weight system calls.  Are all of these now
> > converted to VDSO or similar?
> 
> There is a big difference between allowing page faults to happen, and expecting
> page faults to happen every time. I suspect many use-cases will end up having
> a fast-path which touches user-space data which is in the page cache, but
> may end up triggering page faults in rare occasions.
> 
> Therefore, this might justify an SRCU which has low-overhead read-side.

OK, good to know, thank you!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-10 16:48       ` Thomas Gleixner
@ 2020-03-10 17:40         ` Mathieu Desnoyers
  2020-03-10 18:31           ` Thomas Gleixner
  0 siblings, 1 reply; 44+ messages in thread
From: Mathieu Desnoyers @ 2020-03-10 17:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Peter Zijlstra, rostedt, Masami Hiramatsu,
	Alexei Starovoitov, paulmck, Joel Fernandes, Google,
	Frederic Weisbecker

----- On Mar 10, 2020, at 12:48 PM, Thomas Gleixner tglx@linutronix.de wrote:

> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
[...]
> See the entry series I'm working on. Aside of moving all this nonsense
> into C-code it splits lockdep and tracing so it looks like this:
> 
>            lockdep_hardirqs_off();
>            user_exit_irqsoff();
>            __trace_hardirqs_off();
> 
> The latter uses regular RCU and not the scru/rcu_irq dance.
> 

Awesome :)

> 
>>>> If a tracer recurses, or if a tracer attempts to trace another tracer, the
>>>> instrumentation would break the recursion chain by preventing instrumentation
>>>> from firing. If we end up caring about tracers tracing other tracers, we could
>>>> have one distinct flag per tracer and let each tracer break the recursion chain.
>>>>
>>>> Having this flag per kernel stack rather than per CPU or per thread would
>>>> allow tracing of nested interrupt handlers (and NMIs), but would break
>>>> call chains both within the same stack or going through a trap. I think
>>>> it could be a nice complementary safety net to handle mishaps in a non-fatal
>>>> way.
>>> 
>>> That works as long as none of this uses breakpoint based patching to
>>> dynamically disable/enable stuff.
>>
>> I'm clearly missing something here. I was expecting the "in_tracing" flag trick
>> to be able to fix the breakpoint recursion issue. What is the problem I'm
>> missing
>> here ?
> 
> How do you "fix" that when you can't reach the tracepoint because you
> trip over a breakpoint and then while trying to fixup that stuff you hit
> another one?

I may still be missing something, but if the fixup code (AFAIU the code performing
the out-of-line single-stepping of the original instruction) belongs to a section
hidden from instrumentation, it should not be an issue.

The basic idea would be, e.g. pseudo-code for int3:

<int3>  <---- in section which cannot be instrumented
if (recursion_ctx->in_tracer) {
   single-step original instruction
   iret
}
[...] prepare stuff
recursion_ctx->in_tracer = true;
instr_allowed()

call external kernel functions (which can be instrumented)

instr_disallowed()
recursion_ctx->in_tracer = false;
single-step original instruction
iret

The purpose of the "in_tracer" flag is to protect whatever is done within external
kernel functions (which can be instrumented) from triggering tracer recursion. It
needs to be combined with hiding of early/late low-level entry/exit functions from
instrumentation (as you propose) to work.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-10 17:40         ` Mathieu Desnoyers
@ 2020-03-10 18:31           ` Thomas Gleixner
  2020-03-10 18:37             ` Mathieu Desnoyers
  0 siblings, 1 reply; 44+ messages in thread
From: Thomas Gleixner @ 2020-03-10 18:31 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Peter Zijlstra, rostedt, Masami Hiramatsu,
	Alexei Starovoitov, paulmck, Joel Fernandes, Google,
	Frederic Weisbecker

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> ----- On Mar 10, 2020, at 12:48 PM, Thomas Gleixner tglx@linutronix.de wrote:
>> How do you "fix" that when you can't reach the tracepoint because you
>> trip over a breakpoint and then while trying to fixup that stuff you hit
>> another one?
>
> I may still be missing something, but if the fixup code (AFAIU the code performing
> the out-of-line single-stepping of the original instruction) belongs to a section
> hidden from instrumentation, it should not be an issue.

Sure, but what guarantees that on the way there is nothing which might
call into instrumentable code? Nothing, really.

That's why I want the explicit sections which can be analyzed by
tools. Humans (including me) are really bad at it was demonstrated
several times.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-10 18:31           ` Thomas Gleixner
@ 2020-03-10 18:37             ` Mathieu Desnoyers
  0 siblings, 0 replies; 44+ messages in thread
From: Mathieu Desnoyers @ 2020-03-10 18:37 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Peter Zijlstra, rostedt, Masami Hiramatsu,
	Alexei Starovoitov, paulmck, Joel Fernandes, Google,
	Frederic Weisbecker

----- On Mar 10, 2020, at 2:31 PM, Thomas Gleixner tglx@linutronix.de wrote:

> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
>> ----- On Mar 10, 2020, at 12:48 PM, Thomas Gleixner tglx@linutronix.de wrote:
>>> How do you "fix" that when you can't reach the tracepoint because you
>>> trip over a breakpoint and then while trying to fixup that stuff you hit
>>> another one?
>>
>> I may still be missing something, but if the fixup code (AFAIU the code
>> performing
>> the out-of-line single-stepping of the original instruction) belongs to a
>> section
>> hidden from instrumentation, it should not be an issue.
> 
> Sure, but what guarantees that on the way there is nothing which might
> call into instrumentable code? Nothing, really.
> 
> That's why I want the explicit sections which can be analyzed by
> tools. Humans (including me) are really bad at it was demonstrated
> several times.

AFAIU, my in_tracer flag proposal complements yours.

The explicit sections thingy is required for early/late entry/exit code,
but does nothing to prevent "explicitly marked" safe-to-instrument kernel
code from triggering infinite recursion.

My flag proposal handles that second issue.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-10 16:21             ` Mathieu Desnoyers
@ 2020-03-11  0:18               ` Masami Hiramatsu
  2020-03-11  0:37                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 44+ messages in thread
From: Masami Hiramatsu @ 2020-03-11  0:18 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: rostedt, Thomas Gleixner, Masami Hiramatsu, linux-kernel,
	Peter Zijlstra, Alexei Starovoitov, paulmck, Joel Fernandes,
	Google, Frederic Weisbecker, Jason Wessel

Hi Mathieu,

On Tue, 10 Mar 2020 12:21:31 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> ----- On Mar 10, 2020, at 11:46 AM, rostedt rostedt@goodmis.org wrote:
> 
> > On Tue, 10 Mar 2020 11:31:51 -0400 (EDT)
> > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > 
> >> I think there are two distinct problems we are trying to solve here,
> >> and it would be good to spell them out to see which pieces of technical
> >> solution apply to which.
> >> 
> >> Problem #1) Tracer invoked from partially initialized kernel context
> >> 
> >>   - Moving the early/late entry/exit points into sections invisible from
> >>     instrumentation seems to make tons of sense for this.
> >> 
> >> Problem #2) Tracer recursion
> >> 
> >>   - I'm much less convinced that hiding entry points from instrumentation
> >>     works for this. As an example, with the isntr_begin/end() approach you
> >>     propose above, as soon as you have a tracer recursing into itself because
> >>     something below do_stuff() has been instrumented, having hidden the entry
> >>     point did not help at all.
> >> 
> >> So I would be tempted to use the "hide entry/exit points" with explicit
> >> instr begin/end annotation to solve Problem #1, but I'm still thinking there
> >> is value in the per recursion context "in_tracing" flag to prevent tracer
> >> recursion.
> > 
> > The only recursion issue that I've seen discussed is breakpoints. And
> > that's outside of the tracer infrastructure. Basically, if someone added a
> > breakpoint for a kprobe on something that gets called in the int3 code
> > before kprobes is called we have (let's say rcu_nmi_enter()):
> > 
> > 
> > rcu_nmi_enter();
> >  <int3>
> >     do_int3() {
> >        rcu_nmi_enter();
> >          <int3>
> >             do_int3();
> >                [..]
> > 
> > Where would a "in_tracer" flag help here? Perhaps a "in_breakpoint" could?
> 
> An approach where the "in_tracer" flag is tested and set by the instrumentation
> (function tracer, kprobes, tracepoints) would work here. Let's say the beginning
> of the int3 ISR is part of the code which is invisible to instrumentation, and
> before we issue rcu_nmi_enter(), we handle the in_tracer flag:
> 
> rcu_nmi_enter();
>  <int3>
>     (recursion_ctx->in_tracer == false)
>     set recursion_ctx->in_tracer = true
>     do_int3() {
>        rcu_nmi_enter();
>          <int3>
>             if (recursion_ctx->in_tracer == true)
>                 iret
> 
> We can change "in_tracer" for "in_breakpoint", "in_tracepoint" and
> "in_function_trace" if we ever want to allow different types of instrumentation
> to nest. I'm not sure whether this is useful or not through.

Kprobes already has its own "in_kprobe" flag, and the recursion path is
not so simple. Since the int3 replaces the original instruction, we have to
execute the original instruction with single-step and fixup.

This means it involves do_debug() too. Thus, we can not do iret directly
from do_int3 like above, but if recursion happens, we have no way to
recover to origonal execution path (and call BUG()).

As my previous email, I showed a patch which is something like
"bust_kprobes()" for oops path. That is not safe but no other way to escape
from this recursion hell. (Maybe we can try to call it instead of calling
BUG() so that the kernel can continue to run, but I'm not sure we can
safely make the pagetable to readonly again.)

Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-11  0:18               ` Masami Hiramatsu
@ 2020-03-11  0:37                 ` Mathieu Desnoyers
  2020-03-11  7:48                   ` Masami Hiramatsu
  0 siblings, 1 reply; 44+ messages in thread
From: Mathieu Desnoyers @ 2020-03-11  0:37 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: rostedt, Thomas Gleixner, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, paulmck, Joel Fernandes, Google,
	Frederic Weisbecker, Jason Wessel

----- On Mar 10, 2020, at 8:18 PM, Masami Hiramatsu mhiramat@kernel.org wrote:
[...]
 
>> An approach where the "in_tracer" flag is tested and set by the instrumentation
>> (function tracer, kprobes, tracepoints) would work here. Let's say the beginning
>> of the int3 ISR is part of the code which is invisible to instrumentation, and
>> before we issue rcu_nmi_enter(), we handle the in_tracer flag:
>> 
>> rcu_nmi_enter();
>>  <int3>
>>     (recursion_ctx->in_tracer == false)
>>     set recursion_ctx->in_tracer = true
>>     do_int3() {
>>        rcu_nmi_enter();
>>          <int3>
>>             if (recursion_ctx->in_tracer == true)
>>                 iret
>> 
>> We can change "in_tracer" for "in_breakpoint", "in_tracepoint" and
>> "in_function_trace" if we ever want to allow different types of instrumentation
>> to nest. I'm not sure whether this is useful or not through.
> 
> Kprobes already has its own "in_kprobe" flag, and the recursion path is
> not so simple. Since the int3 replaces the original instruction, we have to
> execute the original instruction with single-step and fixup.
> 
> This means it involves do_debug() too. Thus, we can not do iret directly
> from do_int3 like above, but if recursion happens, we have no way to
> recover to origonal execution path (and call BUG()).

I think that all the code involved when hitting a breakpoint which would
be the minimal subset required to act as if the kprobe was not there in the
first place (single-step, fixup) should be hidden from kprobes
instrumentation. I suspect this is the current intent today with noprobe
annotations, but Thomas' proposal brings this a step further.

However, any other kprobe code (and tracer callbacks) beyond that
minimalistic "effect-less" kprobe could be protected by a
per-recursion-context in_kprobe flag.

> As my previous email, I showed a patch which is something like
> "bust_kprobes()" for oops path. That is not safe but no other way to escape
> from this recursion hell. (Maybe we can try to call it instead of calling
> BUG() so that the kernel can continue to run, but I'm not sure we can
> safely make the pagetable to readonly again.)

As long as we provide a minimalistic "effect-less" kprobe implementation
in a non-instrumentable section which can be used whenever we are in a
recursion scenario, I think we could achieve something recursion-free without
requiring a bust_kprobes() work-around.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-11  0:37                 ` Mathieu Desnoyers
@ 2020-03-11  7:48                   ` Masami Hiramatsu
  0 siblings, 0 replies; 44+ messages in thread
From: Masami Hiramatsu @ 2020-03-11  7:48 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: rostedt, Thomas Gleixner, linux-kernel, Peter Zijlstra,
	Alexei Starovoitov, paulmck, Joel Fernandes, Google,
	Frederic Weisbecker, Jason Wessel

On Tue, 10 Mar 2020 20:37:41 -0400 (EDT)
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> ----- On Mar 10, 2020, at 8:18 PM, Masami Hiramatsu mhiramat@kernel.org wrote:
> [...]
>  
> >> An approach where the "in_tracer" flag is tested and set by the instrumentation
> >> (function tracer, kprobes, tracepoints) would work here. Let's say the beginning
> >> of the int3 ISR is part of the code which is invisible to instrumentation, and
> >> before we issue rcu_nmi_enter(), we handle the in_tracer flag:
> >> 
> >> rcu_nmi_enter();
> >>  <int3>
> >>     (recursion_ctx->in_tracer == false)
> >>     set recursion_ctx->in_tracer = true
> >>     do_int3() {
> >>        rcu_nmi_enter();
> >>          <int3>
> >>             if (recursion_ctx->in_tracer == true)
> >>                 iret
> >> 
> >> We can change "in_tracer" for "in_breakpoint", "in_tracepoint" and
> >> "in_function_trace" if we ever want to allow different types of instrumentation
> >> to nest. I'm not sure whether this is useful or not through.
> > 
> > Kprobes already has its own "in_kprobe" flag, and the recursion path is
> > not so simple. Since the int3 replaces the original instruction, we have to
> > execute the original instruction with single-step and fixup.
> > 
> > This means it involves do_debug() too. Thus, we can not do iret directly
> > from do_int3 like above, but if recursion happens, we have no way to
> > recover to origonal execution path (and call BUG()).
> 
> I think that all the code involved when hitting a breakpoint which would
> be the minimal subset required to act as if the kprobe was not there in the
> first place (single-step, fixup) should be hidden from kprobes
> instrumentation. I suspect this is the current intent today with noprobe
> annotations, but Thomas' proposal brings this a step further.
> 
> However, any other kprobe code (and tracer callbacks) beyond that
> minimalistic "effect-less" kprobe could be protected by a
> per-recursion-context in_kprobe flag.

Would you mean "in_kprobe" flag will prevent recursive execution of
kprobes but not prevent other tracer like tracepoint? If so, it is
already done I think. As I pointed, kprobe itself has in_kprobe like
flag for checking re-entrance. Thus the kprobe handler can call the
function which has a tracepoint safely.

Anyway, I agree with you to port all kprobe int3/debug handling parts
to the effect-less (offlimit) area, except for its pre/post handlers.

> > As my previous email, I showed a patch which is something like
> > "bust_kprobes()" for oops path. That is not safe but no other way to escape
> > from this recursion hell. (Maybe we can try to call it instead of calling
> > BUG() so that the kernel can continue to run, but I'm not sure we can
> > safely make the pagetable to readonly again.)
> 
> As long as we provide a minimalistic "effect-less" kprobe implementation
> in a non-instrumentable section which can be used whenever we are in a
> recursion scenario, I think we could achieve something recursion-free without
> requiring a bust_kprobes() work-around.

Yeah, I hope so. The bust_kprobes() is something like an emergency escape
hammer which everyone hopes never be used :)

Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-10 11:43       ` Thomas Gleixner
  2020-03-10 15:31         ` Mathieu Desnoyers
  2020-03-10 16:06         ` Masami Hiramatsu
@ 2020-03-12 13:53         ` Peter Zijlstra
  2 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2020-03-12 13:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Masami Hiramatsu, Steven Rostedt, LKML, Alexei Starovoitov,
	Mathieu Desnoyers, Paul E. McKenney, Joel Fernandes,
	Frederic Weisbecker, Jason Wessel

On Tue, Mar 10, 2020 at 12:43:27PM +0100, Thomas Gleixner wrote:
> That's why we want the sections and the annotation. If something calls
> out of a noinstr section into a regular text section and the call is not
> annotated at the call site, then objtool can complain and tell you. What
> Peter and I came up with looks like this:
> 
> noinstr foo()
> 	do_protected(); <- Safe because in the noinstr section
> 
> 	instr_begin();	<- Marks the begin of a safe region, ignored
>         		   by objtool
> 
>         do_stuff();     <- All good   
> 
>         instr_end();    <- End of the safe region. objtool starts
> 			   looking again
> 
>         do_other_stuff();  <- Unsafe because do_other_stuff() is
>         		      not protected
> and:
> 
> noinstr do_protected()
>         bar();		<- objtool will complain here
> 
> See?

Find here:

  https://lkml.kernel.org/r/20200312134107.700205216@infradead.org

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-09 19:07     ` Steven Rostedt
  2020-03-09 19:20       ` Mathieu Desnoyers
@ 2020-03-16 15:02       ` Joel Fernandes
  1 sibling, 0 replies; 44+ messages in thread
From: Joel Fernandes @ 2020-03-16 15:02 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Thomas Gleixner, LKML, Peter Zijlstra, Masami Hiramatsu,
	Alexei Starovoitov, Mathieu Desnoyers, Paul E. McKenney,
	Frederic Weisbecker, Daniel Bristot de Oliveira,
	Valentin Schneider

On Mon, Mar 09, 2020 at 03:07:09PM -0400, Steven Rostedt wrote:
> On Mon, 9 Mar 2020 11:42:28 -0700
> Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> > Just started a vacation here and will be back on January 12th. Will
> > take a detailed look at Thomas's email at that time.
> 
> January 12th! Wow! Enjoy your sabbatical and see you next year! ;-)

Haha, I am seriously getting picked on for that. But anyway, I am back now
because I missed you guys :-D

(Just kidding, obviously I meant March 12th).

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: Instrumentation and RCU
  2020-03-10  1:40   ` Alexei Starovoitov
  2020-03-10  8:02     ` Thomas Gleixner
  2020-03-10 16:54     ` Paul E. McKenney
@ 2020-03-17 17:56     ` Joel Fernandes
  2 siblings, 0 replies; 44+ messages in thread
From: Joel Fernandes @ 2020-03-17 17:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Mathieu Desnoyers, Thomas Gleixner, linux-kernel, Peter Zijlstra,
	rostedt, Masami Hiramatsu, Alexei Starovoitov, paulmck,
	Frederic Weisbecker, bpf

On Mon, Mar 09, 2020 at 06:40:45PM -0700, Alexei Starovoitov wrote:
> On Mon, Mar 09, 2020 at 02:37:40PM -0400, Mathieu Desnoyers wrote:
> > > 
> > >    But what's relevant is the tracer overhead which is e.g. inflicted
> > >    with todays trace_hardirqs_off/on() implementation because that
> > >    unconditionally uses the rcuidle variant with the scru/rcu_irq dance
> > >    around every tracepoint.
> > 
> > I think one of the big issues here is that most of the uses of
> > trace_hardirqs_off() are from sites which already have RCU watching,
> > so we are doing heavy-weight operations for nothing.
> 
> I think kernel/trace/trace_preemptirq.c created too many problems for the
> kernel without providing tangible benefits. My understanding no one is using it
> in production.

Hi Alexei,
There are various people use the preempt/irq disable tracepoints for last 2
years at Google and ARM. There's also a BPF tool (in BCC) that uses those for
tracing critical sections. Also Daniel Bristot's entire Preempt-IRQ formal
verification stuff depends on it.

> It's a tool to understand how kernel works. And such debugging
> tool can and should be removed.

If we go by that line of reasoning, then function tracing also should be
removed from the kernel.

I am glad Thomas and Peter are working on it and looking forward to seeing
the patches,

thanks,

 - Joel


> One of Thomas's patches mentioned that bpf can be invoked from hardirq and
> preempt tracers. This connection doesn't exist in a direct way, but
> theoretically it's possible. There is no practical use though and I would be
> happy to blacklist such bpf usage at a minimum.
> 
> > We could use the approach proposed by Peterz's and Steven's patches to basically
> > do a lightweight "is_rcu_watching()" check for rcuidle tracepoint, and only enable
> > RCU for those cases. We could then simply go back on using regular RCU like so:
> > 
> > #define __DO_TRACE(tp, proto, args, cond, rcuidle)                      \
> >         do {                                                            \
> >                 struct tracepoint_func *it_func_ptr;                    \
> >                 void *it_func;                                          \
> >                 void *__data;                                           \
> >                 bool exit_rcu = false;                                  \
> >                                                                         \
> >                 if (!(cond))                                            \
> >                         return;                                         \
> >                                                                         \
> >                 if (rcuidle && !rcu_is_watching()) {                    \
> >                         rcu_irq_enter_irqson();                         \
> >                         exit_rcu = true;                                \
> >                 }                                                       \
> >                 preempt_disable_notrace();                              \
> >                 it_func_ptr = rcu_dereference_raw((tp)->funcs);         \
> >                 if (it_func_ptr) {                                      \
> >                         do {                                            \
> >                                 it_func = (it_func_ptr)->func;          \
> >                                 __data = (it_func_ptr)->data;           \
> >                                 ((void(*)(proto))(it_func))(args);      \
> >                         } while ((++it_func_ptr)->func);                \
> >                 }                                                       \
> >                 preempt_enable_notrace();                               \
> >                 if (exit_rcu)                                           \
> >                         rcu_irq_exit_irqson();                          \
> >         } while (0)
> 
> I think it's a fine approach interim.
> 
> Long term sounds like Paul is going to provide sleepable and low overhead
> rcu_read_lock_for_tracers() that will include bpf.
> My understanding that this new rcu flavor won't have "idle" issues,
> so rcu_is_watching() checks will not be necessary.
> And if we remove trace_preemptirq.c the only thing left will be Thomas's points
> 1 (low level entry) and 2 (breakpoints) that can be addressed without
> creating fancy .text annotations and teach objtool about it.
> 
> In the mean time I've benchmarked srcu for sleepable bpf and it's quite heavy.
> srcu_read_lock+unlock roughly adds 10x execution cost to trivial bpf prog.
> I'm proceeding with it anyway, but really hoping that
> rcu_read_lock_for_tracers() will materialize soon.
> 
> In general I'm sceptical that .text annotations will work. Let's say all of
> idle is a red zone. But a ton of normal functions are called when idle. So
> objtool will go and mark them as red zone too. This way large percent of the
> kernel will be off limits for tracers. Which is imo not a good trade off. I
> think addressing 1 and 2 with explicit notrace/nokprobe annotations will cover
> all practical cases where people can shot themselves in a foot with a tracer. I
> realize that there will be forever whack-a-mole game and these annotations will
> never reach 100%. I think it's a fine trade off. Security is never 100% either.
> Tracing is never going to be 100% safe too.

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2020-03-17 17:56 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-09 17:02 Instrumentation and RCU Thomas Gleixner
2020-03-09 18:15 ` Steven Rostedt
2020-03-09 18:42   ` Joel Fernandes
2020-03-09 19:07     ` Steven Rostedt
2020-03-09 19:20       ` Mathieu Desnoyers
2020-03-16 15:02       ` Joel Fernandes
2020-03-09 18:59   ` Thomas Gleixner
2020-03-10  8:09     ` Masami Hiramatsu
2020-03-10 11:43       ` Thomas Gleixner
2020-03-10 15:31         ` Mathieu Desnoyers
2020-03-10 15:46           ` Steven Rostedt
2020-03-10 16:21             ` Mathieu Desnoyers
2020-03-11  0:18               ` Masami Hiramatsu
2020-03-11  0:37                 ` Mathieu Desnoyers
2020-03-11  7:48                   ` Masami Hiramatsu
2020-03-10 16:06         ` Masami Hiramatsu
2020-03-12 13:53         ` Peter Zijlstra
2020-03-10 15:24       ` Mathieu Desnoyers
2020-03-10 17:05       ` Daniel Thompson
2020-03-09 18:37 ` Mathieu Desnoyers
2020-03-09 18:44   ` Steven Rostedt
2020-03-09 18:52     ` Mathieu Desnoyers
2020-03-09 19:09       ` Steven Rostedt
2020-03-09 19:25         ` Mathieu Desnoyers
2020-03-09 19:52   ` Thomas Gleixner
2020-03-10 15:03     ` Mathieu Desnoyers
2020-03-10 16:48       ` Thomas Gleixner
2020-03-10 17:40         ` Mathieu Desnoyers
2020-03-10 18:31           ` Thomas Gleixner
2020-03-10 18:37             ` Mathieu Desnoyers
2020-03-10  1:40   ` Alexei Starovoitov
2020-03-10  8:02     ` Thomas Gleixner
2020-03-10 16:54     ` Paul E. McKenney
2020-03-17 17:56     ` Joel Fernandes
2020-03-09 20:18 ` Peter Zijlstra
2020-03-09 20:47 ` Paul E. McKenney
2020-03-09 20:58   ` Steven Rostedt
2020-03-09 21:25     ` Paul E. McKenney
2020-03-09 23:52   ` Frederic Weisbecker
2020-03-10  2:26     ` Paul E. McKenney
2020-03-10 15:13   ` Mathieu Desnoyers
2020-03-10 16:49     ` Paul E. McKenney
2020-03-10 17:22       ` Mathieu Desnoyers
2020-03-10 17:26         ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).