bpf.vger.kernel.org archive mirror
* Re: Instrumentation and RCU
       [not found] ` <1403546357.21810.1583779060302.JavaMail.zimbra@efficios.com>
@ 2020-03-10  1:40   ` Alexei Starovoitov
  2020-03-10  8:02     ` Thomas Gleixner
                       ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Alexei Starovoitov @ 2020-03-10  1:40 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, rostedt,
	Masami Hiramatsu, Alexei Starovoitov, paulmck, Joel Fernandes,
	Google, Frederic Weisbecker, bpf

On Mon, Mar 09, 2020 at 02:37:40PM -0400, Mathieu Desnoyers wrote:
> > 
> >    But what's relevant is the tracer overhead which is e.g. inflicted
> >    with todays trace_hardirqs_off/on() implementation because that
> >    unconditionally uses the rcuidle variant with the srcu/rcu_irq dance
> >    around every tracepoint.
> 
> I think one of the big issues here is that most of the uses of
> trace_hardirqs_off() are from sites which already have RCU watching,
> so we are doing heavy-weight operations for nothing.

I think kernel/trace/trace_preemptirq.c created too many problems for the
kernel without providing tangible benefits. My understanding is that no one is
using it in production. It's a tool for understanding how the kernel works,
and such a debugging tool can and should be removed.

One of Thomas's patches mentioned that bpf can be invoked from the hardirq and
preempt tracers. This connection doesn't exist in a direct way, but
theoretically it's possible. There is no practical use, though, and I would be
happy to blacklist such bpf usage at a minimum.

> We could use the approach proposed by PeterZ's and Steven's patches to basically
> do a lightweight "rcu_is_watching()" check for rcuidle tracepoints, and only enable
> RCU for those cases. We could then simply go back to using regular RCU like so:
> 
> #define __DO_TRACE(tp, proto, args, cond, rcuidle)                      \
>         do {                                                            \
>                 struct tracepoint_func *it_func_ptr;                    \
>                 void *it_func;                                          \
>                 void *__data;                                           \
>                 bool exit_rcu = false;                                  \
>                                                                         \
>                 if (!(cond))                                            \
>                         return;                                         \
>                                                                         \
>                 if (rcuidle && !rcu_is_watching()) {                    \
>                         rcu_irq_enter_irqson();                         \
>                         exit_rcu = true;                                \
>                 }                                                       \
>                 preempt_disable_notrace();                              \
>                 it_func_ptr = rcu_dereference_raw((tp)->funcs);         \
>                 if (it_func_ptr) {                                      \
>                         do {                                            \
>                                 it_func = (it_func_ptr)->func;          \
>                                 __data = (it_func_ptr)->data;           \
>                                 ((void(*)(proto))(it_func))(args);      \
>                         } while ((++it_func_ptr)->func);                \
>                 }                                                       \
>                 preempt_enable_notrace();                               \
>                 if (exit_rcu)                                           \
>                         rcu_irq_exit_irqson();                          \
>         } while (0)

I think it's a fine interim approach.

Long term, it sounds like Paul is going to provide a sleepable and low overhead
rcu_read_lock_for_tracers() that will include bpf.
My understanding is that this new RCU flavor won't have "idle" issues,
so rcu_is_watching() checks will not be necessary.
And if we remove trace_preemptirq.c, the only things left will be Thomas's points
1 (low level entry) and 2 (breakpoints), which can be addressed without
creating fancy .text annotations and teaching objtool about them.

In the meantime I've benchmarked SRCU for sleepable bpf and it's quite heavy.
srcu_read_lock()+srcu_read_unlock() roughly adds 10x execution cost to a
trivial bpf prog. I'm proceeding with it anyway, but really hoping that
rcu_read_lock_for_tracers() will materialize soon.

In general I'm sceptical that .text annotations will work. Let's say all of
idle is a red zone. But a ton of normal functions are called when idle, so
objtool will go and mark them as red zone too. This way a large percentage of
the kernel will be off limits for tracers, which is imo not a good trade off.
I think addressing 1 and 2 with explicit notrace/nokprobe annotations will
cover all practical cases where people can shoot themselves in the foot with a
tracer. I realize that there will be a forever whack-a-mole game and these
annotations will never reach 100%. I think it's a fine trade off. Security is
never 100% either; tracing is never going to be 100% safe either.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Instrumentation and RCU
  2020-03-10  1:40   ` Instrumentation and RCU Alexei Starovoitov
@ 2020-03-10  8:02     ` Thomas Gleixner
  2020-03-10 16:54     ` Paul E. McKenney
  2020-03-17 17:56     ` Joel Fernandes
  2 siblings, 0 replies; 4+ messages in thread
From: Thomas Gleixner @ 2020-03-10  8:02 UTC (permalink / raw)
  To: Alexei Starovoitov, Mathieu Desnoyers
  Cc: linux-kernel, Peter Zijlstra, rostedt, Masami Hiramatsu,
	Alexei Starovoitov, paulmck, Joel Fernandes, Google,
	Frederic Weisbecker, bpf

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> In general I'm sceptical that .text annotations will work. Let's say all of
> idle is a red zone. But a ton of normal functions are called when idle. So
> objtool will go and mark them as red zone too.

No. If you carefully read what I proposed, it's:

noinstr foo()
{
        protected_work();
        
        instr_begin();

        invoke_otherstuff();

        instr_end();

        moar_protected_work();

}

objtool will not mark anything. It will check that invocations out of
the protected section are marked as safe, i.e. inside an
instr_begin()/instr_end() pair.

So if you fail to mark protected_work() as noinstr, then it will
complain. If you forget to put instr_begin()/instr_end() around the safe
area, it will complain about invoke_otherstuff().

So it's a very targeted approach. objtool is there to verify that it's
consistent, nothing else.

> This way a large percentage of the
> kernel will be off limits for tracers, which is imo not a good trade off. I
> think addressing 1 and 2 with explicit notrace/nokprobe annotations will cover
> all practical cases where people can shoot themselves in the foot with a
> tracer.

That's simply wishful thinking. The discussions of the last few weeks have
clearly demonstrated that this is not the case. People were truly
convinced that e.g. probing rcu_idle_exit() was safe, but it was
not. Read the thread to see how long this went on.

> I realize that there will be a forever whack-a-mole game and these
> annotations will never reach 100%. I think it's a fine trade
> off. Security is never 100% either; tracing is never going to be 100%
> safe either.

I disagree. Whack-a-mole games are horrible and have a guaranteed
high failure rate. Otherwise we would not be discussing this at all.

And no, it's not a fine trade off.

If we can have technical means to prevent the wreckage, then not using
them for handwaving reasons is just violating the only sane engineering
principle:

        Correctness first

I spent the last 20 years mopping up the violations of this principle.

We have to stop the "features first, performance first" and "good
enough" mentality if we want to master the ever increasing complexity of
hardware and software in the long run.

From my experience of cleaning up stuff, I can tell you that
correctness first neither hurts performance nor prevents
features, except those which are wrong to begin with.

As quite a few people do not care about, or even willfully ignore,
"correctness first", we have to force them to adhere by technical means,
which spares us from mopping up the mess they'd create otherwise.

And even for those who deeply care, tooling support is a great help in
preventing accidental slip-ups. I wish I could have been spared chasing call
chains manually, only to figure out two days later that I had missed
something.

Thanks,

        tglx




* Re: Instrumentation and RCU
  2020-03-10  1:40   ` Instrumentation and RCU Alexei Starovoitov
  2020-03-10  8:02     ` Thomas Gleixner
@ 2020-03-10 16:54     ` Paul E. McKenney
  2020-03-17 17:56     ` Joel Fernandes
  2 siblings, 0 replies; 4+ messages in thread
From: Paul E. McKenney @ 2020-03-10 16:54 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Mathieu Desnoyers, Thomas Gleixner, linux-kernel, Peter Zijlstra,
	rostedt, Masami Hiramatsu, Alexei Starovoitov, Joel Fernandes,
	Google, Frederic Weisbecker, bpf

On Mon, Mar 09, 2020 at 06:40:45PM -0700, Alexei Starovoitov wrote:
> On Mon, Mar 09, 2020 at 02:37:40PM -0400, Mathieu Desnoyers wrote:
> > > 
> > >    But what's relevant is the tracer overhead which is e.g. inflicted
> > >    with todays trace_hardirqs_off/on() implementation because that
> > >    unconditionally uses the rcuidle variant with the srcu/rcu_irq dance
> > >    around every tracepoint.
> > 
> > I think one of the big issues here is that most of the uses of
> > trace_hardirqs_off() are from sites which already have RCU watching,
> > so we are doing heavy-weight operations for nothing.
> 
> I think kernel/trace/trace_preemptirq.c created too many problems for the
> kernel without providing tangible benefits. My understanding is that no one is
> using it in production. It's a tool for understanding how the kernel works,
> and such a debugging tool can and should be removed.
> 
> One of Thomas's patches mentioned that bpf can be invoked from hardirq and
> preempt tracers. This connection doesn't exist in a direct way, but
> theoretically it's possible. There is no practical use though and I would be
> happy to blacklist such bpf usage at a minimum.
> 
> > We could use the approach proposed by PeterZ's and Steven's patches to basically
> > do a lightweight "rcu_is_watching()" check for rcuidle tracepoints, and only enable
> > RCU for those cases. We could then simply go back to using regular RCU like so:
> > 
> > #define __DO_TRACE(tp, proto, args, cond, rcuidle)                      \
> >         do {                                                            \
> >                 struct tracepoint_func *it_func_ptr;                    \
> >                 void *it_func;                                          \
> >                 void *__data;                                           \
> >                 bool exit_rcu = false;                                  \
> >                                                                         \
> >                 if (!(cond))                                            \
> >                         return;                                         \
> >                                                                         \
> >                 if (rcuidle && !rcu_is_watching()) {                    \
> >                         rcu_irq_enter_irqson();                         \
> >                         exit_rcu = true;                                \
> >                 }                                                       \
> >                 preempt_disable_notrace();                              \
> >                 it_func_ptr = rcu_dereference_raw((tp)->funcs);         \
> >                 if (it_func_ptr) {                                      \
> >                         do {                                            \
> >                                 it_func = (it_func_ptr)->func;          \
> >                                 __data = (it_func_ptr)->data;           \
> >                                 ((void(*)(proto))(it_func))(args);      \
> >                         } while ((++it_func_ptr)->func);                \
> >                 }                                                       \
> >                 preempt_enable_notrace();                               \
> >                 if (exit_rcu)                                           \
> >                         rcu_irq_exit_irqson();                          \
> >         } while (0)
> 
> I think it's a fine interim approach.
> 
> Long term, it sounds like Paul is going to provide a sleepable and low overhead
> rcu_read_lock_for_tracers() that will include bpf.

It now builds without errors, so the obvious problems are taken care of...

Working on the less-obvious errors as rcutorture encounters them.

> My understanding is that this new RCU flavor won't have "idle" issues,
> so rcu_is_watching() checks will not be necessary.

True.  However, if the from-idle code invokes other code relying on
vanilla RCU, such checks are still required.  But I must let others
weigh in on this.

> And if we remove trace_preemptirq.c, the only things left will be Thomas's points
> 1 (low level entry) and 2 (breakpoints), which can be addressed without
> creating fancy .text annotations and teaching objtool about them.

And the intent is to cover these cases as well.  Of course, we all know
which road is paved with good intentions.  ;-)

> In the meantime I've benchmarked SRCU for sleepable bpf and it's quite heavy.
> srcu_read_lock()+srcu_read_unlock() roughly adds 10x execution cost to a
> trivial bpf prog. I'm proceeding with it anyway, but really hoping that
> rcu_read_lock_for_tracers() will materialize soon.

OK, 10x is a bit on the painful side!

							Thanx, Paul

> In general I'm sceptical that .text annotations will work. Let's say all of
> idle is a red zone. But a ton of normal functions are called when idle, so
> objtool will go and mark them as red zone too. This way a large percentage of
> the kernel will be off limits for tracers, which is imo not a good trade off.
> I think addressing 1 and 2 with explicit notrace/nokprobe annotations will
> cover all practical cases where people can shoot themselves in the foot with a
> tracer. I realize that there will be a forever whack-a-mole game and these
> annotations will never reach 100%. I think it's a fine trade off. Security is
> never 100% either; tracing is never going to be 100% safe either.


* Re: Instrumentation and RCU
  2020-03-10  1:40   ` Instrumentation and RCU Alexei Starovoitov
  2020-03-10  8:02     ` Thomas Gleixner
  2020-03-10 16:54     ` Paul E. McKenney
@ 2020-03-17 17:56     ` Joel Fernandes
  2 siblings, 0 replies; 4+ messages in thread
From: Joel Fernandes @ 2020-03-17 17:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Mathieu Desnoyers, Thomas Gleixner, linux-kernel, Peter Zijlstra,
	rostedt, Masami Hiramatsu, Alexei Starovoitov, paulmck,
	Frederic Weisbecker, bpf

On Mon, Mar 09, 2020 at 06:40:45PM -0700, Alexei Starovoitov wrote:
> On Mon, Mar 09, 2020 at 02:37:40PM -0400, Mathieu Desnoyers wrote:
> > > 
> > >    But what's relevant is the tracer overhead which is e.g. inflicted
> > >    with todays trace_hardirqs_off/on() implementation because that
> > >    unconditionally uses the rcuidle variant with the srcu/rcu_irq dance
> > >    around every tracepoint.
> > 
> > I think one of the big issues here is that most of the uses of
> > trace_hardirqs_off() are from sites which already have RCU watching,
> > so we are doing heavy-weight operations for nothing.
> 
> I think kernel/trace/trace_preemptirq.c created too many problems for the
> kernel without providing tangible benefits. My understanding is that no one is
> using it in production.

Hi Alexei,
There are various people who have been using the preempt/irq disable
tracepoints for the last 2 years at Google and ARM. There's also a BPF tool
(in BCC) that uses them for tracing critical sections. Also, Daniel Bristot's
entire preempt/IRQ formal verification work depends on them.

> It's a tool for understanding how the kernel works, and such a debugging
> tool can and should be removed.

If we go by that line of reasoning, then function tracing should also be
removed from the kernel.

I am glad Thomas and Peter are working on it, and I'm looking forward to
seeing the patches.

thanks,

 - Joel


> One of Thomas's patches mentioned that bpf can be invoked from hardirq and
> preempt tracers. This connection doesn't exist in a direct way, but
> theoretically it's possible. There is no practical use though and I would be
> happy to blacklist such bpf usage at a minimum.
> 
> > We could use the approach proposed by PeterZ's and Steven's patches to basically
> > do a lightweight "rcu_is_watching()" check for rcuidle tracepoints, and only enable
> > RCU for those cases. We could then simply go back to using regular RCU like so:
> > 
> > #define __DO_TRACE(tp, proto, args, cond, rcuidle)                      \
> >         do {                                                            \
> >                 struct tracepoint_func *it_func_ptr;                    \
> >                 void *it_func;                                          \
> >                 void *__data;                                           \
> >                 bool exit_rcu = false;                                  \
> >                                                                         \
> >                 if (!(cond))                                            \
> >                         return;                                         \
> >                                                                         \
> >                 if (rcuidle && !rcu_is_watching()) {                    \
> >                         rcu_irq_enter_irqson();                         \
> >                         exit_rcu = true;                                \
> >                 }                                                       \
> >                 preempt_disable_notrace();                              \
> >                 it_func_ptr = rcu_dereference_raw((tp)->funcs);         \
> >                 if (it_func_ptr) {                                      \
> >                         do {                                            \
> >                                 it_func = (it_func_ptr)->func;          \
> >                                 __data = (it_func_ptr)->data;           \
> >                                 ((void(*)(proto))(it_func))(args);      \
> >                         } while ((++it_func_ptr)->func);                \
> >                 }                                                       \
> >                 preempt_enable_notrace();                               \
> >                 if (exit_rcu)                                           \
> >                         rcu_irq_exit_irqson();                          \
> >         } while (0)
> 
> I think it's a fine interim approach.
> 
> Long term, it sounds like Paul is going to provide a sleepable and low overhead
> rcu_read_lock_for_tracers() that will include bpf.
> My understanding is that this new RCU flavor won't have "idle" issues,
> so rcu_is_watching() checks will not be necessary.
> And if we remove trace_preemptirq.c, the only things left will be Thomas's points
> 1 (low level entry) and 2 (breakpoints), which can be addressed without
> creating fancy .text annotations and teaching objtool about them.
> 
> In the meantime I've benchmarked SRCU for sleepable bpf and it's quite heavy.
> srcu_read_lock()+srcu_read_unlock() roughly adds 10x execution cost to a
> trivial bpf prog. I'm proceeding with it anyway, but really hoping that
> rcu_read_lock_for_tracers() will materialize soon.
> 
> In general I'm sceptical that .text annotations will work. Let's say all of
> idle is a red zone. But a ton of normal functions are called when idle, so
> objtool will go and mark them as red zone too. This way a large percentage of
> the kernel will be off limits for tracers, which is imo not a good trade off.
> I think addressing 1 and 2 with explicit notrace/nokprobe annotations will
> cover all practical cases where people can shoot themselves in the foot with a
> tracer. I realize that there will be a forever whack-a-mole game and these
> annotations will never reach 100%. I think it's a fine trade off. Security is
> never 100% either; tracing is never going to be 100% safe either.


end of thread, other threads:[~2020-03-17 17:56 UTC | newest]

Thread overview: 4+ messages
     [not found] <87mu8p797b.fsf@nanos.tec.linutronix.de>
     [not found] ` <1403546357.21810.1583779060302.JavaMail.zimbra@efficios.com>
2020-03-10  1:40   ` Instrumentation and RCU Alexei Starovoitov
2020-03-10  8:02     ` Thomas Gleixner
2020-03-10 16:54     ` Paul E. McKenney
2020-03-17 17:56     ` Joel Fernandes
