LKML Archive on
 help / color / Atom feed
From: Mathieu Desnoyers <>
To: Thomas Gleixner <>
Cc: linux-kernel <>,
	Peter Zijlstra <>,
	rostedt <>,
	Masami Hiramatsu <>,
	Alexei Starovoitov <>, paulmck <>,
	"Joel Fernandes, Google" <>,
	Frederic Weisbecker <>
Subject: Re: Instrumentation and RCU
Date: Tue, 10 Mar 2020 11:03:15 -0400 (EDT)
Message-ID: <> (raw)
In-Reply-To: <>

----- On Mar 9, 2020, at 3:52 PM, Thomas Gleixner wrote:

> Mathieu Desnoyers <> writes:
>> ----- On Mar 9, 2020, at 1:02 PM, Thomas Gleixner wrote:
>>> #3) RCU idle
>>>    are really more than questionable. For 99.9999% of instrumentation
>>>    users it's absolutely irrelevant whether this traces the interrupt
>>>    disabled time of user_exit_irqsoff() or rcu_irq_enter() or not.
>>>    But what's relevant is the tracer overhead which is e.g. inflicted
>>>    with todays trace_hardirqs_off/on() implementation because that
>>>    unconditionally uses the rcuidle variant with the scru/rcu_irq dance
>>>    around every tracepoint.
>> I think one of the big issues here is that most of the uses of
>> trace_hardirqs_off() are from sites which already have RCU watching,
>> so we are doing heavy-weight operations for nothing.
> That and in some places in the entry code we do the heavy weight
> operations to cover 100 instructions which turn on RCU anyway. That does
> not make any sense at all.
>> I strongly suspect that most of the overhead we've been trying to avoid when
>> introducing use of SRCU in rcuidle tracepoints was actually caused by callsites
>> which use rcuidle tracepoints while having RCU watching, just because there is a
>> handful of callsites which don't have RCU watching. This is confirmed
>> by the commit message of commit e6753f23d9 "tracepoint: Make rcuidle
>> tracepoint callers use SRCU":
>>    "In recent tests with IRQ on/off tracepoints, a large performance
>>     overhead ~10% is noticed when running hackbench. This is root caused to
>>     calls to rcu_irq_enter_irqson and rcu_irq_exit_irqson from the
>>     tracepoint code. Following a long discussion on the list [1] about this,
>>     we concluded that srcu is a better alternative for use during rcu idle.
>>     Although it does involve extra barriers, its lighter than the sched-rcu
>>     version which has to do additional RCU calls to notify RCU idle about
>>     entry into RCU sections.
>> [...]
>>     Test: Tested idle and preempt/irq tracepoints."
> In a quick test I did with a invalid syscall number with profiling the
> trace_hardirqs_off() is pretty prominent and goes down by roughly a
> factor of 2 when I move it past enter_from_user_mode() and use just the
> non RCU idle variant.

I think one issue here is that trace_hardirqs_off() is now shared between
lockdep and tracing. For lockdep, we have the following comment:

         * IRQ from user mode.
         * We need to tell lockdep that IRQs are off.  We can't do this until
         * we fix gsbase, and we should do it before enter_from_user_mode
         * (which can take locks).  Since TRACE_IRQS_OFF is idempotent,
         * the simplest way to handle it is to just call it twice if
         * we enter from user mode.  There's no reason to optimize this since
         * TRACE_IRQS_OFF is a no-op if lockdep is off.


        ENTER_IRQ_STACK old_rsp=%rdi save_ret=1
        /* We entered an interrupt context - irqs are off: */

which seems to imply that lockdep requires TRACE_IRQS_OFF to be performed
_before_ entering from usermode. I don't expect this to be useful at all for
other tracers though. I think this should be replaced by a new e.g.
LOCKDEP_ENTER_FROM_USER_MODE or such which would call into lockdep without
calling other tracers.

>> So I think we could go back to plain RCU for rcuidle tracepoints if we do
>> the cheaper "rcu_is_watching()" check rather than invoking
>> rcu_irq_{enter,exit}_irqson() unconditionally.
> Haven't tried that yet for the overall usage of trace_hardirqs_off(),
> but yes, it's going to be a measurable difference.

Ideally we should do both: change the TRACE_IRQS_OFF prior entering from usermode
for a lockdep-specific call, which would remove the frequent case which requires
temporarily enabling RCU, *and* change tracepoints to use rcu_is_watching(),
temporarily enable RCU, and use standard RCU for rcuidle cases.

>>>    Even if the tracepoint sits in the ASM code it just covers about ~20
>>>    low level ASM instructions more. The tracer invocation, which is
>>>    even done twice when coming from user space on x86 (the second call
>>>    is optimized in the tracer C-code), costs definitely way more
>>>    cycles. When you take the scru/rcu_irq dance into account it's a
>>>    complete disaster performance wise.
>> Part of the issue here is the current overhead of SRCU read-side lock,
>> which contains memory barriers. The other part of the issue is the fact that
>> rcu_irq_{enter,exit}_irqson() contains an atomic_add_return atomic instruction.
>> We could use the approach proposed by Peterz's and Steven's patches to basically
>> do a lightweight "is_rcu_watching()" check for rcuidle tracepoint, and only
>> enable
>> RCU for those cases. We could then simply go back on using regular RCU
>> like so:
> Right, but that still does the whole rcu_irq dance especially in the
> entry code just to trace 50 or 100 instructions which are turning on RCU
> anyway.

Agreed. Would changing this to a lockdep-specific call as I suggest above
solve this ?

>>> #4 Protecting call chains
>> In addition to splitting tracing code into a separate section, which I think
>> makes sense, I can think of another alternative way to provide call chains
>> protection: adding a "in_tracing" flag somewhere alongside each kernel stack.
>> Each thread and interrupt stack would have its own flag. However, trap handlers
>> should share the "in_tracing" flag with the context which triggers the trap.
>> If a tracer recurses, or if a tracer attempts to trace another tracer, the
>> instrumentation would break the recursion chain by preventing instrumentation
>> from firing. If we end up caring about tracers tracing other tracers, we could
>> have one distinct flag per tracer and let each tracer break the recursion chain.
>> Having this flag per kernel stack rather than per CPU or per thread would
>> allow tracing of nested interrupt handlers (and NMIs), but would break
>> call chains both within the same stack or going through a trap. I think
>> it could be a nice complementary safety net to handle mishaps in a non-fatal
>> way.
> That works as long as none of this uses breakpoint based patching to
> dynamically disable/enable stuff.

I'm clearly missing something here. I was expecting the "in_tracing" flag trick
to be able to fix the breakpoint recursion issue. What is the problem I'm missing
here ?

An additional point about splitting core code into separate non-instrumentable
sections: The part I'm concerned about is instrumentation of trap handlers.
_That_ will likely need to remain shared between the kernel and the tracers.
This is where I think the "in_tracing" flag approach would help us solve the
recursion issue without losing that important instrumentation coverage.



Mathieu Desnoyers
EfficiOS Inc.

  reply index

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-09 17:02 Thomas Gleixner
2020-03-09 18:15 ` Steven Rostedt
2020-03-09 18:42   ` Joel Fernandes
2020-03-09 19:07     ` Steven Rostedt
2020-03-09 19:20       ` Mathieu Desnoyers
2020-03-16 15:02       ` Joel Fernandes
2020-03-09 18:59   ` Thomas Gleixner
2020-03-10  8:09     ` Masami Hiramatsu
2020-03-10 11:43       ` Thomas Gleixner
2020-03-10 15:31         ` Mathieu Desnoyers
2020-03-10 15:46           ` Steven Rostedt
2020-03-10 16:21             ` Mathieu Desnoyers
2020-03-11  0:18               ` Masami Hiramatsu
2020-03-11  0:37                 ` Mathieu Desnoyers
2020-03-11  7:48                   ` Masami Hiramatsu
2020-03-10 16:06         ` Masami Hiramatsu
2020-03-12 13:53         ` Peter Zijlstra
2020-03-10 15:24       ` Mathieu Desnoyers
2020-03-10 17:05       ` Daniel Thompson
2020-03-09 18:37 ` Mathieu Desnoyers
2020-03-09 18:44   ` Steven Rostedt
2020-03-09 18:52     ` Mathieu Desnoyers
2020-03-09 19:09       ` Steven Rostedt
2020-03-09 19:25         ` Mathieu Desnoyers
2020-03-09 19:52   ` Thomas Gleixner
2020-03-10 15:03     ` Mathieu Desnoyers [this message]
2020-03-10 16:48       ` Thomas Gleixner
2020-03-10 17:40         ` Mathieu Desnoyers
2020-03-10 18:31           ` Thomas Gleixner
2020-03-10 18:37             ` Mathieu Desnoyers
2020-03-10  1:40   ` Alexei Starovoitov
2020-03-10  8:02     ` Thomas Gleixner
2020-03-10 16:54     ` Paul E. McKenney
2020-03-17 17:56     ` Joel Fernandes
2020-03-09 20:18 ` Peter Zijlstra
2020-03-09 20:47 ` Paul E. McKenney
2020-03-09 20:58   ` Steven Rostedt
2020-03-09 21:25     ` Paul E. McKenney
2020-03-09 23:52   ` Frederic Weisbecker
2020-03-10  2:26     ` Paul E. McKenney
2020-03-10 15:13   ` Mathieu Desnoyers
2020-03-10 16:49     ` Paul E. McKenney
2020-03-10 17:22       ` Mathieu Desnoyers
2020-03-10 17:26         ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LKML Archive on

Archives are clonable:
	git clone --mirror lkml/git/0.git
	git clone --mirror lkml/git/1.git
	git clone --mirror lkml/git/2.git
	git clone --mirror lkml/git/3.git
	git clone --mirror lkml/git/4.git
	git clone --mirror lkml/git/5.git
	git clone --mirror lkml/git/6.git
	git clone --mirror lkml/git/7.git
	git clone --mirror lkml/git/8.git
	git clone --mirror lkml/git/9.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ \
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:

AGPL code for this site: git clone