From: Ingo Molnar <mingo@kernel.org>
To: Andy Lutomirski <luto@amacapital.net>
Cc: Rik van Riel <riel@redhat.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
X86 ML <x86@kernel.org>,
williams@redhat.com, Andrew Lutomirski <luto@kernel.org>,
fweisbec@redhat.com, Peter Zijlstra <peterz@infradead.org>,
Heiko Carstens <heiko.carstens@de.ibm.com>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>,
"Paul E. McKenney" <paulmck@us.ibm.com>,
Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
Date: Sat, 2 May 2015 07:27:33 +0200 [thread overview]
Message-ID: <20150502052733.GA9983@gmail.com> (raw)
In-Reply-To: <CALCETrX2Di19atf4+Nx5cCOxbqoFdx+USLJ-WRSBy_Se25RE-g@mail.gmail.com>
* Andy Lutomirski <luto@amacapital.net> wrote:
> On Fri, May 1, 2015 at 12:11 PM, Rik van Riel <riel@redhat.com> wrote:
> > On 05/01/2015 02:40 PM, Ingo Molnar wrote:
> >
> >> Or we could do that in the syscall path with a single store of a
> >> constant flag to a location in the task struct. We have a number of
> >> natural flags that get written on syscall entry, such as:
> >>
> >> pushq_cfi $__USER_DS /* pt_regs->ss */
> >>
> >> That goes to a constant location on the kernel stack. On return from
> >> system calls we could write 0 to that location.
>
> Huh? IRET with zero there will fault, and we can't guarantee that
> all syscalls return using sysret. [...]
So IRET is a slowpath - I was thinking about the fastpath mainly.
> [...] Also, we'd have to audit all the entries, and
> system_call_after_swapgs currently enables interrupts early enough
> that an interrupt before all the pushes will do unpredictable things
> to pt_regs.
An irq hardware frame won't push zero to that selector value, right?
That's the only bad thing that would confuse the code.
> We could further abuse orig_ax, but that would require a lot of
> auditing. Honestly, though, I think keeping a flag in an
> otherwise-hot cache line is a better bet. [...]
That would work too, at the cost of one more instruction, as now we'd
have to both set and clear it.
> [...] Also, making it per-cpu instead of per-task will probably be
> easier on the RCU code, since otherwise the RCU code will have to do
> some kind of synchronization (even if it's a wait-free probe of the
> rq lock or similar) to figure out which pt_regs to look at.
So the synchronize_rcu() part is an even slower slow path, in
comparison with system call entry overhead.
But yes, safely accessing the remote task is a complication, but with
such a scheme we cannot avoid it, we'd still have to set TIF_NOHZ. So
even if we do:
> If we went that route, I'd advocate sticking the flag in tss->sp1.
> That cacheline is unconditionally read on kernel entry already, and
> sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I
> lost track). [1]
>
> Alternatively, I'd suggest that we actually add a whole new word to
> pt_regs.
... we'd still have to set TIF_NOHZ and are back to square one in
terms of race complexity.
But it's not overly complex: by taking the remote CPU's rq-lock from
synchronize_rcu() we could get a stable pointer to the currently
executing task.
And if we do that, we might as well use the opportunity and take a
look at pt_regs as well - this is why sticking it into pt_regs might
be better.
So I'd:
- enlarge pt_regs by a word and stick the flag there (this
allocation is essentially free)
- update the word from entry/exit
- synchronize_rcu() avoids having to send an IPI by taking a
peak at rq->curr's pt_regs::flag, and if:
- the flag is 0 then it has observed a quiescent state.
- the flag is 1, then it would set TIF_NOHZ and wait for a
completion from a TIF_NOHZ callback.
synchronize_rcu() often involves waiting for (timer tick driven) grace
periods anyway, so this is a relatively fast solution - and it would
limit the overhead to 2 instructions.
On systems that have zero nohz-full CPUs (i.e. !context_tracking_enabled)
we could patch out those two instructions into NOPs, which would be
eliminated in the decoder.
Regarding the user/kernel execution time split measurement:
1) the same flag could be used to sample a remote CPU's statistics
from another CPU and update the stats in the currently executing task.
As long as there's at least one non-nohz-full CPU, this would work. Or
are there systems were all CPUs are nohz-full?
2) Alternatively we could just drive user/kernel split statistics from
context switches, which would be inaccurate if the workload is
SCHED_FIFO that only rarely context switches.
How does this sound?
Thanks,
Ingo
next prev parent reply other threads:[~2015-05-02 5:27 UTC|newest]
Thread overview: 83+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-04-30 21:23 [PATCH 0/3] reduce nohz_full syscall overhead by 10% riel
2015-04-30 21:23 ` [PATCH 1/3] reduce indentation in __acct_update_integrals riel
2015-04-30 21:23 ` [PATCH 2/3] remove local_irq_save from __acct_update_integrals riel
2015-04-30 21:23 ` [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry riel
2015-04-30 21:56 ` Andy Lutomirski
2015-05-01 6:40 ` Ingo Molnar
2015-05-01 15:20 ` Rik van Riel
2015-05-01 15:59 ` Ingo Molnar
2015-05-01 16:03 ` Andy Lutomirski
2015-05-01 16:21 ` Ingo Molnar
2015-05-01 16:26 ` Rik van Riel
2015-05-01 16:34 ` Ingo Molnar
2015-05-01 18:05 ` Rik van Riel
2015-05-01 18:40 ` Ingo Molnar
2015-05-01 19:11 ` Rik van Riel
2015-05-01 19:37 ` Andy Lutomirski
2015-05-02 5:27 ` Ingo Molnar [this message]
2015-05-02 18:27 ` Rik van Riel
2015-05-03 18:41 ` Andy Lutomirski
2015-05-07 10:35 ` Ingo Molnar
2015-05-04 9:26 ` Paolo Bonzini
2015-05-04 13:30 ` Rik van Riel
2015-05-04 14:06 ` Rik van Riel
2015-05-04 14:19 ` Rik van Riel
2015-05-04 15:59 ` question about RCU dynticks_nesting Rik van Riel
2015-05-04 18:39 ` Paul E. McKenney
2015-05-04 19:39 ` Rik van Riel
2015-05-04 20:02 ` Paul E. McKenney
2015-05-04 20:13 ` Rik van Riel
2015-05-04 20:38 ` Paul E. McKenney
2015-05-04 20:53 ` Rik van Riel
2015-05-05 5:54 ` Paul E. McKenney
2015-05-06 1:49 ` Mike Galbraith
2015-05-06 3:44 ` Mike Galbraith
2015-05-06 6:06 ` Paul E. McKenney
2015-05-06 6:52 ` Mike Galbraith
2015-05-06 7:01 ` Mike Galbraith
2015-05-07 0:59 ` Frederic Weisbecker
2015-05-07 15:44 ` Rik van Riel
2015-05-04 19:00 ` Rik van Riel
2015-05-04 19:39 ` Paul E. McKenney
2015-05-04 19:59 ` Rik van Riel
2015-05-04 20:40 ` Paul E. McKenney
2015-05-05 10:53 ` Peter Zijlstra
2015-05-05 12:34 ` Paul E. McKenney
2015-05-05 13:00 ` Peter Zijlstra
2015-05-05 18:35 ` Paul E. McKenney
2015-05-05 21:09 ` Rik van Riel
2015-05-06 5:41 ` Paul E. McKenney
2015-05-05 10:48 ` Peter Zijlstra
2015-05-05 10:51 ` Peter Zijlstra
2015-05-05 12:30 ` Paul E. McKenney
2015-05-02 4:06 ` [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry Mike Galbraith
2015-05-01 16:37 ` Ingo Molnar
2015-05-01 16:40 ` Rik van Riel
2015-05-01 16:45 ` Ingo Molnar
2015-05-01 16:54 ` Rik van Riel
2015-05-01 17:12 ` Ingo Molnar
2015-05-01 17:22 ` Rik van Riel
2015-05-01 17:59 ` Ingo Molnar
2015-05-01 16:22 ` Rik van Riel
2015-05-01 16:27 ` Ingo Molnar
2015-05-03 13:23 ` Mike Galbraith
2015-05-03 17:30 ` Rik van Riel
2015-05-03 18:24 ` Andy Lutomirski
2015-05-03 18:52 ` Rik van Riel
2015-05-07 10:48 ` Ingo Molnar
2015-05-07 12:18 ` Frederic Weisbecker
2015-05-07 12:29 ` Ingo Molnar
2015-05-07 15:47 ` Rik van Riel
2015-05-08 7:58 ` Ingo Molnar
2015-05-07 12:22 ` Andy Lutomirski
2015-05-07 12:44 ` Ingo Molnar
2015-05-07 12:49 ` Ingo Molnar
2015-05-08 6:17 ` Paul E. McKenney
2015-05-07 12:52 ` Andy Lutomirski
2015-05-07 15:08 ` Ingo Molnar
2015-05-07 17:47 ` Andy Lutomirski
2015-05-08 6:37 ` Ingo Molnar
2015-05-08 10:59 ` Andy Lutomirski
2015-05-08 11:27 ` Ingo Molnar
2015-05-08 12:56 ` Andy Lutomirski
2015-05-08 13:27 ` Ingo Molnar
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20150502052733.GA9983@gmail.com \
--to=mingo@kernel.org \
--cc=fweisbec@redhat.com \
--cc=heiko.carstens@de.ibm.com \
--cc=linux-kernel@vger.kernel.org \
--cc=luto@amacapital.net \
--cc=luto@kernel.org \
--cc=mingo@redhat.com \
--cc=paulmck@us.ibm.com \
--cc=pbonzini@redhat.com \
--cc=peterz@infradead.org \
--cc=riel@redhat.com \
--cc=tglx@linutronix.de \
--cc=torvalds@linux-foundation.org \
--cc=williams@redhat.com \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).