* RFC: Petition Intel/AMD to add POPF_IF insn
@ 2016-08-17 17:20 Denys Vlasenko
2016-08-17 17:30 ` Christian König
2016-08-17 17:32 ` Linus Torvalds
0 siblings, 2 replies; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-17 17:20 UTC (permalink / raw)
To: Ingo Molnar, Andy Lutomirski, Dan Williams, Alex Deucher,
Rafael J. Wysocki, Christian König, Vinod Koul,
Johannes Berg, Sara Sharon, Adrian Hunter, Linus Torvalds
Cc: x86, LKML
Last year, a proposal was floated to avoid costly POPF.
In particular, each spin_unlock_irqrestore() does it,
a rather common operation.
https://lkml.org/lkml/2015/4/21/290
[RFC PATCH] x86/asm/irq: Don't use POPF but STI
* Andy Lutomirski <luto@kernel.org> wrote:
> > Another different approach would be to formally state that
> > pv_irq_ops.save_fl() needs to return all the flags, which would
> > make local_irq_save() safe to use in this circumstance, but that
> > makes a hotpath longer for the sake of a single boot time check.
>
> ...which reminds me:
>
> Why does native_restore_fl restore anything other than IF? A branch
> and sti should be considerably faster than popf.
Ingo agreed:
====
Yes, this has come up in the past, something like the patch below?
Totally untested and not signed off yet: because we'd first have to
make sure (via irq flags debugging) that it's not used in reverse, to
re-disable interrupts:
local_irq_save(flags);
local_irq_enable();
...
local_irq_restore(flags); /* effective local_irq_disable() */
I don't think we have many (any?) such patterns left, but it has to be
checked first. If we have such cases then we'll have to use different
primitives there.
====
Linus replied:
=====
"popf" is fast for the "no changes to IF" case, and is a smaller
instruction anyway.
====
This basically shot down the proposal.
But in my measurements POPF is not fast even in the case where the restored
flags are not changed at all:
mov $200*1000*1000, %eax
pushf
pop %rbx
.balign 64
loop: push %rbx
popf
dec %eax
jnz loop
# perf stat -r20 ./popf_1g
4,929,012,093 cycles # 3.412 GHz ( +- 0.02% )
835,721,371 instructions # 0.17 insn per cycle ( +- 0.02% )
1.446185359 seconds time elapsed ( +- 0.46% )
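A complete harness along these lines (a reconstruction, not the original
popf_1g source; the loop overhead differs slightly from the hand-aligned
asm above) can be written in GNU C:

/* popf_1g.c (sketch): time 200M push+popf pairs restoring unchanged flags.
 * Build: gcc -O2 popf_1g.c -o popf_1g;  run: perf stat -r20 ./popf_1g */
int main(void)
{
        unsigned long flags, i;

        asm volatile("pushf; pop %0" : "=r" (flags));   /* current EFLAGS */
        for (i = 0; i < 200UL*1000*1000; i++)
                asm volatile("push %0; popf"            /* restore same flags */
                             : /* no output */
                             : "g" (flags)
                             : "memory", "cc");
        return 0;
}

Swapping the asm body for "push %0; pop %%rcx" (with an "rcx" clobber)
reproduces the pop-into-unused-register variant below.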
If I replace POPF with a pop into an unused register, I get this:
loop: push %rbx
pop %rcx
dec %eax
jnz loop
209,247,645 cycles # 3.209 GHz ( +- 0.11% )
801,898,016 instructions # 3.83 insn per cycle ( +- 0.00% )
0.066210725 seconds time elapsed ( +- 0.59% )
IOW, POPF takes at least 6 cycles.
Linus does have a point that a "test+branch+STI" may end up not being a clear win because of the branch.
But the need to restore the IF flag exists not only in Linux, but in any OS
running on x86: they all have some sort of spinlock.
The addition of a POPF instruction variant which looks only at the IF bit and
changes only that bit in EFLAGS may be a good idea for all OSes.
I propose that we ask Intel / AMD to do that.
Maybe via the "ignored prefix" trick which was used when the LZCNT insn was
introduced as a REPZ-prefixed BSR?
Currently, the REPZ POPF (f3 9d) insn does execute. Redefine this opcode as
POPF_IF, and the same kernel will work on old and new CPUs.
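A sketch of how a kernel could emit that encoding today (POPF_IF itself is
hypothetical; the byte sequence is just REPZ POPF, which current CPUs execute
as a plain POPF):

/* Hypothetical POPF_IF: "f3 9d" = REPZ POPF. On current CPUs this executes
 * as a plain POPF (full EFLAGS restore); on a CPU implementing the proposal
 * it would restore only EFLAGS.IF. */
static inline void popf_if(unsigned long flags)
{
        asm volatile("push %0; .byte 0xf3, 0x9d"
                     : /* no output */
                     : "g" (flags)
                     : "memory", "cc");
}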
CC'ing some @intel and @amd emails...
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-17 17:20 RFC: Petition Intel/AMD to add POPF_IF insn Denys Vlasenko
@ 2016-08-17 17:30 ` Christian König
2016-08-17 18:34 ` Denys Vlasenko
2016-08-17 17:32 ` Linus Torvalds
1 sibling, 1 reply; 25+ messages in thread
From: Christian König @ 2016-08-17 17:30 UTC (permalink / raw)
To: Denys Vlasenko, Ingo Molnar, Andy Lutomirski, Dan Williams,
Alex Deucher, Rafael J. Wysocki, Vinod Koul, Johannes Berg,
Sara Sharon, Adrian Hunter, Linus Torvalds
Cc: x86, LKML
Well, first of all, you actually CCed AMD's graphics developers, so Alex
and I can't say much about CPU optimizations.
In addition to that, a comment below.
Am 17.08.2016 um 19:20 schrieb Denys Vlasenko:
> Last year, a proposal was floated to avoid costly POPF.
> In particular, each spin_unlock_irqrestore() does it,
> a rather common operation.
>
> https://lkml.org/lkml/2015/4/21/290
> [RFC PATCH] x86/asm/irq: Don't use POPF but STI
>
> * Andy Lutomirski <luto@kernel.org> wrote:
>> > Another different approach would be to formally state that
>> > pv_irq_ops.save_fl() needs to return all the flags, which would
>> > make local_irq_save() safe to use in this circumstance, but that
>> > makes a hotpath longer for the sake of a single boot time check.
>>
>> ...which reminds me:
>>
>> Why does native_restore_fl restore anything other than IF? A branch
>> and sti should be considerably faster than popf.
>
>
> Ingo agreed:
> ====
> Yes, this has come up in the past, something like the patch below?
>
> Totally untested and not signed off yet: because we'd first have to
> make sure (via irq flags debugging) that it's not used in reverse, to
> re-disable interrupts:
> local_irq_save(flags);
> local_irq_enable();
> ...
> local_irq_restore(flags); /* effective local_irq_disable() */
> I don't think we have many (any?) such patterns left, but it has to be
> checked first. If we have such cases then we'll have to use different
> primitives there.
> ====
>
> Linus replied:
> =====
> "popf" is fast for the "no changes to IF" case, and is a smaller
> instruction anyway.
> ====
>
>
> This basically shot down the proposal.
>
> But in my measurements POPF is not fast even in the case where the restored
> flags are not changed at all:
>
> mov $200*1000*1000, %eax
> pushf
> pop %rbx
> .balign 64
> loop: push %rbx
> popf
> dec %eax
> jnz loop
>
> # perf stat -r20 ./popf_1g
> 4,929,012,093 cycles # 3.412 GHz ( +- 0.02% )
> 835,721,371 instructions # 0.17 insn per cycle ( +- 0.02% )
> 1.446185359 seconds time elapsed ( +- 0.46% )
>
> If I replace POPF with a pop into an unused register, I get this:
You are comparing apples and bananas here.
The microcode path in the CPUs will probably throw away the read if
you don't actually use the register value.
Regards,
Christian.
>
> loop: push %rbx
> pop %rcx
> dec %eax
> jnz loop
>
> 209,247,645 cycles # 3.209 GHz ( +- 0.11% )
> 801,898,016 instructions # 3.83 insn per cycle ( +- 0.00% )
> 0.066210725 seconds time elapsed ( +- 0.59% )
>
>
> IOW, POPF takes at least 6 cycles.
>
>
> Linus does have a point that a "test+branch+STI" may end up not being a
> clear win because of the branch.
>
> But the need to restore the IF flag exists not only in Linux, but in any OS
> running on x86: they all have some sort of spinlock.
>
> The addition of a POPF instruction variant which looks only at the IF bit
> and changes only that bit in EFLAGS may be a good idea for all OSes.
>
> I propose that we ask Intel / AMD to do that.
>
> Maybe via the "ignored prefix" trick which was used when the LZCNT insn was
> introduced as a REPZ-prefixed BSR?
> Currently, the REPZ POPF (f3 9d) insn does execute. Redefine this opcode as
> POPF_IF, and the same kernel will work on old and new CPUs.
>
> CC'ing some @intel and @amd emails...
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-17 17:20 RFC: Petition Intel/AMD to add POPF_IF insn Denys Vlasenko
2016-08-17 17:30 ` Christian König
@ 2016-08-17 17:32 ` Linus Torvalds
2016-08-17 18:41 ` Denys Vlasenko
1 sibling, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 17:32 UTC (permalink / raw)
To: Denys Vlasenko
Cc: Ingo Molnar, Andy Lutomirski, Dan Williams, Alex Deucher,
Rafael J. Wysocki, Christian König, Vinod Koul,
Johannes Berg, Sara Sharon, Adrian Hunter,
the arch/x86 maintainers, LKML
On Wed, Aug 17, 2016 at 10:20 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> Last year, a proposal was floated to avoid costly POPF.
> In particular, each spin_unlock_irqrestore() does it,
> a rather common operation.
Quite frankly, it takes so long for hw features to arrive that I don't think
it's worth it for something like this.
I'd rather see people play around with just the conditional "sti"
model instead and see how well that works. *Particularly* if we inline
that part of spin_unlock_irqrestore(), the branch prediction logic
may work very well. And even if not, it might work better than popf.
But somebody would need to run the tests on an actual load
(microbenchmarks wouldn't work, because they wouldn't show the
different branch prediction cases).
PowerPC has been using "just do IF in software", which is also an
alternative, but it's much more complicated (you take the interrupt,
notice that the sw disable flag is set, don't actually run the irq, but
set a flag, disable irqs for _real_, and return. Then you replay the
irq when enabling the software irq flag again).
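A rough C model of that software-IF scheme (a sketch of the idea only, not
the actual powerpc code; the flags would be per-CPU in a real implementation,
and replay_irq()/run_handler() are hypothetical helpers):

void replay_irq(void);          /* hypothetical: re-inject the deferred irq */
void run_handler(void);         /* hypothetical: the real interrupt handler */

static int soft_irq_disabled;   /* "interrupts disabled", in software only */
static int irq_pending;         /* an irq arrived while soft-disabled */

void soft_irq_disable(void)
{
        soft_irq_disabled = 1;  /* no cli: hw interrupts stay enabled */
}

void soft_irq_enable(void)
{
        soft_irq_disabled = 0;
        if (irq_pending) {      /* replay what was deferred meanwhile */
                irq_pending = 0;
                replay_irq();
        }
}

void hw_irq_entry(void)         /* hw interrupt entry, irqs now off for real */
{
        if (soft_irq_disabled) {
                irq_pending = 1; /* note it, keep hw irqs off, just return */
                return;
        }
        run_handler();
}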
Linus
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-17 17:30 ` Christian König
@ 2016-08-17 18:34 ` Denys Vlasenko
0 siblings, 0 replies; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-17 18:34 UTC (permalink / raw)
To: Christian König, Ingo Molnar, Andy Lutomirski, Dan Williams,
Alex Deucher, Rafael J. Wysocki, Vinod Koul, Johannes Berg,
Sara Sharon, Adrian Hunter, Linus Torvalds
Cc: x86, LKML
On 08/17/2016 07:30 PM, Christian König wrote:
>> But in my measurements POPF is not fast even in the case where restored
>> flags are not changed at all:
>>
>> mov $200*1000*1000, %eax
>> pushf
>> pop %rbx
>> .balign 64
>> loop: push %rbx
>> popf
>> dec %eax
>> jnz loop
>>
>> # perf stat -r20 ./popf_1g
>> 4,929,012,093 cycles # 3.412 GHz ( +- 0.02% )
>> 835,721,371 instructions # 0.17 insn per cycle ( +- 0.02% )
>> 1.446185359 seconds time elapsed ( +- 0.46% )
>>
>> If I replace POPF with a pop into an unused register, I get this:
>
> You are comparing apples and bananas here.
Yes, I know. A pop into a register here is basically free.
I'd also add an STI and measure how long it takes to
enable interrupts, but unfortunately STI throws a #GP
at CPL 3.
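One workaround, which comes up later in this thread, is to raise IOPL first;
a minimal sketch, which needs root:

/* With IOPL raised to 3, sti/cli no longer #GP at CPL 3. */
#include <sys/io.h>

int main(void)
{
        if (iopl(3))            /* raise IOPL; requires root */
                return 1;
        asm volatile("sti"      /* would #GP without iopl(3) */
                     : : : "memory");
        return 0;
}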
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-17 17:32 ` Linus Torvalds
@ 2016-08-17 18:41 ` Denys Vlasenko
[not found] ` <CA+55aFwmxwQBkyDjombS4cy1q=a_buhmDjDnaa8rdC8ZDaDYEA@mail.gmail.com>
0 siblings, 1 reply; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-17 18:41 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Andy Lutomirski, Dan Williams, Alex Deucher,
Rafael J. Wysocki, Christian König, Vinod Koul,
Johannes Berg, Sara Sharon, Adrian Hunter,
the arch/x86 maintainers, LKML
On 08/17/2016 07:32 PM, Linus Torvalds wrote:
> On Wed, Aug 17, 2016 at 10:20 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>> Last year, a proposal was floated to avoid costly POPF.
>> In particular, each spin_unlock_irqrestore() does it,
>> a rather common operation.
>
> Quite frankly, it takes so long for hw features to arrive that I don't think
> it's worth it for something like this.
True, it will take some ~5 years for new hardware to become
widely used.
OTOH 5 years will inevitably pass, just like it did for 32-bit Linux
userspace, which is now commonly compiled for the "i686" instruction set:
enough time has passed that you can safely assume
most people run at least a Pentium Pro-level CPU, with CMOV, CMPXCHG,
etc.
If the new instruction is implemented with the "REPZ POPF" encoding,
no recompilation or alternatives will be needed for the new kernel
to run on an old CPU. On an old CPU, the entire EFLAGS will be restored;
on a new one, only EFLAGS.IF.
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
[not found] ` <CA+55aFwmxwQBkyDjombS4cy1q=a_buhmDjDnaa8rdC8ZDaDYEA@mail.gmail.com>
@ 2016-08-17 19:13 ` Andy Lutomirski
2016-08-17 19:26 ` Denys Vlasenko
2016-08-17 19:29 ` Linus Torvalds
0 siblings, 2 replies; 25+ messages in thread
From: Andy Lutomirski @ 2016-08-17 19:13 UTC (permalink / raw)
To: Linus Torvalds
Cc: Denys Vlasenko, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
Adrian Hunter
On Wed, Aug 17, 2016 at 12:01 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Aug 17, 2016 11:41 AM, "Denys Vlasenko" <dvlasenk@redhat.com> wrote:
>>
>> OTOH 5 years will inevitably pass.
>
> Yes. But in five years, maybe we'll have a popf that is faster anyway.
>
> I'd actually prefer that in the end. The problem with popf right now seems
> to be mainly that it's effectively serializing and does stupid things in
> microcode. It doesn't have to be that way. It could actually do much better,
> but it hasn't been a high enough priority for Intel.
>
It wouldn't surprise me if that were easier said than done. popf
potentially changes AC, and AC affects address translation. popf also
potentially changes IOPL, and I don't know whether Intel chips track
IOPL in a way that lets them find all the dependent instructions
without serializing. But maybe their pipeline is fancy enough.
Personally, I still expect that a simple branch-and-sti is the way to
go. It wouldn't shock me if even a mispredicted branch and STI is
faster than POPF.
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-17 19:13 ` Andy Lutomirski
@ 2016-08-17 19:26 ` Denys Vlasenko
2016-08-17 19:32 ` Linus Torvalds
2016-08-17 19:29 ` Linus Torvalds
1 sibling, 1 reply; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-17 19:26 UTC (permalink / raw)
To: Andy Lutomirski, Linus Torvalds
Cc: Sara Sharon, Dan Williams, Christian König, Vinod Koul,
Alex Deucher, Johannes Berg, Rafael J. Wysocki, Andy Lutomirski,
the arch/x86 maintainers, Ingo Molnar, LKML, Adrian Hunter
On 08/17/2016 09:13 PM, Andy Lutomirski wrote:
> On Wed, Aug 17, 2016 at 12:01 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Aug 17, 2016 11:41 AM, "Denys Vlasenko" <dvlasenk@redhat.com> wrote:
>>>
>>> OTOH 5 years will inevitably pass.
>>
>> Yes. But in five years, maybe we'll have a popf that is faster anyway.
>>
>> I'd actually prefer that in the end. The problem with popf right now seems
>> to be mainly that it's effectively serializing and does stupid things in
>> microcode. It doesn't have to be that way. It could actually do much better,
>> but it hasn't been a high enough priority for Intel.
>>
>
> It wouldn't surprise me if that were easier said than done. popf
> potentially changes AC, and AC affects address translation. popf also
> potentially changes IOPL, and I don't know whether Intel chips track
> IOPL in a way that lets them find all the dependent instructions
> without serializing. But maybe their pipeline is fancy enough.
Exactly. And more:
POPF potentially changes TF (and it works even in CPL3).
POPF changes DF - must serialize versus string insns.
POPF changes NT - must serialize versus the IRET insn.
POPF changes VIF, from a different bit in the popped value,
and under rather complex conditions.
Intel's documentation has pseudo-code for the instructions.
For POPF, that pseudo-code is two pages long...
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-17 19:13 ` Andy Lutomirski
2016-08-17 19:26 ` Denys Vlasenko
@ 2016-08-17 19:29 ` Linus Torvalds
1 sibling, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 19:29 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Denys Vlasenko, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
Adrian Hunter
On Wed, Aug 17, 2016 at 12:13 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> It wouldn't surprise me if that were easier said than done. popf
> potentially changes AC, and AC affects address translation.
Right. A lot of magical flags are in eflags, and popf *may* change them.
But that's why it's slow today. The popf microcode probably
unconditionally serializes things exactly because "things may change".
And the interrupt flag actually *is* pretty special too, in some
respects more so than AC (because it needs to serialize with any
pending interrupts).
And the microcode probably already has code that says "let's handle
the easy case quickly", where the easy case is "only the arithmetic
flags changed".
The arithmetic flags are special anyway, because they aren't actually
physically in the same register any more, but are separately tracked
and renamed etc.
But I'm sure Intel already treats IF specially in microcode, because
IF is really special in other ways (VIF handling in vm86 mode, but
also modern virtualization).
Yes, intel people tend to be afraid of the microcode stuff, and
generally don't touch it. But the good news about popf is that it isn't
a serializing instruction, so it really *could* be optimized pretty
aggressively. And it does have to check for pending interrupts (and
*clearing* IF in particular needs to make sure that there isn't some
pending interrupt that the CPU is about to react to).
So it's not trivial. But the "enable interrupts" case for popf is
actually easier for hardware than the disable case from a serializing
standpoint, and I suspect the ucode doesn't take advantage of that
right now, and it's all just fairly unoptimized microcode.
Linus
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-17 19:26 ` Denys Vlasenko
@ 2016-08-17 19:32 ` Linus Torvalds
2016-08-17 19:35 ` Denys Vlasenko
2016-08-17 19:37 ` Linus Torvalds
0 siblings, 2 replies; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 19:32 UTC (permalink / raw)
To: Denys Vlasenko
Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
Adrian Hunter
On Wed, Aug 17, 2016 at 12:26 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>
> Exactly. And more:
All of which is ENTIRELY IRRELEVANT.
The 2-page pseudo-code is about behavior. It's trivial to
short-circuit. There are obvious optimizations, which involve just
making the rare cases go slow and have a trivial "if those bits didn't
change, don't go through the expense of testing them".
The problem is that IF is almost certainly right now in that rare case
mask, and so popf is stupidly slow for IF.
That in no way means that it *has* to be slow for IF, and more
importantly, the other flags don't even come into play. Forget about
them.
Linus
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-17 19:32 ` Linus Torvalds
@ 2016-08-17 19:35 ` Denys Vlasenko
2016-08-17 19:54 ` Linus Torvalds
2016-08-17 19:37 ` Linus Torvalds
1 sibling, 1 reply; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-17 19:35 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
Adrian Hunter
On 08/17/2016 09:32 PM, Linus Torvalds wrote:
> On Wed, Aug 17, 2016 at 12:26 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>
>> Exactly. And more:
>
> All of which is ENTIRELY IRRELEVANT.
>
> The 2-page pseudo-code is about behavior. It's trivial to
> short-circuit. There are obvious optimizations, which involve just
> making the rare cases go slow and have a trivial "if those bits didn't
> change, don't go through the expense of testing them".
>
> The problem is that IF is almost certainly right now in that rare case
> mask, and so popf is stupidly slow for IF.
I ran the test, see the first email in the thread.
Experimentally, POPF is stupidly slow _always_. 6 cycles
even if none of the "scary" flags are changed.
Either:
* its microcode has no rare-case mask, or
* its microcode is so slow that even the fast path is slow,
  and the slow path is worse.
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-17 19:32 ` Linus Torvalds
2016-08-17 19:35 ` Denys Vlasenko
@ 2016-08-17 19:37 ` Linus Torvalds
2016-08-17 21:26 ` Linus Torvalds
1 sibling, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 19:37 UTC (permalink / raw)
To: Denys Vlasenko
Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
Adrian Hunter
On Wed, Aug 17, 2016 at 12:32 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The 2-page pseudo-code is about behavior. It's trivial to
> short-circuit. There are obvious optimizations, which involve just
> making the rare cases go slow and have a trivial "if those bits didn't
> change, don't go through the expense of testing them".
That said, the first thing we should test is to just see how the
"let's just optimize this in software" thing works.
Replace the "popf" with "if (val & X86_EFLAGS_IF) local_irq_enable();"
and see how that works out. Play with inlining it or not, and see if
the branch predictor matters.
We may simply not need popf optimizations.
Linus
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-17 19:35 ` Denys Vlasenko
@ 2016-08-17 19:54 ` Linus Torvalds
0 siblings, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 19:54 UTC (permalink / raw)
To: Denys Vlasenko
Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
Adrian Hunter
On Wed, Aug 17, 2016 at 12:35 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>
> Experimentally, POPF is stupidly slow _always_. 6 cycles
> even if none of the "scary" flags are changed.
6 cycles is nothing.
That's basically the overhead of "oops, I need to use the microcode sequencer".
One issue is that the intel decoders (AMD too, for that matter) can
only generate a fairly small set of uops for any instruction. Some
instructions are really trivial to decode (popf definitely falls under
that heading), but are more than just a couple of uops, so you end up
having to use the uop sequencer logic.
According to Agner Fog's tables, there are one or two
micro-architectures that actually do the simple "popf" case with a
single cycle throughput, but that's the very unusual case.
You can't even fit the "pop a value, see if only the arithmetic flags
changed, trap to microcode otherwise" into the three or four uops that
the "complex decoder" can generate directly.
And that "fall back to the uop sequencer engine" tends to just always
cause several cycles regardless. So yes, microcode tends to be slow
even for what would otherwise be trivial operations. You'd think Intel
could do as well as they do for the L0 uop cache, but afaik they
don't.
Anyway, six cycles is fast. I'd *love* for popf to actually be just 6
cycles when IF changes. It's much much worse iirc (although honestly,
I haven't timed it in years - it's much easier to time just the
arithmetic flag changes).
It used to be more like a hundred cycles on Prescott.
Linus
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-17 19:37 ` Linus Torvalds
@ 2016-08-17 21:26 ` Linus Torvalds
2016-08-17 21:35 ` Linus Torvalds
0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 21:26 UTC (permalink / raw)
To: Denys Vlasenko
Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
Adrian Hunter
[-- Attachment #1: Type: text/plain, Size: 1414 bytes --]
On Wed, Aug 17, 2016 at 12:37 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Replace the "popf" with "if (val & X86_EFLAGS_IF) local_irq_enable();"
> and see how that works out. Play with inlining it or not, and see if
> the branch predictor matters.
.. actually, thinking a bit more about it, I really don't think the
branch predictor will even matter.
We sure as hell shouldn't have nested irq-safe interrupts in paths
that matter from a performance angle, so the normal case for
spin_unlock_irqrestore() should be to enable interrupts again.
And if interrupts are disabled because the caller is actually in
interrupt context, I don't think the branch prediction is going to
matter, compared to the irq overhead.
So test this trivial patch. It's ENTIRELY UNTESTED. It may be complete
crap and not even compile. But I did test it on
kernel/locking/spinlock.c, and the generated code is beautiful:
_raw_spin_unlock_irqrestore:
testl $512, %esi #, flags
movb $0, (%rdi) #, MEM[(volatile __u8 *)lock_2(D)]
je .L2
sti
.L2:
ret
so maybe the silly popf has always just been stupid.
Of course, if somebody uses native_restore_fl() to actually *disable*
interrupts (when they weren't already disabled), then this untested
patch will just not work. But why would you do something so stupid?
Famous last words...
Linus
[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1127 bytes --]
arch/x86/include/asm/irqflags.h | 26 ++++++++++++--------------
1 file changed, 12 insertions(+), 14 deletions(-)
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index b77f5edb03b0..76c4ebfab0be 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -8,6 +8,16 @@
* Interrupt control:
*/
+static inline void native_irq_disable(void)
+{
+ asm volatile("cli": : :"memory");
+}
+
+static inline void native_irq_enable(void)
+{
+ asm volatile("sti": : :"memory");
+}
+
static inline unsigned long native_save_fl(void)
{
unsigned long flags;
@@ -28,20 +38,8 @@ static inline unsigned long native_save_fl(void)
static inline void native_restore_fl(unsigned long flags)
{
- asm volatile("push %0 ; popf"
- : /* no output */
- :"g" (flags)
- :"memory", "cc");
-}
-
-static inline void native_irq_disable(void)
-{
- asm volatile("cli": : :"memory");
-}
-
-static inline void native_irq_enable(void)
-{
- asm volatile("sti": : :"memory");
+ if (flags & X86_EFLAGS_IF)
+ native_irq_enable();
}
static inline void native_safe_halt(void)
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-17 21:26 ` Linus Torvalds
@ 2016-08-17 21:35 ` Linus Torvalds
2016-08-17 21:43 ` Andy Lutomirski
2016-08-18 9:21 ` Denys Vlasenko
0 siblings, 2 replies; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 21:35 UTC (permalink / raw)
To: Denys Vlasenko
Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
Adrian Hunter
On Wed, Aug 17, 2016 at 2:26 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Of course, if somebody uses native_restore_fl() to actually *disable*
> interrupts (when they weren't already disabled), then this untested
> patch will just not work. But why would you do something so stupid?
> Famous last words...
Looking around, the vsmp code actually uses "native_restore_fl()" to
not just set the interrupt flag, but AC as well.
And the PV spinlock case has that "push;popf" sequence encoded in an alternate.
So that trivial patch may (or may not - still not tested) work for
some quick testing, but needs more effort for any *real* use.
Linus
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-17 21:35 ` Linus Torvalds
@ 2016-08-17 21:43 ` Andy Lutomirski
2016-08-17 21:48 ` Linus Torvalds
2016-08-18 9:21 ` Denys Vlasenko
1 sibling, 1 reply; 25+ messages in thread
From: Andy Lutomirski @ 2016-08-17 21:43 UTC (permalink / raw)
To: Linus Torvalds
Cc: Denys Vlasenko, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
Adrian Hunter
On Wed, Aug 17, 2016 at 2:35 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Aug 17, 2016 at 2:26 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Of course, if somebody uses native_restore_fl() to actually *disable*
>> interrupts (when they weren't already disabled), then this untested
>> patch will just not work. But why would you do something so stupid?
>> Famous last words...
>
> Looking around, the vsmp code actually uses "native_restore_fl()" to
> not just set the interrupt flag, but AC as well.
>
> And the PV spinlock case has that "push;popf" sequence encoded in an alternate.
>
> So that trivial patch may (or may not - still not tested) work for
> some quick testing, but needs more effort for any *real* use.
>
It shouldn't be *too* bad, since xen_restore_fl only affects "IF".
And even if native_restore_fl needs to be able to turn IRQs off as
well as on, we can just do:
if (likely(flags & X86_EFLAGS_IF))
sti();
else
cli();
at some cost to code size but hopefully little to no runtime cost for
the sane cases.
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-17 21:43 ` Andy Lutomirski
@ 2016-08-17 21:48 ` Linus Torvalds
2016-08-18 13:26 ` Denys Vlasenko
0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 21:48 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Denys Vlasenko, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
Adrian Hunter
On Wed, Aug 17, 2016 at 2:43 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> It shouldn't be *too* bad, since xen_restore_fl only affects "IF".
> And even if native_restore_fl needs to be able to turn IRQs off as
> well as on, we can just do:
>
> if (likely(flags & X86_EFLAGS_IF))
> sti();
> else
> cli();
>
> at some cost to code size but hopefully little to no runtime cost for
> the sane cases.
No, that would be horrible for the case where we had an irq spinlock
in an irq-region. Then we'd have a pointless "cli" there.
So I'd rather just make sure that only the spinlock code actually uses
native_restore_fl().
And yes, the patch works at least minimally, since I'm writing this
with a kernel compiled with that.
Of course, somebody really should do timings on modern CPUs (in cpl0,
comparing a native_restore_fl() that enables interrupts with a popf).
Linus
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-17 21:35 ` Linus Torvalds
2016-08-17 21:43 ` Andy Lutomirski
@ 2016-08-18 9:21 ` Denys Vlasenko
2016-08-18 12:18 ` Denys Vlasenko
1 sibling, 1 reply; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-18 9:21 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
Adrian Hunter
On 08/17/2016 11:35 PM, Linus Torvalds wrote:
> On Wed, Aug 17, 2016 at 2:26 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Of course, if somebody uses native_restore_fl() to actually *disable*
>> interrupts (when they weren't already disabled), then this untested
>>> patch will just not work. But why would you do something so stupid?
>> Famous last words...
>
> Looking around, the vsmp code actually uses "native_restore_fl()" to
> not just set the interrupt flag, but AC as well.
>
> And the PV spinlock case has that "push;popf" sequence encoded in an alternate.
>
> So that trivial patch may (or may not - still not tested) work for
> some quick testing, but needs more effort for any *real* use.
I'm going to test the attached patch.
The PV and debug maze is daunting... I discovered that Fedora kernels
are compiled with a set of .config options which results in a
spin_unlock_irqrestore that looks like this:
ffffffff818d0f40 <_raw_spin_unlock_irqrestore>:
ffffffff818d0f40: e8 1b 31 00 00 callq ffffffff818d4060 <__fentry__> ffffffff818d0f41: R_X86_64_PC32 __fentry__-0x4
ffffffff818d0f45: 55 push %rbp
ffffffff818d0f46: 48 89 e5 mov %rsp,%rbp
ffffffff818d0f49: 41 54 push %r12
ffffffff818d0f4b: 53 push %rbx
ffffffff818d0f4c: 48 8b 55 08 mov 0x8(%rbp),%rdx
ffffffff818d0f50: 49 89 fc mov %rdi,%r12
ffffffff818d0f53: 48 89 f3 mov %rsi,%rbx
ffffffff818d0f56: 48 83 c7 18 add $0x18,%rdi
ffffffff818d0f5a: be 01 00 00 00 mov $0x1,%esi
ffffffff818d0f5f: e8 8c b8 83 ff callq ffffffff8110c7f0 <lock_release> ffffffff818d0f60: R_X86_64_PC32 lock_release-0x4
ffffffff818d0f64: 4c 89 e7 mov %r12,%rdi
ffffffff818d0f67: e8 d4 fb 83 ff callq ffffffff81110b40 <do_raw_spin_unlock> ffffffff818d0f68: R_X86_64_PC32 do_raw_spin_unlock-0x4
ffffffff818d0f6c: f6 c7 02 test $0x2,%bh
ffffffff818d0f6f: 74 1b je ffffffff818d0f8c <_raw_spin_unlock_irqrestore+0x4c>
ffffffff818d0f71: e8 aa 9d 83 ff callq ffffffff8110ad20 <trace_hardirqs_on> ffffffff818d0f72: R_X86_64_PC32 trace_hardirqs_on-0x4
ffffffff818d0f76: 48 89 df mov %rbx,%rdi
ffffffff818d0f79: ff 14 25 48 32 e2 81 callq *0xffffffff81e23248 ffffffff818d0f7c: R_X86_64_32S pv_irq_ops+0x8
ffffffff818d0f80: 65 ff 0d c9 c4 73 7e decl %gs:0x7e73c4c9(%rip) # d450 <__preempt_count> ffffffff818d0f83: R_X86_64_PC32 __preempt_count-0x4
ffffffff818d0f87: 5b pop %rbx
ffffffff818d0f88: 41 5c pop %r12
ffffffff818d0f8a: 5d pop %rbp
ffffffff818d0f8b: c3 retq
ffffffff818d0f8c: 48 89 df mov %rbx,%rdi
ffffffff818d0f8f: ff 14 25 48 32 e2 81 callq *0xffffffff81e23248 ffffffff818d0f92: R_X86_64_32S pv_irq_ops+0x8
ffffffff818d0f96: e8 35 5b 83 ff callq ffffffff81106ad0 <trace_hardirqs_off> ffffffff818d0f97: R_X86_64_PC32 trace_hardirqs_off-0x4
ffffffff818d0f9b: eb e3 jmp ffffffff818d0f80 <_raw_spin_unlock_irqrestore+0x40>
ffffffff818d0f9d: 0f 1f 00 nopl (%rax)
Gawd... really Fedora? All this is needed?
Testing with _this_ is not going to show any differences, I need to weed
all the cruft out and test a fast, non-debug .config.
Changed it like this:
+# CONFIG_UPROBES is not set
+# CONFIG_SCHED_OMIT_FRAME_POINTER is not set
+# CONFIG_HYPERVISOR_GUEST is not set
+# CONFIG_SYS_HYPERVISOR is not set
+# CONFIG_FRAME_POINTER is not set
+# CONFIG_KMEMCHECK is not set
+# CONFIG_DEBUG_LOCK_ALLOC is not set
+# CONFIG_PROVE_LOCKING is not set
+# CONFIG_LOCK_STAT is not set
+# CONFIG_PROVE_RCU is not set
+# CONFIG_LATENCYTOP is not set
+# CONFIG_FTRACE is not set
+# CONFIG_BINARY_PRINTF is not set
Looks better (it's with the patch already):
00000000000000f0 <_raw_spin_unlock_irqrestore>:
f0: 53 push %rbx
f1: 48 89 f3 mov %rsi,%rbx
f4: e8 00 00 00 00 callq f9 <_raw_spin_unlock_irqrestore+0x9> f5: R_X86_64_PC32 do_raw_spin_unlock-0x4
f9: 80 e7 02 and $0x2,%bh
fc: 74 01 je ff <_raw_spin_unlock_irqrestore+0xf>
fe: fb sti
ff: 65 ff 0d 00 00 00 00 decl %gs:0x0(%rip) # 106 <_raw_spin_unlock_irqrestore+0x16> 102: R_X86_64_PC32 __preempt_count-0x4
106: 5b pop %rbx
107: c3 retq
This also raises questions, such as "why is do_raw_spin_unlock() not inlined here?"
Anyway... on to more kernel builds and testing.
Please take a look at the below patch. If it's obviously buggy, let me know.
diff -urp linux-4.7.1.org/arch/x86/include/asm/irqflags.h linux-4.7.1.obj/arch/x86/include/asm/irqflags.h
--- linux-4.7.1.org/arch/x86/include/asm/irqflags.h 2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/include/asm/irqflags.h 2016-08-18 10:01:09.514219644 +0200
@@ -44,6 +44,16 @@ static inline void native_irq_enable(voi
asm volatile("sti": : :"memory");
}
+/*
+ * This produces a "test; jz; sti" insn sequence.
+ * It's faster than "push reg; popf". If sti is skipped, it's much faster.
+ */
+static inline void native_cond_irq_enable(unsigned long flags)
+{
+ if (flags & X86_EFLAGS_IF)
+ native_irq_enable();
+}
+
static inline void native_safe_halt(void)
{
asm volatile("sti; hlt": : :"memory");
@@ -69,7 +79,8 @@ static inline notrace unsigned long arch
static inline notrace void arch_local_irq_restore(unsigned long flags)
{
- native_restore_fl(flags);
+ if (flags & X86_EFLAGS_IF)
+ native_irq_enable();
}
static inline notrace void arch_local_irq_disable(void)
diff -urp linux-4.7.1.org/arch/x86/kernel/paravirt.c linux-4.7.1.obj/arch/x86/kernel/paravirt.c
--- linux-4.7.1.org/arch/x86/kernel/paravirt.c 2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/kernel/paravirt.c 2016-08-18 10:01:24.797155109 +0200
@@ -313,7 +313,7 @@ struct pv_time_ops pv_time_ops = {
__visible struct pv_irq_ops pv_irq_ops = {
.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
- .restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
+ .restore_fl = __PV_IS_CALLEE_SAVE(native_cond_irq_enable),
.irq_disable = __PV_IS_CALLEE_SAVE(native_irq_disable),
.irq_enable = __PV_IS_CALLEE_SAVE(native_irq_enable),
.safe_halt = native_safe_halt,
diff -urp linux-4.7.1.org/arch/x86/kernel/paravirt_patch_32.c linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_32.c
--- linux-4.7.1.org/arch/x86/kernel/paravirt_patch_32.c 2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_32.c 2016-08-18 04:43:19.388102755 +0200
@@ -2,7 +2,7 @@
DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
-DEF_NATIVE(pv_irq_ops, restore_fl, "push %eax; popf");
+DEF_NATIVE(pv_irq_ops, restore_fl, "testb $((1<<9)>>8),%ah; jz 1f; sti; 1: ;");
DEF_NATIVE(pv_irq_ops, save_fl, "pushf; pop %eax");
DEF_NATIVE(pv_cpu_ops, iret, "iret");
DEF_NATIVE(pv_mmu_ops, read_cr2, "mov %cr2, %eax");
diff -urp linux-4.7.1.org/arch/x86/kernel/paravirt_patch_64.c linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_64.c
--- linux-4.7.1.org/arch/x86/kernel/paravirt_patch_64.c 2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_64.c 2016-08-18 04:42:56.364545509 +0200
@@ -4,7 +4,7 @@
DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
-DEF_NATIVE(pv_irq_ops, restore_fl, "pushq %rdi; popfq");
+DEF_NATIVE(pv_irq_ops, restore_fl, "testw $(1<<9),%di; jz 1f; sti; 1: ;");
DEF_NATIVE(pv_irq_ops, save_fl, "pushfq; popq %rax");
DEF_NATIVE(pv_mmu_ops, read_cr2, "movq %cr2, %rax");
DEF_NATIVE(pv_mmu_ops, read_cr3, "movq %cr3, %rax");
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-18 9:21 ` Denys Vlasenko
@ 2016-08-18 12:18 ` Denys Vlasenko
2016-08-18 17:22 ` Paolo Bonzini
0 siblings, 1 reply; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-18 12:18 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
Adrian Hunter
[-- Attachment #1: Type: text/plain, Size: 8089 bytes --]
On 08/18/2016 11:21 AM, Denys Vlasenko wrote:
>>> Of course, if somebody uses native_restore_fl() to actually *disable*
>>> interrupts (when they weren't already disabled), then this untested
>>> patch will just not work. But why would you do somethign so stupid?
>>> Famous last words...
>>
>> Looking around, the vsmp code actually uses "native_restore_fl()" to
>> not just set the interrupt flag, but AC as well.
>>
>> And the PV spinlock case has that "push;popf" sequence encoded in an alternate.
>>
>> So that trivial patch may (or may not - still not tested) work for
>> some quick testing, but needs more effort for any *real* use.
>
> I'm going to test the attached patch.
...
>
> +# CONFIG_UPROBES is not set
> +# CONFIG_SCHED_OMIT_FRAME_POINTER is not set
> +# CONFIG_HYPERVISOR_GUEST is not set
> +# CONFIG_SYS_HYPERVISOR is not set
> +# CONFIG_FRAME_POINTER is not set
> +# CONFIG_KMEMCHECK is not set
> +# CONFIG_DEBUG_LOCK_ALLOC is not set
> +# CONFIG_PROVE_LOCKING is not set
> +# CONFIG_LOCK_STAT is not set
> +# CONFIG_PROVE_RCU is not set
> +# CONFIG_LATENCYTOP is not set
> +# CONFIG_FTRACE is not set
> +# CONFIG_BINARY_PRINTF is not set
I also need !CONFIG_DEBUG_SPINLOCK; then the unpatched irqrestore is:
ffffffff817115a0 <_raw_spin_unlock_irqrestore>:
ffffffff817115a0: c6 07 00 movb $0x0,(%rdi)
ffffffff817115a3: 56 push %rsi
ffffffff817115a4: 9d popfq
ffffffff817115a5: 65 ff 0d e4 ad 8f 7e decl %gs:__preempt_count
ffffffff817115ac: c3 retq
ffffffff817115ad: 0f 1f 00 nopl (%rax)
The patched one is:
ffffffff81711660 <_raw_spin_unlock_irqrestore>:
ffffffff81711660: f7 c6 00 02 00 00 test $0x200,%esi
ffffffff81711666: c6 07 00 movb $0x0,(%rdi)
ffffffff81711669: 74 01 je ffffffff8171166c <_raw_spin_unlock_irqrestore+0xc>
ffffffff8171166b: fb sti
ffffffff8171166c: 65 ff 0d 1d ad 8f 7e decl %gs:__preempt_count
ffffffff81711673: c3 retq
ffffffff81711674: 66 90 xchg %ax,%ax
ffffffff81711676: 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:0x0(%rax,%rax,1)
Ran the following twice on a quiesced machine:
taskset 1 perf stat -r60 perf bench sched messaging
taskset 1 perf stat -r60 perf bench sched pipe
Out of these four runs, both "perf bench sched pipe" runs show improvements:
- 2648.279829 task-clock (msec) # 1.000 CPUs utilized ( +- 0.24% )
+ 2483.143469 task-clock (msec) # 0.998 CPUs utilized ( +- 0.31% )
- 2,000,002 context-switches # 0.755 M/sec ( +- 0.00% )
+ 2,000,013 context-switches # 0.805 M/sec ( +- 0.00% )
- 547 page-faults # 0.206 K/sec ( +- 0.04% )
+ 546 page-faults # 0.220 K/sec ( +- 0.04% )
- 8,723,284,926 cycles # 3.294 GHz ( +- 0.06% )
+ 8,157,949,449 cycles # 3.285 GHz ( +- 0.07% )
- 12,286,937,344 instructions # 1.41 insn per cycle ( +- 0.03% )
+ 12,255,616,405 instructions # 1.50 insn per cycle ( +- 0.05% )
- 2,588,839,023 branches # 977.555 M/sec ( +- 0.02% )
+ 2,599,319,615 branches # 1046.786 M/sec ( +- 0.04% )
- 3,620,273 branch-misses # 0.14% of all branches ( +- 0.67% )
+ 3,577,771 branch-misses # 0.14% of all branches ( +- 0.69% )
- 2.648799072 seconds time elapsed ( +- 0.24% )
+ 2.487452268 seconds time elapsed ( +- 0.31% )
Good, we run more insns/cycle, as expected. However, there are a bit more branches.
But of the two "perf bench sched messaging" runs, one was slower on the patched
kernel, and perf shows why: more branches, and also a larger branch-miss percentage:
- 690.008697 task-clock (msec) # 0.996 CPUs utilized ( +- 0.45% )
+ 699.526509 task-clock (msec) # 0.994 CPUs utilized ( +- 0.28% )
- 26,796 context-switches # 0.039 M/sec ( +- 8.31% )
+ 29,088 context-switches # 0.042 M/sec ( +- 6.62% )
- 35,477 page-faults # 0.051 M/sec ( +- 0.11% )
+ 35,504 page-faults # 0.051 M/sec ( +- 0.14% )
- 2,157,701,609 cycles # 3.127 GHz ( +- 0.35% )
+ 2,143,781,407 cycles # 3.065 GHz ( +- 0.25% )
- 3,115,212,672 instructions # 1.44 insn per cycle ( +- 0.28% )
+ 3,253,499,549 instructions # 1.52 insn per cycle ( +- 0.19% )
- 661,888,593 branches # 959.247 M/sec ( +- 0.36% )
+ 707,862,655 branches # 1011.917 M/sec ( +- 0.20% )
- 2,793,203 branch-misses # 0.42% of all branches ( +- 1.04% )
+ 3,453,397 branch-misses # 0.49% of all branches ( +- 0.32% )
- 0.693004918 seconds time elapsed ( +- 0.45% )
+ 0.703630988 seconds time elapsed ( +- 0.27% )
This tipped the scales, and despite higher insns/cycle, run time is worse.
The other "perf bench sched messaging" run was more lucky:
- 706.944245 task-clock (msec) # 0.995 CPUs utilized ( +- 0.32% )
+ 687.038856 task-clock (msec) # 0.993 CPUs utilized ( +- 0.31% )
- 23,489 context-switches # 0.033 M/sec ( +- 7.02% )
+ 26,644 context-switches # 0.039 M/sec ( +- 7.46% )
- 35,360 page-faults # 0.050 M/sec ( +- 0.12% )
+ 35,417 page-faults # 0.052 M/sec ( +- 0.13% )
- 2,183,639,816 cycles # 3.089 GHz ( +- 0.35% )
+ 2,123,086,753 cycles # 3.090 GHz ( +- 0.27% )
- 3,131,362,238 instructions # 1.43 insn per cycle ( +- 0.19% )
+ 3,236,613,433 instructions # 1.52 insn per cycle ( +- 0.19% )
- 667,874,319 branches # 944.734 M/sec ( +- 0.21% )
+ 703,677,908 branches # 1024.219 M/sec ( +- 0.20% )
- 2,859,521 branch-misses # 0.43% of all branches ( +- 0.56% )
+ 3,462,063 branch-misses # 0.49% of all branches ( +- 0.33% )
- 0.710738536 seconds time elapsed ( +- 0.32% )
+ 0.691908533 seconds time elapsed ( +- 0.31% )
However, it still had more branches (~5% more), and a worse branch-miss percentage.
The patch seems to work. It also does not bloat the kernel:
text data bss dec hex filename
8199556 5026784 2924544 16150884 f67164 vmlinux
8199897 5026784 2924544 16151225 f672b9 vmlinux.patched
However, a "conditional CLI/STI from r/m" insn could be better still.
The patch is attached.
[-- Attachment #2: z.diff --]
[-- Type: text/x-patch, Size: 3076 bytes --]
diff -urp linux-4.7.1.org/arch/x86/include/asm/irqflags.h linux-4.7.1.obj/arch/x86/include/asm/irqflags.h
--- linux-4.7.1.org/arch/x86/include/asm/irqflags.h 2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/include/asm/irqflags.h 2016-08-18 10:01:09.514219644 +0200
@@ -44,6 +44,16 @@ static inline void native_irq_enable(voi
asm volatile("sti": : :"memory");
}
+/*
+ * This produces a "test; jz; sti" insn sequence.
+ * It's faster than "push reg; popf". If sti is skipped, it's much faster.
+ */
+static inline void native_cond_irq_enable(unsigned long flags)
+{
+ if (flags & X86_EFLAGS_IF)
+ native_irq_enable();
+}
+
static inline void native_safe_halt(void)
{
asm volatile("sti; hlt": : :"memory");
@@ -69,7 +79,8 @@ static inline notrace unsigned long arch
static inline notrace void arch_local_irq_restore(unsigned long flags)
{
- native_restore_fl(flags);
+ if (flags & X86_EFLAGS_IF)
+ native_irq_enable();
}
static inline notrace void arch_local_irq_disable(void)
diff -urp linux-4.7.1.org/arch/x86/kernel/paravirt.c linux-4.7.1.obj/arch/x86/kernel/paravirt.c
--- linux-4.7.1.org/arch/x86/kernel/paravirt.c 2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/kernel/paravirt.c 2016-08-18 10:01:24.797155109 +0200
@@ -313,7 +313,7 @@ struct pv_time_ops pv_time_ops = {
__visible struct pv_irq_ops pv_irq_ops = {
.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
- .restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
+ .restore_fl = __PV_IS_CALLEE_SAVE(native_cond_irq_enable),
.irq_disable = __PV_IS_CALLEE_SAVE(native_irq_disable),
.irq_enable = __PV_IS_CALLEE_SAVE(native_irq_enable),
.safe_halt = native_safe_halt,
diff -urp linux-4.7.1.org/arch/x86/kernel/paravirt_patch_32.c linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_32.c
--- linux-4.7.1.org/arch/x86/kernel/paravirt_patch_32.c 2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_32.c 2016-08-18 04:43:19.388102755 +0200
@@ -2,7 +2,7 @@
DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
-DEF_NATIVE(pv_irq_ops, restore_fl, "push %eax; popf");
+DEF_NATIVE(pv_irq_ops, restore_fl, "testb $((1<<9)>>8),%ah; jz 1f; sti; 1: ;");
DEF_NATIVE(pv_irq_ops, save_fl, "pushf; pop %eax");
DEF_NATIVE(pv_cpu_ops, iret, "iret");
DEF_NATIVE(pv_mmu_ops, read_cr2, "mov %cr2, %eax");
diff -urp linux-4.7.1.org/arch/x86/kernel/paravirt_patch_64.c linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_64.c
--- linux-4.7.1.org/arch/x86/kernel/paravirt_patch_64.c 2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_64.c 2016-08-18 04:42:56.364545509 +0200
@@ -4,7 +4,7 @@
DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
-DEF_NATIVE(pv_irq_ops, restore_fl, "pushq %rdi; popfq");
+DEF_NATIVE(pv_irq_ops, restore_fl, "testw $(1<<9),%di; jz 1f; sti; 1: ;");
DEF_NATIVE(pv_irq_ops, save_fl, "pushfq; popq %rax");
DEF_NATIVE(pv_mmu_ops, read_cr2, "movq %cr2, %rax");
DEF_NATIVE(pv_mmu_ops, read_cr3, "movq %cr3, %rax");
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-17 21:48 ` Linus Torvalds
@ 2016-08-18 13:26 ` Denys Vlasenko
2016-08-18 17:24 ` Linus Torvalds
0 siblings, 1 reply; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-18 13:26 UTC (permalink / raw)
To: Linus Torvalds, Andy Lutomirski
Cc: Sara Sharon, Dan Williams, Christian König, Vinod Koul,
Alex Deucher, Johannes Berg, Rafael J. Wysocki, Andy Lutomirski,
the arch/x86 maintainers, Ingo Molnar, LKML, Adrian Hunter
> Of course, somebody really should do timings on modern CPUs (in cpl0,
> comparing a native_restore_fl() that enables interrupts with a popf).
I didn't do CPL0 tests yet. I realized that cli/sti can be tested in userspace
if we set iopl(3) first.
Surprisingly, STI is slower than CLI. A loop with 27 CLIs and one STI
converges to about ~0.5 insn/cycle:
# compile with: gcc -nostartfiles -nostdlib
_start: .globl _start
mov $172, %eax #iopl
mov $3, %edi
syscall
mov $200*1000*1000, %eax
.balign 64
loop:
cli;cli;cli;cli
cli;cli;cli;cli
cli;cli;cli;cli
cli;cli;cli;cli
cli;cli;cli;cli
cli;cli;cli;cli
cli;cli;cli;sti
dec %eax
jnz loop
mov $231, %eax #exit_group
syscall
perf stat:
6,015,787,968 instructions # 0.52 insn per cycle
3.355474199 seconds time elapsed
With all CLIs replaced by STIs, it's ~0.25 insn/cycle:
6,030,530,328 instructions # 0.27 insn per cycle
6.547200322 seconds time elapsed
A POPF which needs to enable interrupts is not measurably faster than
one which does not change EFLAGS.IF:
Loop with:
400158: fa cli
400159: 53 push %rbx #saved eflags with if=1
40015a: 9d popfq
shows:
8,908,857,324 instructions # 0.11 insn per cycle ( +- 0.00% )
Loop with:
400140: fb sti
400141: 53 push %rbx
400142: 9d popfq
shows:
8,920,243,701 instructions # 0.10 insn per cycle ( +- 0.01% )
Even a loop with neither CLI nor STI, only POPF:
400140: 53 push %rbx
400141: 9d popfq
shows:
6,079,936,714 instructions # 0.10 insn per cycle ( +- 0.00% )
This is on a Skylake CPU.
The gist of it:
CLI is 2 cycles,
STI is 4 cycles,
POPF is 10 cycles
seemingly regardless of prior value of EFLAGS.IF.
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-18 12:18 ` Denys Vlasenko
@ 2016-08-18 17:22 ` Paolo Bonzini
0 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2016-08-18 17:22 UTC (permalink / raw)
To: Denys Vlasenko, Linus Torvalds
Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
Adrian Hunter
On 18/08/2016 14:18, Denys Vlasenko wrote:
>
> - 2,588,839,023 branches # 977.555 M/sec ( +- 0.02% )
> + 2,599,319,615 branches # 1046.786 M/sec ( +- 0.04% )
> - 3,620,273 branch-misses # 0.14% of all branches ( +- 0.67% )
> + 3,577,771 branch-misses # 0.14% of all branches ( +- 0.69% )
> - 2.648799072 seconds time elapsed ( +- 0.24% )
> + 2.487452268 seconds time elapsed ( +- 0.31% )
>
> Good, we run more insns/cycle, as expected. However, there are a bit more branches.
Can you see where the missed branches are? Assuming the branch misses are
the cases where IF=0, perhaps there are a few places that can be changed
to spin_lock/unlock_irq or local_irq_disable/enable.
Paolo
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-18 13:26 ` Denys Vlasenko
@ 2016-08-18 17:24 ` Linus Torvalds
2016-08-18 17:47 ` Denys Vlasenko
2016-08-19 10:54 ` Paolo Bonzini
0 siblings, 2 replies; 25+ messages in thread
From: Linus Torvalds @ 2016-08-18 17:24 UTC (permalink / raw)
To: Denys Vlasenko
Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
Adrian Hunter
On Thu, Aug 18, 2016 at 6:26 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>
> I didn't do CPL0 tests yet. Realized that cli/sti can be tested in userspace
> if we set iopl(3) first.
Yes, but it might not be the same. So the timings could be very
different from a cpl0 case.
Also:
> Surprisingly, STI is slower than CLI. A loop with 27 CLI's and one STI
> converges to about ~0.5 insn/cycle:
You really really should not test "sti" together with an immediately
following sti or cli.
The sti instruction has an architecturally defined magical
one-instruction window following it when interrupts stay disabled. I
could easily see that resulting in strange special cases - Intel
actually at some point documented that a sequence of 'sti'
instructions is not going to keep interrupts disabled forever (there was a
question of what happens if you start out with interrupts disabled, go
to a 16-bit code segment that is all filled with "sti" instructions so
that the 16-bit EIP will wrap around and continually do an infinite
series of 'sti' - do interrupts ever get enabled?)
I think intel clarified that when you have a sequence of 'sti'
instructions, interrupts will get enabled after the second one, but
the point is that this is all "special" from a front-end angle. So
putting multiple 'sti' instructions in a bunch may be testing the
magical special case more than it would test anything *real*.
So at a minimum, make the sequence be "sti; nop" if you do it in a
loop. It may not change anything, but at least that way you'll know it
doesn't just test the magical case.
Realistically, it's better to instead test a *real* instruction
sequence, ie just compare something like
pushf
cli
.. do a memory operation here or something half-way real ..
pop
sti
and
pushf
cli
.. do the same half-way real memory op here ..
popf
and see which one is faster in a loop.
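A runnable version of that comparison might look like the sketch below (my
reconstruction, assuming iopl(3) to permit cli/sti at CPL 3, with an incq as
the "half-way real" memory op; build twice, with and without -DUSE_POPF, and
compare the two binaries with perf stat):

#include <sys/io.h>

static unsigned long counter;   /* the "half-way real" memory operation */

int main(void)
{
        unsigned long i;

        if (iopl(3))            /* allow cli/sti (and popf of IF); needs root */
                return 1;
        for (i = 0; i < 200UL*1000*1000; i++)
#ifdef USE_POPF
                /* variant 2: restore saved flags with popf */
                asm volatile("pushf; cli; incq %0; popf"
                             : "+m" (counter) : : "memory", "cc");
#else
                /* variant 1: pop saved flags into a scratch reg, then sti */
                asm volatile("pushf; cli; incq %0; pop %%rcx; sti"
                             : "+m" (counter) : : "rcx", "memory", "cc");
#endif
        return 0;
}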
That said, your numbers really aren't very convincing. If popf really
is just 10 cycles on modern Intel hardware, it's already fast enough
that I really don't think it matters. Especially with "sti" being ~4
cycles, and there being a question about branch overhead anyway. You
win some, you lose some, but on the whole it sounds like "leave it
alone" wins.
Now, I know for a fact that there have been other x86 uarchitectures
that sucked at "popf", but they may suck almost equally at "sti". So
this might well be worth testing out on something that isn't Skylake.
Modern intel cores really are pretty good at even the slow operations.
Things used to be much much worse in the bad old P4 days. I'm very
impressed with the big intel cores.
Linus
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-18 17:24 ` Linus Torvalds
@ 2016-08-18 17:47 ` Denys Vlasenko
2016-08-18 17:49 ` Denys Vlasenko
2016-08-19 10:54 ` Paolo Bonzini
1 sibling, 1 reply; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-18 17:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
Adrian Hunter
On 08/18/2016 07:24 PM, Linus Torvalds wrote:
> That said, your numbers really aren't very convincing. If popf really
> is just 10 cycles on modern Intel hardware, it's already fast enough
> that I really don't think it matters.
It's 20 cycles. I was wrong in my email: I forgot that the insn count
also counts the "push %rbx" insns.
Since I already made a mistake, let me double-check.
200 million iterations of this loop execute under 17 seconds:
400100: b8 00 c2 eb 0b mov $0xbebc200,%eax # 1000*1000*1000
400105: 9c pushfq
400106: 5b pop %rbx
400107: 90 nop
....
0000000000400140 <loop>:
400140: 53 push %rbx
400141: 9d popfq
400142: 53 push %rbx
400143: 9d popfq
400144: 53 push %rbx
400145: 9d popfq
400146: 53 push %rbx
400147: 9d popfq
400148: 53 push %rbx
400149: 9d popfq
40014a: 53 push %rbx
40014b: 9d popfq
40014c: 53 push %rbx
40014d: 9d popfq
40014e: 53 push %rbx
40014f: 9d popfq
400150: 53 push %rbx
400151: 9d popfq
400152: 53 push %rbx
400153: 9d popfq
400154: 53 push %rbx
400155: 9d popfq
400156: 53 push %rbx
400157: 9d popfq
400158: 53 push %rbx
400159: 9d popfq
40015a: 53 push %rbx
40015b: 9d popfq
40015c: ff c8 dec %eax
40015e: 75 e0 jne 400140 <loop>
The loop is exactly 32 bytes, aligned.
There are 14 POPFs. Other insns are very fast.
No perf, just "time taskset 1 ./test".
My CPU frequency hovers around 3500 MHz when loaded.
17 seconds is 17*3500 million cycles.
17*3500 million cycles / (200*14 million POPFs) = 21.25 cycles per POPF
Thus, one POPF in CPL3 is ~20 cycles on Skylake.
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-18 17:47 ` Denys Vlasenko
@ 2016-08-18 17:49 ` Denys Vlasenko
0 siblings, 0 replies; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-18 17:49 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
Adrian Hunter
On 08/18/2016 07:47 PM, Denys Vlasenko wrote:
> On 08/18/2016 07:24 PM, Linus Torvalds wrote:
>> That said, your numbers really aren't very convincing. If popf really
>> is just 10 cycles on modern Intel hardware, it's already fast enough
>> that I really don't think it matters.
>
> It's 20 cycles. I was wrong in my email, I forgot that the insn count
> also counts "push %ebx" insns.
>
> Since I already made a mistake, let me double-check.
>
> 200 million iterations of this loop execute under 17 seconds:
>
> 400100: b8 00 c2 eb 0b mov $0xbebc200,%eax # 1000*1000*1000
Grr. It's 200*1000*1000, not one billion. Sorry.
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-18 17:24 ` Linus Torvalds
2016-08-18 17:47 ` Denys Vlasenko
@ 2016-08-19 10:54 ` Paolo Bonzini
2016-08-31 11:12 ` Denys Vlasenko
1 sibling, 1 reply; 25+ messages in thread
From: Paolo Bonzini @ 2016-08-19 10:54 UTC (permalink / raw)
To: Linus Torvalds, Denys Vlasenko
Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
Adrian Hunter
On 18/08/2016 19:24, Linus Torvalds wrote:
>> > I didn't do CPL0 tests yet. Realized that cli/sti can be tested in userspace
>> > if we set iopl(3) first.
> Yes, but it might not be the same. So the timings could be very
> different from a cpl0 case.
FWIW I recently measured around 20 cycles for a popf as well on
Haswell-EP and CPL=0 (that was for commit f2485b3e0c6c, "KVM: x86: use
guest_exit_irqoff", 2016-07-01).
Paolo
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
2016-08-19 10:54 ` Paolo Bonzini
@ 2016-08-31 11:12 ` Denys Vlasenko
0 siblings, 0 replies; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-31 11:12 UTC (permalink / raw)
To: Paolo Bonzini, Linus Torvalds
Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
Adrian Hunter
On 08/19/2016 12:54 PM, Paolo Bonzini wrote:
> On 18/08/2016 19:24, Linus Torvalds wrote:
>>>> I didn't do CPL0 tests yet. Realized that cli/sti can be tested in userspace
>>>> if we set iopl(3) first.
>> Yes, but it might not be the same. So the timings could be very
>> different from a cpl0 case.
>
> FWIW I recently measured around 20 cycles for a popf as well on
> Haswell-EP and CPL=0 (that was for commit f2485b3e0c6c, "KVM: x86: use
> guest_exit_irqoff", 2016-07-01).
Thanks for the confirmation.
I revisited benchmarking of the
if (flags & X86_EFLAGS_IF)
native_irq_enable();
patch. In "make -j20" kernel compiles on a 8-way (HT) CPU, it shows some ~5 second
improvement during ~16 minute compile. That's 0.5% speedup. It's ok, but not
something to bee too excited.
The disassembly of vmlinux shows that gcc generates these asm patterns:
80 e6 02 and $0x2,%dh
74 01 je ffffffff810101ae <intel_pt_handle_vmx+0x3e>
fb sti

41 f6 86 91 00 00 00 02 testb $0x2,0x91(%r14)
74 01 je ffffffff81013ce7 <math_error+0x77>
fb sti

f6 83 91 00 00 00 02 testb $0x2,0x91(%rbx)
74 01 je ffffffff81013efa <do_int3+0xba>
fb sti

41 f7 c4 00 02 00 00 test $0x200,%r12d
74 01 je ffffffff8101615d <oops_end+0x5d>
fb sti

Here we trade a 20-cycle POPF for either a 4-cycle STI, or a branch (which is
either ~1 cycle if predicted, or ~20 cycles if mispredicted).
I still think a dedicated instruction for a conditional STI is worth asking for.
Along the lines of "If bit 9 in the r/m argument is set, then STI, else nothing".
What do people from CPU companies say?