* RFC: Petition Intel/AMD to add POPF_IF insn
@ 2016-08-17 17:20 Denys Vlasenko
2016-08-17 17:30 ` Christian König
2016-08-17 17:32 ` Linus Torvalds
0 siblings, 2 replies; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-17 17:20 UTC (permalink / raw)
To: Ingo Molnar, Andy Lutomirski, Dan Williams, Alex Deucher,
Rafael J. Wysocki, Christian König, Vinod Koul,
Johannes Berg, Sara Sharon, Adrian Hunter, Linus Torvalds
Cc: x86, LKML
Last year, a proposal was floated to avoid costly POPF.
In particular, each spin_unlock_irqrestore() does it,
a rather common operation.
https://lkml.org/lkml/2015/4/21/290
[RFC PATCH] x86/asm/irq: Don't use POPF but STI
* Andy Lutomirski <luto@kernel.org> wrote:
> > Another different approach would be to formally state that
> > pv_irq_ops.save_fl() needs to return all the flags, which would
> > make local_irq_save() safe to use in this circumstance, but that
> > makes a hotpath longer for the sake of a single boot time check.
>
> ...which reminds me:
>
> Why does native_restore_fl restore anything other than IF? A branch
> and sti should be considerably faster than popf.
Ingo agreed:
====
Yes, this has come up in the past, something like the patch below?
Totally untested and not signed off yet: because we'd first have to
make sure (via irq flags debugging) that it's not used in reverse, to
re-disable interrupts:
local_irq_save(flags);
local_irq_enable();
...
local_irq_restore(flags); /* effective local_irq_disable() */
I don't think we have many (any?) such patterns left, but it has to be
checked first. If we have such cases then we'll have to use different
primitives there.
====
Linus replied:
=====
"popf" is fast for the "no changes to IF" case, and is a smaller
instruction anyway.
====
This basically shot down the proposal.
But in my measurements POPF is not fast even in the case where the
restored flags are not changed at all:
mov $200*1000*1000, %eax
pushf
pop %rbx
.balign 64
loop: push %rbx
popf
dec %eax
jnz loop
# perf stat -r20 ./popf_1g
4,929,012,093 cycles # 3.412 GHz ( +- 0.02% )
835,721,371 instructions # 0.17 insn per cycle ( +- 0.02% )
1.446185359 seconds time elapsed ( +- 0.46% )
If I replace POPF with a pop into an unused register, I get this:
loop: push %rbx
pop %rcx
dec %eax
jnz loop
209,247,645 cycles # 3.209 GHz ( +- 0.11% )
801,898,016 instructions # 3.83 insn per cycle ( +- 0.00% )
0.066210725 seconds time elapsed ( +- 0.59% )
IOW, POPF takes at least 6 cycles.
Linus does have a point that "test+branch+STI" may end up not being a
clear win, because of the branch.

But the need to restore the IF flag exists not only in Linux, but in
any OS running on x86: they all have some sort of spinlock.

The addition of a POPF instruction variant which looks only at the IF
bit, and changes only that bit in EFLAGS, may be a good idea for all
OSes.
I propose that we ask Intel / AMD to do that.
Maybe via the "ignored prefix" trick which was used when the LZCNT insn
was introduced as REPZ-prefixed BSR?

Currently, the REPZ POPF (f3 9d) insn does execute. Redefine this
opcode as POPF_IF, and the same kernel will work on both old and new
CPUs.
CC'ing some @intel and @amd emails...
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 17:20 RFC: Petition Intel/AMD to add POPF_IF insn Denys Vlasenko
@ 2016-08-17 17:30 ` Christian König
  2016-08-17 18:34   ` Denys Vlasenko
  2016-08-17 17:32 ` Linus Torvalds
  1 sibling, 1 reply; 25+ messages in thread
From: Christian König @ 2016-08-17 17:30 UTC (permalink / raw)
  To: Denys Vlasenko, Ingo Molnar, Andy Lutomirski, Dan Williams,
	Alex Deucher, Rafael J. Wysocki, Vinod Koul, Johannes Berg,
	Sara Sharon, Adrian Hunter, Linus Torvalds
  Cc: x86, LKML

Well, first of all you actually CCed AMD's graphics developers, so Alex
and I can't say much about CPU optimizations. Apart from that, one
comment below.

Am 17.08.2016 um 19:20 schrieb Denys Vlasenko:
> [...]
> But in my measurements POPF is not fast even in the case where the
> restored flags are not changed at all:
>
> mov $200*1000*1000, %eax
> pushf
> pop %rbx
> .balign 64
> loop: push %rbx
> popf
> dec %eax
> jnz loop
>
> # perf stat -r20 ./popf_1g
> 4,929,012,093 cycles        # 3.412 GHz            ( +- 0.02% )
>   835,721,371 instructions  # 0.17 insn per cycle  ( +- 0.02% )
>   1.446185359 seconds time elapsed                 ( +- 0.46% )
>
> If I replace POPF with a pop into an unused register, I get this:

You are comparing apples and bananas here. The microcode path in the
CPUs will probably throw away the read if you don't actually use the
register value.

Regards,
Christian.

> loop: push %rbx
> pop %rcx
> dec %eax
> jnz loop
>
> 209,247,645 cycles        # 3.209 GHz            ( +- 0.11% )
> 801,898,016 instructions  # 3.83 insn per cycle  ( +- 0.00% )
> 0.066210725 seconds time elapsed                 ( +- 0.59% )
> [...]

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 17:30 ` Christian König
@ 2016-08-17 18:34   ` Denys Vlasenko
  0 siblings, 0 replies; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-17 18:34 UTC (permalink / raw)
  To: Christian König, Ingo Molnar, Andy Lutomirski, Dan Williams,
	Alex Deucher, Rafael J. Wysocki, Vinod Koul, Johannes Berg,
	Sara Sharon, Adrian Hunter, Linus Torvalds
  Cc: x86, LKML

On 08/17/2016 07:30 PM, Christian König wrote:
>> But in my measurements POPF is not fast even in the case where the
>> restored flags are not changed at all:
>> [...]
>> If I replace POPF with a pop into an unused register, I get this:
>
> You are comparing apples and bananas here.

Yes, I know. Pop into a register here is basically free.

I'd also add a STI and measure how long it takes to enable interrupts,
but unfortunately STI throws a #GP in CPL 3.

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 17:20 RFC: Petition Intel/AMD to add POPF_IF insn Denys Vlasenko
  2016-08-17 17:30 ` Christian König
@ 2016-08-17 17:32 ` Linus Torvalds
  2016-08-17 18:41   ` Denys Vlasenko
  1 sibling, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 17:32 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Ingo Molnar, Andy Lutomirski, Dan Williams, Alex Deucher,
	Rafael J. Wysocki, Christian König, Vinod Koul, Johannes Berg,
	Sara Sharon, Adrian Hunter, the arch/x86 maintainers, LKML

On Wed, Aug 17, 2016 at 10:20 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> Last year, a proposal was floated to avoid costly POPF.
> In particular, each spin_unlock_irqrestore() does it,
> a rather common operation.

Quite frankly, it takes so long to get new hw features that I don't
think it's worth it for something like this.

I'd rather see people play around with just the conditional "sti"
model instead and see how well that works. *Particularly* if we inline
that part of spin_unlock_irqrestore(), the branch prediction logic may
work very well. And even if not, it might work better than popf.

But somebody would need to run the tests on an actual load
(microbenchmarks wouldn't work, because they wouldn't show the
different branch prediction cases).

Power has been using "just do IF in software", which is also an
alternative, but it's much more complicated (you take the interrupt,
notice that the sw disable flag is set, don't actually run the irq but
set a flag and disable irqs for _real_ and return. Then you replay the
irq when enabling the software irq flag again).

             Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 17:32 ` Linus Torvalds
@ 2016-08-17 18:41   ` Denys Vlasenko
       [not found]     ` <CA+55aFwmxwQBkyDjombS4cy1q=a_buhmDjDnaa8rdC8ZDaDYEA@mail.gmail.com>
  0 siblings, 1 reply; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-17 18:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Andy Lutomirski, Dan Williams, Alex Deucher,
	Rafael J. Wysocki, Christian König, Vinod Koul, Johannes Berg,
	Sara Sharon, Adrian Hunter, the arch/x86 maintainers, LKML

On 08/17/2016 07:32 PM, Linus Torvalds wrote:
> On Wed, Aug 17, 2016 at 10:20 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>> Last year, a proposal was floated to avoid costly POPF.
>> In particular, each spin_unlock_irqrestore() does it,
>> a rather common operation.
>
> Quite frankly, it takes so long to get new hw features that I don't
> think it's worth it for something like this.

True, it will take some ~5 years for new hardware to become widely
used. OTOH 5 years will inevitably pass. Just like now 32-bit Linux
userspace is commonly compiled for the "i686" instruction set, because
enough time has passed that you can safely assume most people run at
least a Pentium Pro-level CPU, with CMOV, CMPXCHG, etc.

If the new instruction is implemented with the "REPZ POPF" encoding,
no recompilation or alternatives will be needed for the new kernel to
run on an old CPU. On an old CPU, the entire EFLAGS will be restored.
On a new one, only EFLAGS.IF.

^ permalink raw reply	[flat|nested] 25+ messages in thread
[parent not found: <CA+55aFwmxwQBkyDjombS4cy1q=a_buhmDjDnaa8rdC8ZDaDYEA@mail.gmail.com>]
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
       [not found] ` <CA+55aFwmxwQBkyDjombS4cy1q=a_buhmDjDnaa8rdC8ZDaDYEA@mail.gmail.com>
@ 2016-08-17 19:13   ` Andy Lutomirski
  2016-08-17 19:26     ` Denys Vlasenko
  2016-08-17 19:29     ` Linus Torvalds
  0 siblings, 2 replies; 25+ messages in thread
From: Andy Lutomirski @ 2016-08-17 19:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Denys Vlasenko, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

On Wed, Aug 17, 2016 at 12:01 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Aug 17, 2016 11:41 AM, "Denys Vlasenko" <dvlasenk@redhat.com> wrote:
>>
>> OTOH 5 years will inevitably pass.
>
> Yes. But in five years, maybe we'll have a popf that is faster anyway.
>
> I'd actually prefer that in the end. The problem with popf right now
> seems to be mainly that it's effectively serializing and does stupid
> things in microcode. It doesn't have to be that way. It could actually
> do much better, but it hasn't been a high enough priority for Intel.

It wouldn't surprise me if that were easier said than done. popf
potentially changes AC, and AC affects address translation. popf also
potentially changes IOPL, and I don't know whether Intel chips track
IOPL in a way that lets them find all the dependent instructions
without serializing. But maybe their pipeline is fancy enough.

Personally, I still expect that a simple branch-and-sti is the way to
go. It wouldn't shock me if even a mispredicted branch and STI were
faster than POPF.

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 19:13 ` Andy Lutomirski
@ 2016-08-17 19:26   ` Denys Vlasenko
  2016-08-17 19:32     ` Linus Torvalds
  1 sibling, 1 reply; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-17 19:26 UTC (permalink / raw)
  To: Andy Lutomirski, Linus Torvalds
  Cc: Sara Sharon, Dan Williams, Christian König, Vinod Koul,
	Alex Deucher, Johannes Berg, Rafael J. Wysocki, Andy Lutomirski,
	the arch/x86 maintainers, Ingo Molnar, LKML, Adrian Hunter

On 08/17/2016 09:13 PM, Andy Lutomirski wrote:
> It wouldn't surprise me if that were easier said than done. popf
> potentially changes AC, and AC affects address translation. popf also
> potentially changes IOPL, and I don't know whether Intel chips track
> IOPL in a way that lets them find all the dependent instructions
> without serializing. But maybe their pipeline is fancy enough.

Exactly. And more: POPF potentially changes TF (and that works even in
CPL 3). POPF changes DF - it must serialize against string insns. POPF
changes NT - it must serialize against IRET. POPF changes VIF, from a
different bit in the popped value, and under rather complex conditions.

Intel's documentation has pseudo-code for each instruction. For POPF,
that pseudo-code is two pages long...

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 19:26 ` Denys Vlasenko
@ 2016-08-17 19:32   ` Linus Torvalds
  2016-08-17 19:35     ` Denys Vlasenko
  2016-08-17 19:37     ` Linus Torvalds
  0 siblings, 2 replies; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 19:32 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

On Wed, Aug 17, 2016 at 12:26 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>
> Exactly. And more:

All of which is ENTIRELY IRRELEVANT.

The 2-page pseudo-code is about behavior. It's trivial to
short-circuit. There are obvious optimizations, which involve just
making the rare cases go slow and having a trivial "if those bits
didn't change, don't go through the expense of testing them".

The problem is that IF is almost certainly right now in that rare-case
mask, and so popf is stupidly slow for IF.

That in no way means that it *has* to be slow for IF, and more
importantly, the other flags don't even come into play. Forget about
them.

             Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 19:32 ` Linus Torvalds
@ 2016-08-17 19:35   ` Denys Vlasenko
  2016-08-17 19:54     ` Linus Torvalds
  1 sibling, 1 reply; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-17 19:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

On 08/17/2016 09:32 PM, Linus Torvalds wrote:
> The problem is that IF is almost certainly right now in that rare case
> mask, and so popf is stupidly slow for IF.

I ran the test, see the first email in the thread. Experimentally, POPF
is stupidly slow _always_: 6 cycles even if none of the "scary" flags
are changed. Either:

* its microcode has no rare-case mask, or
* its microcode is so slow that even the fast path is slow, and the
  slow path is worse.

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 19:35 ` Denys Vlasenko
@ 2016-08-17 19:54   ` Linus Torvalds
  0 siblings, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 19:54 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

On Wed, Aug 17, 2016 at 12:35 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>
> Experimentally, POPF is stupidly slow _always_. 6 cycles
> even if none of the "scary" flags are changed.

6 cycles is nothing. That's basically the overhead of "oops, I need to
use the microcode sequencer".

One issue is that the Intel decoders (AMD too, for that matter) can
only generate a fairly small set of uops for any instruction. Some
instructions are really trivial to decode (popf definitely falls under
that heading), but are more than just a couple of uops, so you end up
having to use the uop-sequencer logic.

According to Agner Fog's tables, there are one or two
micro-architectures that actually do the simple "popf" case with a
single-cycle throughput, but that's the very unusual case.

You can't even fit the "pop a value, see if only the arithmetic flags
changed, trap to microcode otherwise" into the three or four uops that
the "complex decoder" can generate directly. And that "fall back to
the uop sequencer engine" tends to just always cost several cycles
regardless.

So yes, microcode tends to be slow even for what would otherwise be
trivial operations. You'd think Intel could do as well as they do for
the L0 uop cache, but afaik they don't.

Anyway, six cycles is fast. I'd *love* for popf to actually be just 6
cycles when IF changes. It's much much worse iirc (although honestly,
I haven't timed it in years - it's much easier to time just the
arithmetic flag changes). It used to be more like a hundred cycles on
Prescott.

             Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 19:32 ` Linus Torvalds
  2016-08-17 19:35   ` Denys Vlasenko
@ 2016-08-17 19:37   ` Linus Torvalds
  2016-08-17 21:26     ` Linus Torvalds
  1 sibling, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 19:37 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

On Wed, Aug 17, 2016 at 12:32 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The 2-page pseudo-code is about behavior. It's trivial to
> short-circuit. There are obvious optimizations, which involve just
> making the rare cases go slow and have a trivial "if those bits didn't
> change, don't go through the expense of testing them".

That said, the first thing we should test is to just see how the
"let's just optimize this in software" thing works.

Replace the "popf" with "if (val & X86_EFLAGS_IF) local_irq_enable();"
and see how that works out. Play with inlining it or not, and see if
the branch predictor matters.

We may simply not need popf optimizations.

             Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 19:37 ` Linus Torvalds
@ 2016-08-17 21:26   ` Linus Torvalds
  2016-08-17 21:35     ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 21:26 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

[-- Attachment #1: Type: text/plain, Size: 1414 bytes --]

On Wed, Aug 17, 2016 at 12:37 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Replace the "popf" with "if (val & X86_EFLAGS_IF) local_irq_enable();"
> and see how that works out. Play with inlining it or not, and see if
> the branch predictor matters.

.. actually, thinking a bit more about it, I really don't think the
branch predictor will even matter.

We sure as hell shouldn't have nested irq-safe interrupts in paths
that matter from a performance angle, so the normal case for
spin_unlock_irqrestore() should be to enable interrupts again.

And if interrupts are disabled because the caller is actually in
interrupt context, I don't think the branch prediction is going to
matter, compared to the irq overhead.

So test this trivial patch. It's ENTIRELY UNTESTED. It may be complete
crap and not even compile. But I did test it on
kernel/locking/spinlock.c, and the generated code is beautiful:

  _raw_spin_unlock_irqrestore:
	testl	$512, %esi	#, flags
	movb	$0, (%rdi)	#, MEM[(volatile __u8 *)lock_2(D)]
	je	.L2
	sti
  .L2:
	ret

so maybe the silly popf has always just been stupid.

Of course, if somebody uses native_restore_fl() to actually *disable*
interrupts (when they weren't already disabled), then this untested
patch will just not work. But why would you do something so stupid?
Famous last words...

             Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1127 bytes --]

 arch/x86/include/asm/irqflags.h | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index b77f5edb03b0..76c4ebfab0be 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -8,6 +8,16 @@
  * Interrupt control:
  */
 
+static inline void native_irq_disable(void)
+{
+	asm volatile("cli": : :"memory");
+}
+
+static inline void native_irq_enable(void)
+{
+	asm volatile("sti": : :"memory");
+}
+
 static inline unsigned long native_save_fl(void)
 {
 	unsigned long flags;
@@ -28,20 +38,8 @@ static inline unsigned long native_save_fl(void)
 
 static inline void native_restore_fl(unsigned long flags)
 {
-	asm volatile("push %0 ; popf"
-		     : /* no output */
-		     :"g" (flags)
-		     :"memory", "cc");
-}
-
-static inline void native_irq_disable(void)
-{
-	asm volatile("cli": : :"memory");
-}
-
-static inline void native_irq_enable(void)
-{
-	asm volatile("sti": : :"memory");
+	if (flags & X86_EFLAGS_IF)
+		native_irq_enable();
 }
 
 static inline void native_safe_halt(void)

^ permalink raw reply related	[flat|nested] 25+ messages in thread
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 21:26 ` Linus Torvalds
@ 2016-08-17 21:35   ` Linus Torvalds
  2016-08-17 21:43     ` Andy Lutomirski
  2016-08-18  9:21     ` Denys Vlasenko
  0 siblings, 2 replies; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 21:35 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

On Wed, Aug 17, 2016 at 2:26 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Of course, if somebody uses native_restore_fl() to actually *disable*
> interrupts (when they weren't already disabled), then this untested
> patch will just not work. But why would you do something so stupid?
> Famous last words...

Looking around, the vsmp code actually uses "native_restore_fl()" to
not just set the interrupt flag, but AC as well.

And the PV spinlock case has that "push;popf" sequence encoded in an
alternate.

So that trivial patch may (or may not - still not tested) work for
some quick testing, but needs more effort for any *real* use.

             Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 21:35 ` Linus Torvalds
@ 2016-08-17 21:43   ` Andy Lutomirski
  2016-08-17 21:48     ` Linus Torvalds
  2016-08-18  9:21 ` Denys Vlasenko
  1 sibling, 1 reply; 25+ messages in thread
From: Andy Lutomirski @ 2016-08-17 21:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Denys Vlasenko, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

On Wed, Aug 17, 2016 at 2:35 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Looking around, the vsmp code actually uses "native_restore_fl()" to
> not just set the interrupt flag, but AC as well.
>
> And the PV spinlock case has that "push;popf" sequence encoded in an
> alternate.
>
> So that trivial patch may (or may not - still not tested) work for
> some quick testing, but needs more effort for any *real* use.

It shouldn't be *too* bad, since xen_restore_fl only affects IF. And
even if native_restore_fl needs to be able to turn IRQs off as well as
on, we can just do:

	if (likely(flags & X86_EFLAGS_IF))
		sti();
	else
		cli();

at some cost to code size, but hopefully little to no runtime cost for
the sane cases.

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 21:43 ` Andy Lutomirski
@ 2016-08-17 21:48   ` Linus Torvalds
  2016-08-18 13:26     ` Denys Vlasenko
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 21:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

On Wed, Aug 17, 2016 at 2:43 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> It shouldn't be *too* bad, since xen_restore_fl only affects IF.
> And even if native_restore_fl needs to be able to turn IRQs off as
> well as on, we can just do:
>
>	if (likely(flags & X86_EFLAGS_IF))
>		sti();
>	else
>		cli();
>
> at some cost to code size, but hopefully little to no runtime cost
> for the sane cases.

No, that would be horrible for the case where we had an irq spinlock
in an irq region. Then we'd have a pointless "cli" there.

So I'd rather just make sure that only the spinlock code actually uses
native_restore_fl().

And yes, the patch works at least minimally, since I'm writing this
with a kernel compiled with that.

Of course, somebody really should do timings on modern CPUs (in cpl0,
comparing native_restore_fl() that enables interrupts with a popf).

             Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 21:48 ` Linus Torvalds
@ 2016-08-18 13:26   ` Denys Vlasenko
  2016-08-18 17:24     ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-18 13:26 UTC (permalink / raw)
  To: Linus Torvalds, Andy Lutomirski
  Cc: Sara Sharon, Dan Williams, Christian König, Vinod Koul,
	Alex Deucher, Johannes Berg, Rafael J. Wysocki, Andy Lutomirski,
	the arch/x86 maintainers, Ingo Molnar, LKML, Adrian Hunter

> Of course, somebody really should do timings on modern CPUs (in cpl0,
> comparing native_restore_fl() that enables interrupts with a popf)

I didn't do CPL 0 tests yet. I realized that cli/sti can be tested in
userspace if we set iopl(3) first.

Surprisingly, STI is slower than CLI. A loop with 27 CLIs and one STI
converges to about ~0.5 insn/cycle:

# compile with: gcc -nostartfiles -nostdlib
_start:	.globl	_start
	mov	$172, %eax	# iopl
	mov	$3, %edi
	syscall

	mov	$200*1000*1000, %eax
	.balign	64
loop:	cli;cli;cli;cli
	cli;cli;cli;cli
	cli;cli;cli;cli
	cli;cli;cli;cli
	cli;cli;cli;cli
	cli;cli;cli;cli
	cli;cli;cli;sti
	dec	%eax
	jnz	loop

	mov	$231, %eax	# exit_group
	syscall

perf stat:

 6,015,787,968 instructions   # 0.52 insn per cycle
   3.355474199 seconds time elapsed

With all CLIs replaced by STIs, it's ~0.25 insn/cycle:

 6,030,530,328 instructions   # 0.27 insn per cycle
   6.547200322 seconds time elapsed

A POPF which needs to enable interrupts is not measurably faster than
one which does not change EFLAGS.IF. A loop with:

  400158: fa	cli
  400159: 53	push %rbx	# saved eflags with IF=1
  40015a: 9d	popfq

shows:

 8,908,857,324 instructions   # 0.11 insn per cycle  ( +- 0.00% )

A loop with:

  400140: fb	sti
  400141: 53	push %rbx
  400142: 9d	popfq

shows:

 8,920,243,701 instructions   # 0.10 insn per cycle  ( +- 0.01% )

Even a loop with neither CLI nor STI, only with POPF:

  400140: 53	push %rbx
  400141: 9d	popfq

shows:

 6,079,936,714 instructions   # 0.10 insn per cycle  ( +- 0.00% )

This is on a Skylake CPU.

The gist of it: CLI is 2 cycles, STI is 4 cycles, POPF is 10 cycles,
seemingly regardless of the prior value of EFLAGS.IF.

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: RFC: Petition Intel/AMD to add POPF_IF insn 2016-08-18 13:26 ` Denys Vlasenko @ 2016-08-18 17:24 ` Linus Torvalds 2016-08-18 17:47 ` Denys Vlasenko 2016-08-19 10:54 ` Paolo Bonzini 0 siblings, 2 replies; 25+ messages in thread From: Linus Torvalds @ 2016-08-18 17:24 UTC (permalink / raw) To: Denys Vlasenko Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König, Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki, Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML, Adrian Hunter On Thu, Aug 18, 2016 at 6:26 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote: > > I didn't do CPL0 tests yet. Realized that cli/sti can be tested in userspace > if we set iopl(3) first. Yes, but it might not be the same. So the timings could be very different from a cpl0 case. Also: > Surprisingly, STI is slower than CLI. A loop with 27 CLI's and one STI > converges to about ~0.5 insn/cycle: You really really should not check "sti" together with immediately following sti or cli. The sti instruction has an architecturally defined magical one-instruction window following it when interrupts stay disabled. I could easily see that resulting in strange special cases - Intel actually at some point documented that a sequence of 'sti' instructions are not going to disable interrupts forever (there was a question of what happens if you start out with interrupts disabled, go to a 16-bit code segment that is all filled with "sti" instructions so that the 16-bit EIP will wrap around and continually do an infinite series of 'sti' - do interrupts ever get enabled?) I think intel clarified that when you have a sequence of 'sti' instructions, interrupts will get enabled after the second one, but the point is that this is all "special" from a front-end angle. So putting multiple 'sti' instructions in a bunch may be testing the magical special case more than it would test anything *real*. So at a minimum, make the sequence be "sti; nop" if you do it in a loop. 
It may not change anything, but at least that way you'll know it doesn't
just test the magical case.

Realistically, it's better to test a *real* instruction sequence, i.e.
just compare something like

    pushf
    cli
    .. do a memory operation here or something half-way real ..
    pop
    sti

and

    pushf
    cli
    .. do the same half-way real memory op here ..
    popf

and see which one is faster in a loop.

That said, your numbers really aren't very convincing. If popf really is
just 10 cycles on modern Intel hardware, it's already fast enough that I
really don't think it matters. Especially with "sti" being ~4 cycles,
and there being a question about branch overhead anyway. You win some,
you lose some, but on the whole it sounds like "leave it alone" wins.

Now, I know for a fact that there have been other x86 uarchitectures
that sucked at "popf", but they may suck almost equally at "sti". So
this might well be worth testing on something that isn't Skylake.

Modern Intel cores really are pretty good at even the slow operations.
Things used to be much, much worse in the bad old P4 days. I'm very
impressed with the big Intel cores.

Linus
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
From: Denys Vlasenko @ 2016-08-18 17:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
the arch/x86 maintainers, Ingo Molnar, LKML, Adrian Hunter

On 08/18/2016 07:24 PM, Linus Torvalds wrote:
> That said, your numbers really aren't very convincing. If popf really
> is just 10 cycles on modern Intel hardware, it's already fast enough
> that I really don't think it matters.

It's 20 cycles. I was wrong in my email: I forgot that the insn count
also counts the "push %rbx" insns.

Since I already made a mistake, let me double-check. 200 million
iterations of this loop execute in under 17 seconds:

  400100: b8 00 c2 eb 0b        mov    $0xbebc200,%eax  # 1000*1000*1000
  400105: 9c                    pushfq
  400106: 5b                    pop    %rbx
  400107: 90                    nop
  ....
  0000000000400140 <loop>:
  400140: 53                    push   %rbx
  400141: 9d                    popfq
  400142: 53                    push   %rbx
  400143: 9d                    popfq
  400144: 53                    push   %rbx
  400145: 9d                    popfq
  400146: 53                    push   %rbx
  400147: 9d                    popfq
  400148: 53                    push   %rbx
  400149: 9d                    popfq
  40014a: 53                    push   %rbx
  40014b: 9d                    popfq
  40014c: 53                    push   %rbx
  40014d: 9d                    popfq
  40014e: 53                    push   %rbx
  40014f: 9d                    popfq
  400150: 53                    push   %rbx
  400151: 9d                    popfq
  400152: 53                    push   %rbx
  400153: 9d                    popfq
  400154: 53                    push   %rbx
  400155: 9d                    popfq
  400156: 53                    push   %rbx
  400157: 9d                    popfq
  400158: 53                    push   %rbx
  400159: 9d                    popfq
  40015a: 53                    push   %rbx
  40015b: 9d                    popfq
  40015c: ff c8                 dec    %eax
  40015e: 75 e0                 jne    400140 <loop>

The loop is exactly 32 bytes, aligned. There are 14 POPFs; the other
insns are very fast. No perf, just "time taskset 1 ./test". My CPU
frequency hovers around 3500 MHz when loaded. 17 seconds is 17*3500
million cycles.
17*3500 million cycles / (200*14) million POPFs = 21.25 cycles per POPF

Thus, one POPF in CPL3 is ~20 cycles on Skylake.
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
From: Denys Vlasenko @ 2016-08-18 17:49 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
the arch/x86 maintainers, Ingo Molnar, LKML, Adrian Hunter

On 08/18/2016 07:47 PM, Denys Vlasenko wrote:
> On 08/18/2016 07:24 PM, Linus Torvalds wrote:
>> That said, your numbers really aren't very convincing. If popf really
>> is just 10 cycles on modern Intel hardware, it's already fast enough
>> that I really don't think it matters.
>
> It's 20 cycles. I was wrong in my email, I forgot that the insn count
> also counts "push %rbx" insns.
>
> Since I already made a mistake, let me double-check.
> 200 million iterations of this loop execute under 17 seconds:
>
>   400100: b8 00 c2 eb 0b    mov    $0xbebc200,%eax  # 1000*1000*1000

Grr. It's 200*1000*1000, not one billion. Sorry.
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
From: Paolo Bonzini @ 2016-08-19 10:54 UTC (permalink / raw)
To: Linus Torvalds, Denys Vlasenko
Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
the arch/x86 maintainers, Ingo Molnar, LKML, Adrian Hunter

On 18/08/2016 19:24, Linus Torvalds wrote:
>> > I didn't do CPL0 tests yet. Realized that cli/sti can be tested in userspace
>> > if we set iopl(3) first.
> Yes, but it might not be the same. So the timings could be very
> different from a cpl0 case.

FWIW I recently measured around 20 cycles for a popf as well, on
Haswell-EP and at CPL=0 (that was for commit f2485b3e0c6c, "KVM: x86:
use guest_exit_irqoff", 2016-07-01).

Paolo
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
From: Denys Vlasenko @ 2016-08-31 11:12 UTC (permalink / raw)
To: Paolo Bonzini, Linus Torvalds
Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
the arch/x86 maintainers, Ingo Molnar, LKML, Adrian Hunter

On 08/19/2016 12:54 PM, Paolo Bonzini wrote:
> On 18/08/2016 19:24, Linus Torvalds wrote:
>>>> I didn't do CPL0 tests yet. Realized that cli/sti can be tested in userspace
>>>> if we set iopl(3) first.
>> Yes, but it might not be the same. So the timings could be very
>> different from a cpl0 case.
>
> FWIW I recently measured around 20 cycles for a popf as well on
> Haswell-EP and CPL=0 (that was for commit f2485b3e0c6c, "KVM: x86: use
> guest_exit_irqoff", 2016-07-01).

Thanks for the confirmation.

I revisited benchmarking of the

    if (flags & X86_EFLAGS_IF)
        native_irq_enable();

patch. In "make -j20" kernel compiles on an 8-way (HT) CPU, it shows a
~5 second improvement over a ~16 minute compile. That's a 0.5% speedup.
It's OK, but nothing to be too excited about.

The disassembly of vmlinux shows that gcc generates these asm patterns:

    80 e6 02                 and    $0x2,%dh
    74 01                    je     ffffffff810101ae <intel_pt_handle_vmx+0x3e>
    fb                       sti

    41 f6 86 91 00 00 00 02  testb  $0x2,0x91(%r14)
    74 01                    je     ffffffff81013ce7 <math_error+0x77>
    fb                       sti

    f6 83 91 00 00 00 02     testb  $0x2,0x91(%rbx)
    74 01                    je     ffffffff81013efa <do_int3+0xba>
    fb                       sti

    41 f7 c4 00 02 00 00     test   $0x200,%r12d
    74 01                    je     ffffffff8101615d <oops_end+0x5d>
    fb                       sti

Here we trade a 20-cycle POPF for either a 4-cycle STI, or a branch
(which is either ~1 cycle if predicted, or ~20 cycles if mispredicted).

I still think a dedicated instruction for a conditional STI is worth
asking for, along the lines of "if bit 9 in the r/m argument is set,
then STI, else nothing".
What do people from the CPU companies say?
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
From: Denys Vlasenko @ 2016-08-18 9:21 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
the arch/x86 maintainers, Ingo Molnar, LKML, Adrian Hunter

On 08/17/2016 11:35 PM, Linus Torvalds wrote:
> On Wed, Aug 17, 2016 at 2:26 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Of course, if somebody uses native_restore_fl() to actually *disable*
>> interrupts (when they weren't already disabled), then this untested
>> patch will just not work. But why would you do somethign so stupid?
>> Famous last words...
>
> Looking around, the vsmp code actually uses "native_restore_fl()" to
> not just set the interrupt flag, but AC as well.
>
> And the PV spinlock case has that "push;popf" sequence encoded in an alternate.
>
> So that trivial patch may (or may not - still not tested) work for
> some quick testing, but needs more effort for any *real* use.

I'm going to test the attached patch. The PV and debug maze is daunting...
I discovered that Fedora kernels are compiled with a set of .config
options which results in a spin_unlock_irqrestore that looks like this:

ffffffff818d0f40 <_raw_spin_unlock_irqrestore>:
ffffffff818d0f40: e8 1b 31 00 00        callq  ffffffff818d4060 <__fentry__>
            ffffffff818d0f41: R_X86_64_PC32  __fentry__-0x4
ffffffff818d0f45: 55                    push   %rbp
ffffffff818d0f46: 48 89 e5              mov    %rsp,%rbp
ffffffff818d0f49: 41 54                 push   %r12
ffffffff818d0f4b: 53                    push   %rbx
ffffffff818d0f4c: 48 8b 55 08           mov    0x8(%rbp),%rdx
ffffffff818d0f50: 49 89 fc              mov    %rdi,%r12
ffffffff818d0f53: 48 89 f3              mov    %rsi,%rbx
ffffffff818d0f56: 48 83 c7 18           add    $0x18,%rdi
ffffffff818d0f5a: be 01 00 00 00        mov    $0x1,%esi
ffffffff818d0f5f: e8 8c b8 83 ff        callq  ffffffff8110c7f0 <lock_release>
            ffffffff818d0f60: R_X86_64_PC32  lock_release-0x4
ffffffff818d0f64: 4c 89 e7              mov    %r12,%rdi
ffffffff818d0f67: e8 d4 fb 83 ff        callq  ffffffff81110b40 <do_raw_spin_unlock>
            ffffffff818d0f68: R_X86_64_PC32  do_raw_spin_unlock-0x4
ffffffff818d0f6c: f6 c7 02              test   $0x2,%bh
ffffffff818d0f6f: 74 1b                 je     ffffffff818d0f8c <_raw_spin_unlock_irqrestore+0x4c>
ffffffff818d0f71: e8 aa 9d 83 ff        callq  ffffffff8110ad20 <trace_hardirqs_on>
            ffffffff818d0f72: R_X86_64_PC32  trace_hardirqs_on-0x4
ffffffff818d0f76: 48 89 df              mov    %rbx,%rdi
ffffffff818d0f79: ff 14 25 48 32 e2 81  callq  *0xffffffff81e23248
            ffffffff818d0f7c: R_X86_64_32S  pv_irq_ops+0x8
ffffffff818d0f80: 65 ff 0d c9 c4 73 7e  decl   %gs:0x7e73c4c9(%rip)  # d450 <__preempt_count>
            ffffffff818d0f83: R_X86_64_PC32  __preempt_count-0x4
ffffffff818d0f87: 5b                    pop    %rbx
ffffffff818d0f88: 41 5c                 pop    %r12
ffffffff818d0f8a: 5d                    pop    %rbp
ffffffff818d0f8b: c3                    retq
ffffffff818d0f8c: 48 89 df              mov    %rbx,%rdi
ffffffff818d0f8f: ff 14 25 48 32 e2 81  callq  *0xffffffff81e23248
            ffffffff818d0f92: R_X86_64_32S  pv_irq_ops+0x8
ffffffff818d0f96: e8 35 5b 83 ff        callq  ffffffff81106ad0 <trace_hardirqs_off>
            ffffffff818d0f97: R_X86_64_PC32  trace_hardirqs_off-0x4
ffffffff818d0f9b: eb e3                 jmp    ffffffff818d0f80 <_raw_spin_unlock_irqrestore+0x40>
ffffffff818d0f9d: 0f
1f 00                 nopl   (%rax)

Gawd... really, Fedora? Is all this needed? Testing with _this_ is not
going to show any differences; I need to weed all the cruft out and test
a fast, non-debug .config. Changed it like this:

+# CONFIG_UPROBES is not set
+# CONFIG_SCHED_OMIT_FRAME_POINTER is not set
+# CONFIG_HYPERVISOR_GUEST is not set
+# CONFIG_SYS_HYPERVISOR is not set
+# CONFIG_FRAME_POINTER is not set
+# CONFIG_KMEMCHECK is not set
+# CONFIG_DEBUG_LOCK_ALLOC is not set
+# CONFIG_PROVE_LOCKING is not set
+# CONFIG_LOCK_STAT is not set
+# CONFIG_PROVE_RCU is not set
+# CONFIG_LATENCYTOP is not set
+# CONFIG_FTRACE is not set
+# CONFIG_BINARY_PRINTF is not set

Looks better (this is with the patch already applied):

00000000000000f0 <_raw_spin_unlock_irqrestore>:
  f0: 53                    push   %rbx
  f1: 48 89 f3              mov    %rsi,%rbx
  f4: e8 00 00 00 00        callq  f9 <_raw_spin_unlock_irqrestore+0x9>
            f5: R_X86_64_PC32  do_raw_spin_unlock-0x4
  f9: 80 e7 02              and    $0x2,%bh
  fc: 74 01                 je     ff <_raw_spin_unlock_irqrestore+0xf>
  fe: fb                    sti
  ff: 65 ff 0d 00 00 00 00  decl   %gs:0x0(%rip)  # 106 <_raw_spin_unlock_irqrestore+0x16>
            102: R_X86_64_PC32  __preempt_count-0x4
 106: 5b                    pop    %rbx
 107: c3                    retq

This also raises questions, such as "why is do_raw_spin_unlock() not
inlined here?"

Anyway... on to more kernel builds and testing. Please take a look at
the patch below. If it's obviously buggy, let me know.

diff -urp linux-4.7.1.org/arch/x86/include/asm/irqflags.h linux-4.7.1.obj/arch/x86/include/asm/irqflags.h
--- linux-4.7.1.org/arch/x86/include/asm/irqflags.h	2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/include/asm/irqflags.h	2016-08-18 10:01:09.514219644 +0200
@@ -44,6 +44,16 @@ static inline void native_irq_enable(voi
 	asm volatile("sti": : :"memory");
 }
 
+/*
+ * This produces a "test; jz; sti" insn sequence.
+ * It's faster than "push reg; popf". If sti is skipped, it's much faster.
+ */
+static inline void native_cond_irq_enable(unsigned long flags)
+{
+	if (flags & X86_EFLAGS_IF)
+		native_irq_enable();
+}
+
 static inline void native_safe_halt(void)
 {
 	asm volatile("sti; hlt": : :"memory");
@@ -69,7 +79,8 @@ static inline notrace unsigned long arch
 
 static inline notrace void arch_local_irq_restore(unsigned long flags)
 {
-	native_restore_fl(flags);
+	if (flags & X86_EFLAGS_IF)
+		native_irq_enable();
 }
 
 static inline notrace void arch_local_irq_disable(void)
diff -urp linux-4.7.1.org/arch/x86/kernel/paravirt.c linux-4.7.1.obj/arch/x86/kernel/paravirt.c
--- linux-4.7.1.org/arch/x86/kernel/paravirt.c	2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/kernel/paravirt.c	2016-08-18 10:01:24.797155109 +0200
@@ -313,7 +313,7 @@ struct pv_time_ops pv_time_ops = {
 __visible struct pv_irq_ops pv_irq_ops = {
 	.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
-	.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
+	.restore_fl = __PV_IS_CALLEE_SAVE(native_cond_irq_enable),
 	.irq_disable = __PV_IS_CALLEE_SAVE(native_irq_disable),
 	.irq_enable = __PV_IS_CALLEE_SAVE(native_irq_enable),
 	.safe_halt = native_safe_halt,
diff -urp linux-4.7.1.org/arch/x86/kernel/paravirt_patch_32.c linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_32.c
--- linux-4.7.1.org/arch/x86/kernel/paravirt_patch_32.c	2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_32.c	2016-08-18 04:43:19.388102755 +0200
@@ -2,7 +2,7 @@
 DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
 DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
-DEF_NATIVE(pv_irq_ops, restore_fl, "push %eax; popf");
+DEF_NATIVE(pv_irq_ops, restore_fl, "testb $((1<<9)>>8),%ah; jz 1f; sti; 1: ;");
 DEF_NATIVE(pv_irq_ops, save_fl, "pushf; pop %eax");
 DEF_NATIVE(pv_cpu_ops, iret, "iret");
 DEF_NATIVE(pv_mmu_ops, read_cr2, "mov %cr2, %eax");
diff -urp linux-4.7.1.org/arch/x86/kernel/paravirt_patch_64.c linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_64.c
--- linux-4.7.1.org/arch/x86/kernel/paravirt_patch_64.c
2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_64.c	2016-08-18 04:42:56.364545509 +0200
@@ -4,7 +4,7 @@
 DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
 DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
-DEF_NATIVE(pv_irq_ops, restore_fl, "pushq %rdi; popfq");
+DEF_NATIVE(pv_irq_ops, restore_fl, "testw $(1<<9),%di; jz 1f; sti; 1: ;");
 DEF_NATIVE(pv_irq_ops, save_fl, "pushfq; popq %rax");
 DEF_NATIVE(pv_mmu_ops, read_cr2, "movq %cr2, %rax");
 DEF_NATIVE(pv_mmu_ops, read_cr3, "movq %cr3, %rax");
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
From: Denys Vlasenko @ 2016-08-18 12:18 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
the arch/x86 maintainers, Ingo Molnar, LKML, Adrian Hunter

[-- Attachment #1: Type: text/plain, Size: 8089 bytes --]

On 08/18/2016 11:21 AM, Denys Vlasenko wrote:
>>> Of course, if somebody uses native_restore_fl() to actually *disable*
>>> interrupts (when they weren't already disabled), then this untested
>>> patch will just not work. But why would you do somethign so stupid?
>>> Famous last words...
>>
>> Looking around, the vsmp code actually uses "native_restore_fl()" to
>> not just set the interrupt flag, but AC as well.
>>
>> And the PV spinlock case has that "push;popf" sequence encoded in an alternate.
>>
>> So that trivial patch may (or may not - still not tested) work for
>> some quick testing, but needs more effort for any *real* use.
>
> I'm going to test the attached patch.

...
> +# CONFIG_UPROBES is not set
> +# CONFIG_SCHED_OMIT_FRAME_POINTER is not set
> +# CONFIG_HYPERVISOR_GUEST is not set
> +# CONFIG_SYS_HYPERVISOR is not set
> +# CONFIG_FRAME_POINTER is not set
> +# CONFIG_KMEMCHECK is not set
> +# CONFIG_DEBUG_LOCK_ALLOC is not set
> +# CONFIG_PROVE_LOCKING is not set
> +# CONFIG_LOCK_STAT is not set
> +# CONFIG_PROVE_RCU is not set
> +# CONFIG_LATENCYTOP is not set
> +# CONFIG_FTRACE is not set
> +# CONFIG_BINARY_PRINTF is not set

Also need !CONFIG_DEBUG_SPINLOCK; then the unpatched irqrestore is:

ffffffff817115a0 <_raw_spin_unlock_irqrestore>:
ffffffff817115a0: c6 07 00              movb   $0x0,(%rdi)
ffffffff817115a3: 56                    push   %rsi
ffffffff817115a4: 9d                    popfq
ffffffff817115a5: 65 ff 0d e4 ad 8f 7e  decl   %gs:__preempt_count
ffffffff817115ac: c3                    retq
ffffffff817115ad: 0f 1f 00              nopl   (%rax)

and the patched one is:

ffffffff81711660 <_raw_spin_unlock_irqrestore>:
ffffffff81711660: f7 c6 00 02 00 00     test   $0x200,%esi
ffffffff81711666: c6 07 00              movb   $0x0,(%rdi)
ffffffff81711669: 74 01                 je     ffffffff8171166c <_raw_spin_unlock_irqrestore+0xc>
ffffffff8171166b: fb                    sti
ffffffff8171166c: 65 ff 0d 1d ad 8f 7e  decl   %gs:__preempt_count
ffffffff81711673: c3                    retq
ffffffff81711674: 66 90                 xchg   %ax,%ax
ffffffff81711676: 66 2e 0f 1f 84 00 00 00 00 00  nopw   %cs:0x0(%rax,%rax,1)

Ran the following twice on a quiesced machine:

    taskset 1 perf stat -r60 perf bench sched messaging
    taskset 1 perf stat -r60 perf bench sched pipe

Out of these four runs, both "perf bench sched pipe" runs show improvements:

-      2648.279829 task-clock (msec)   #  1.000 CPUs utilized   ( +- 0.24% )
+      2483.143469 task-clock (msec)   #  0.998 CPUs utilized   ( +- 0.31% )
-        2,000,002 context-switches    #  0.755 M/sec           ( +- 0.00% )
+        2,000,013 context-switches    #  0.805 M/sec           ( +- 0.00% )
-              547 page-faults         #  0.206 K/sec           ( +- 0.04% )
+              546 page-faults         #  0.220 K/sec           ( +- 0.04% )
-    8,723,284,926 cycles              #  3.294 GHz             ( +- 0.06% )
+    8,157,949,449 cycles              #  3.285 GHz             ( +- 0.07% )
-   12,286,937,344 instructions        #  1.41 insn per cycle   ( +- 0.03% )
+   12,255,616,405 instructions        #  1.50 insn per cycle   ( +- 0.05% )
-    2,588,839,023 branches            #  977.555 M/sec         ( +- 0.02% )
+    2,599,319,615 branches            # 1046.786 M/sec         ( +- 0.04% )
-        3,620,273 branch-misses       #  0.14% of all branches ( +- 0.67% )
+        3,577,771 branch-misses       #  0.14% of all branches ( +- 0.69% )
-      2.648799072 seconds time elapsed                         ( +- 0.24% )
+      2.487452268 seconds time elapsed                         ( +- 0.31% )

Good, we run more insns/cycle, as expected. However, a bit more branches.

But of the two "perf bench sched messaging" runs, one was slower on the
patched kernel, and perf shows why: more branches, and a larger
branch-miss percentage as well:

-       690.008697 task-clock (msec)   #  0.996 CPUs utilized   ( +- 0.45% )
+       699.526509 task-clock (msec)   #  0.994 CPUs utilized   ( +- 0.28% )
-           26,796 context-switches    #  0.039 M/sec           ( +- 8.31% )
+           29,088 context-switches    #  0.042 M/sec           ( +- 6.62% )
-           35,477 page-faults         #  0.051 M/sec           ( +- 0.11% )
+           35,504 page-faults         #  0.051 M/sec           ( +- 0.14% )
-    2,157,701,609 cycles              #  3.127 GHz             ( +- 0.35% )
+    2,143,781,407 cycles              #  3.065 GHz             ( +- 0.25% )
-    3,115,212,672 instructions        #  1.44 insn per cycle   ( +- 0.28% )
+    3,253,499,549 instructions        #  1.52 insn per cycle   ( +- 0.19% )
-      661,888,593 branches            #  959.247 M/sec         ( +- 0.36% )
+      707,862,655 branches            # 1011.917 M/sec         ( +- 0.20% )
-        2,793,203 branch-misses       #  0.42% of all branches ( +- 1.04% )
+        3,453,397 branch-misses       #  0.49% of all branches ( +- 0.32% )
-      0.693004918 seconds time elapsed                         ( +- 0.45% )
+      0.703630988 seconds time elapsed                         ( +- 0.27% )

This tipped the scales: despite the higher insns/cycle, the run time is worse.
The other "perf bench sched messaging" run was more lucky: - 706.944245 task-clock (msec) # 0.995 CPUs utilized ( +- 0.32% ) + 687.038856 task-clock (msec) # 0.993 CPUs utilized ( +- 0.31% ) - 23,489 context-switches # 0.033 M/sec ( +- 7.02% ) + 26,644 context-switches # 0.039 M/sec ( +- 7.46% ) - 35,360 page-faults # 0.050 M/sec ( +- 0.12% ) + 35,417 page-faults # 0.052 M/sec ( +- 0.13% ) - 2,183,639,816 cycles # 3.089 GHz ( +- 0.35% ) + 2,123,086,753 cycles # 3.090 GHz ( +- 0.27% ) - 3,131,362,238 instructions # 1.43 insn per cycle ( +- 0.19% ) + 3,236,613,433 instructions # 1.52 insn per cycle ( +- 0.19% ) - 667,874,319 branches # 944.734 M/sec ( +- 0.21% ) + 703,677,908 branches # 1024.219 M/sec ( +- 0.20% ) - 2,859,521 branch-misses # 0.43% of all branches ( +- 0.56% ) + 3,462,063 branch-misses # 0.49% of all branches ( +- 0.33% ) - 0.710738536 seconds time elapsed ( +- 0.32% ) + 0.691908533 seconds time elapsed ( +- 0.31% ) However, it still had more branches (~5% more), and worse branch miss percentage. The patch seems to work. It also does not bloat the kernel: text data bss dec hex filename 8199556 5026784 2924544 16150884 f67164 vmlinux 8199897 5026784 2924544 16151225 f672b9 vmlinux.patched However, a "conditional CLI/STI from r/m" insn could be better still. The patch is attached. [-- Attachment #2: z.diff --] [-- Type: text/x-patch, Size: 3076 bytes --] diff -urp linux-4.7.1.org/arch/x86/include/asm/irqflags.h linux-4.7.1.obj/arch/x86/include/asm/irqflags.h --- linux-4.7.1.org/arch/x86/include/asm/irqflags.h 2016-08-16 09:35:15.000000000 +0200 +++ linux-4.7.1.obj/arch/x86/include/asm/irqflags.h 2016-08-18 10:01:09.514219644 +0200 @@ -44,6 +44,16 @@ static inline void native_irq_enable(voi asm volatile("sti": : :"memory"); } +/* + * This produces a "test; jz; sti" insn sequence. + * It's faster than "push reg; popf". If sti is skipped, it's much faster. 
+ */
+static inline void native_cond_irq_enable(unsigned long flags)
+{
+	if (flags & X86_EFLAGS_IF)
+		native_irq_enable();
+}
+
 static inline void native_safe_halt(void)
 {
 	asm volatile("sti; hlt": : :"memory");
@@ -69,7 +79,8 @@ static inline notrace unsigned long arch
 
 static inline notrace void arch_local_irq_restore(unsigned long flags)
 {
-	native_restore_fl(flags);
+	if (flags & X86_EFLAGS_IF)
+		native_irq_enable();
 }
 
 static inline notrace void arch_local_irq_disable(void)
diff -urp linux-4.7.1.org/arch/x86/kernel/paravirt.c linux-4.7.1.obj/arch/x86/kernel/paravirt.c
--- linux-4.7.1.org/arch/x86/kernel/paravirt.c	2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/kernel/paravirt.c	2016-08-18 10:01:24.797155109 +0200
@@ -313,7 +313,7 @@ struct pv_time_ops pv_time_ops = {
 __visible struct pv_irq_ops pv_irq_ops = {
 	.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
-	.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
+	.restore_fl = __PV_IS_CALLEE_SAVE(native_cond_irq_enable),
 	.irq_disable = __PV_IS_CALLEE_SAVE(native_irq_disable),
 	.irq_enable = __PV_IS_CALLEE_SAVE(native_irq_enable),
 	.safe_halt = native_safe_halt,
diff -urp linux-4.7.1.org/arch/x86/kernel/paravirt_patch_32.c linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_32.c
--- linux-4.7.1.org/arch/x86/kernel/paravirt_patch_32.c	2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_32.c	2016-08-18 04:43:19.388102755 +0200
@@ -2,7 +2,7 @@
 DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
 DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
-DEF_NATIVE(pv_irq_ops, restore_fl, "push %eax; popf");
+DEF_NATIVE(pv_irq_ops, restore_fl, "testb $((1<<9)>>8),%ah; jz 1f; sti; 1: ;");
 DEF_NATIVE(pv_irq_ops, save_fl, "pushf; pop %eax");
 DEF_NATIVE(pv_cpu_ops, iret, "iret");
 DEF_NATIVE(pv_mmu_ops, read_cr2, "mov %cr2, %eax");
diff -urp linux-4.7.1.org/arch/x86/kernel/paravirt_patch_64.c linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_64.c
--- linux-4.7.1.org/arch/x86/kernel/paravirt_patch_64.c
2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_64.c	2016-08-18 04:42:56.364545509 +0200
@@ -4,7 +4,7 @@
 DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
 DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
-DEF_NATIVE(pv_irq_ops, restore_fl, "pushq %rdi; popfq");
+DEF_NATIVE(pv_irq_ops, restore_fl, "testw $(1<<9),%di; jz 1f; sti; 1: ;");
 DEF_NATIVE(pv_irq_ops, save_fl, "pushfq; popq %rax");
 DEF_NATIVE(pv_mmu_ops, read_cr2, "movq %cr2, %rax");
 DEF_NATIVE(pv_mmu_ops, read_cr3, "movq %cr3, %rax");
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
From: Paolo Bonzini @ 2016-08-18 17:22 UTC (permalink / raw)
To: Denys Vlasenko, Linus Torvalds
Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
the arch/x86 maintainers, Ingo Molnar, LKML, Adrian Hunter

On 18/08/2016 14:18, Denys Vlasenko wrote:
>
> -    2,588,839,023 branches       #  977.555 M/sec           ( +- 0.02% )
> +    2,599,319,615 branches       # 1046.786 M/sec           ( +- 0.04% )
> -        3,620,273 branch-misses  #  0.14% of all branches   ( +- 0.67% )
> +        3,577,771 branch-misses  #  0.14% of all branches   ( +- 0.69% )
> -      2.648799072 seconds time elapsed                      ( +- 0.24% )
> +      2.487452268 seconds time elapsed                      ( +- 0.31% )
>
> Good, we run more insns/cycle, as expected. However, a bit more branches.

Can you see where the missed branches are? Assuming the branch misses
are the case where IF=0, perhaps there are a few places that can be
changed to spin_lock/unlock_irq or local_irq_disable/enable.

Paolo
* Re: RFC: Petition Intel/AMD to add POPF_IF insn
From: Linus Torvalds @ 2016-08-17 19:29 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Denys Vlasenko, Sara Sharon, Dan Williams, Christian König,
Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
the arch/x86 maintainers, Ingo Molnar, LKML, Adrian Hunter

On Wed, Aug 17, 2016 at 12:13 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> It wouldn't surprise me if that were easier said than done. popf
> potentially changes AC, and AC affects address translation.

Right. A lot of magical flags live in eflags, and popf *may* change
them. But that's why it's slow today: the popf microcode probably
unconditionally serializes things exactly because "things may change".

And the interrupt flag actually *is* pretty special too, in some
respects more so than AC (because it needs to serialize with any pending
interrupts).

And the microcode probably already has code that says "let's handle the
easy case quickly", where the easy case is "only the arithmetic flags
changed". The arithmetic flags are special anyway, because they aren't
actually physically in the same register any more, but are separately
tracked and renamed etc.

But I'm sure Intel already treats IF specially in microcode, because IF
is really special in other ways (VIF handling in vm86 mode, but also
modern virtualization).

Yes, Intel people tend to be afraid of the microcode stuff, and
generally don't touch it. But the good news about popf is that it isn't
a serializing instruction, so it really *could* be optimized pretty
aggressively. And it does have to check for pending interrupts (and
*clearing* IF in particular needs to make sure that there isn't some
pending interrupt that the CPU is about to react to). So it's not
trivial.
But the "enable interrupts" case for popf is actually easier for hardware than the disable case from a serializing standpoint, and I suspect the ucode doesn't take advantage of that right now, and it's all just fairly unoptimized microcode. Linus ^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2016-08-31 11:12 UTC | newest]

Thread overview: 25+ messages
2016-08-17 17:20 RFC: Petition Intel/AMD to add POPF_IF insn  Denys Vlasenko
2016-08-17 17:30 ` Christian König
2016-08-17 18:34   ` Denys Vlasenko
2016-08-17 17:32 ` Linus Torvalds
2016-08-17 18:41   ` Denys Vlasenko
[not found]       ` <CA+55aFwmxwQBkyDjombS4cy1q=a_buhmDjDnaa8rdC8ZDaDYEA@mail.gmail.com>
2016-08-17 19:13     ` Andy Lutomirski
2016-08-17 19:26       ` Denys Vlasenko
2016-08-17 19:32         ` Linus Torvalds
2016-08-17 19:35           ` Denys Vlasenko
2016-08-17 19:54             ` Linus Torvalds
2016-08-17 19:37           ` Linus Torvalds
2016-08-17 21:26             ` Linus Torvalds
2016-08-17 21:35               ` Linus Torvalds
2016-08-17 21:43                 ` Andy Lutomirski
2016-08-17 21:48                   ` Linus Torvalds
2016-08-18 13:26                 ` Denys Vlasenko
2016-08-18 17:24                   ` Linus Torvalds
2016-08-18 17:47                     ` Denys Vlasenko
2016-08-18 17:49                       ` Denys Vlasenko
2016-08-19 10:54                     ` Paolo Bonzini
2016-08-31 11:12                       ` Denys Vlasenko
2016-08-18  9:21                 ` Denys Vlasenko
2016-08-18 12:18                   ` Denys Vlasenko
2016-08-18 17:22                     ` Paolo Bonzini
2016-08-17 19:29       ` Linus Torvalds