* RFC: Petition Intel/AMD to add POPF_IF insn
@ 2016-08-17 17:20 Denys Vlasenko
  2016-08-17 17:30 ` Christian König
  2016-08-17 17:32 ` Linus Torvalds
  0 siblings, 2 replies; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-17 17:20 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski, Dan Williams, Alex Deucher,
	Rafael J. Wysocki, Christian König, Vinod Koul,
	Johannes Berg, Sara Sharon, Adrian Hunter, Linus Torvalds
  Cc: x86, LKML

Last year, a proposal was floated to avoid costly POPF.
In particular, each spin_unlock_irqrestore() does it,
a rather common operation.

https://lkml.org/lkml/2015/4/21/290
	[RFC PATCH] x86/asm/irq: Don't use POPF but STI

* Andy Lutomirski <luto@kernel.org> wrote:
> > Another different approach would be to formally state that
> > pv_irq_ops.save_fl() needs to return all the flags, which would
> > make local_irq_save() safe to use in this circumstance, but that
> > makes a hotpath longer for the sake of a single boot time check.
>
> ...which reminds me:
>
> Why does native_restore_fl restore anything other than IF?  A branch
> and sti should be considerably faster than popf.


Ingo agreed:
====
Yes, this has come up in the past, something like the patch below?

Totally untested and not signed off yet: because we'd first have to
make sure (via irq flags debugging) that it's not used in reverse, to
re-disable interrupts:
	local_irq_save(flags);
	local_irq_enable();
	...
	local_irq_restore(flags); /* effective local_irq_disable() */
I don't think we have many (any?) such patterns left, but it has to be
checked first. If we have such cases then we'll have to use different
primitives there.
====

Linus replied:
=====
"popf" is fast for the "no changes to IF" case, and is a smaller
instruction anyway.
====


This basically shot down the proposal.

But in my measurements POPF is not fast even in the case where the restored
flags are not changed at all:

                 mov     $200*1000*1000, %eax
                 pushf
                 pop     %rbx
                 .balign 64
loop:           push    %rbx
                 popf
                 dec     %eax
                 jnz     loop

# perf stat -r20 ./popf_1g
      4,929,012,093      cycles                    #    3.412 GHz                      ( +-  0.02% )
        835,721,371      instructions              #    0.17  insn per cycle           ( +-  0.02% )
        1.446185359 seconds time elapsed                                          ( +-  0.46% )

If I replace POPF with a pop into an unused register, I get this:

loop:           push    %rbx
                 pop     %rcx
                 dec     %eax
                 jnz     loop

        209,247,645      cycles                    #    3.209 GHz                      ( +-  0.11% )
        801,898,016      instructions              #    3.83  insn per cycle           ( +-  0.00% )
        0.066210725 seconds time elapsed                                          ( +-  0.59% )


IOW, POPF takes at least 6 cycles.
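
(For reference, a self-contained C version of the measurement loop above -
a sketch only; the original "popf_1g" above was a bare assembly program.
Build with e.g. "gcc -O2" and run under "perf stat -r20" as shown.)

/* Sketch of the push+popf timing loop from above. */
int main(void)
{
	unsigned long flags;
	unsigned long n = 200UL * 1000 * 1000;

	/* Save current EFLAGS once; POPF then restores the same value. */
	asm volatile("pushf; pop %0" : "=r" (flags));

	asm volatile("1:\n\t"
		     "push %1\n\t"
		     "popf\n\t"
		     "dec %0\n\t"
		     "jnz 1b"
		     : "+r" (n)
		     : "r" (flags)
		     : "memory", "cc");
	return 0;
}

(Replacing the "popf" line with a pop into an otherwise unused register,
and clobbering that register, gives the baseline variant measured above.)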


Linus does have a point that a "test+branch+STI" may end up not being a clear win, because of the branch.

But the need to restore the IF flag exists; it is necessary not only for Linux but for any OS
running on x86: they all have some sort of spinlock.

The addition of a POPF instruction variant which looks only at the IF bit and changes
only that bit in EFLAGS may be a good idea, for all OSes.

I propose that we ask Intel / AMD to do that.

Maybe by the "ignored prefix" trick which was used when LZCNT insn was introduced
as REPZ-prefixed BSR?
Currently, REPZ POPF (f3 9d) insn does execute. Redefine this opcode as
POPF_IF. Then the same kernel will work on old and new CPUs.
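
(For illustration only: POPF_IF does not exist, and the encoding below is
just the hypothetical one proposed here. A kernel could emit it from the
existing restore helper with a .byte sequence; current CPUs ignore the F3
prefix and execute a plain POPF, so behavior today would be unchanged.)

/* Hypothetical sketch: hand-emitted REPZ POPF (f3 9d). */
static inline void native_restore_fl_repz(unsigned long flags)
{
	asm volatile("push %0\n\t"
		     ".byte 0xf3, 0x9d"		/* REPZ POPF */
		     : /* no output */
		     : "g" (flags)
		     : "memory", "cc");
}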

CC'ing some @intel and @amd emails...


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 17:20 RFC: Petition Intel/AMD to add POPF_IF insn Denys Vlasenko
@ 2016-08-17 17:30 ` Christian König
  2016-08-17 18:34   ` Denys Vlasenko
  2016-08-17 17:32 ` Linus Torvalds
  1 sibling, 1 reply; 25+ messages in thread
From: Christian König @ 2016-08-17 17:30 UTC (permalink / raw)
  To: Denys Vlasenko, Ingo Molnar, Andy Lutomirski, Dan Williams,
	Alex Deucher, Rafael J. Wysocki, Vinod Koul, Johannes Berg,
	Sara Sharon, Adrian Hunter, Linus Torvalds
  Cc: x86, LKML

Well first of all you actually CCed AMDs graphics developers. So Alex 
and I can't say much about CPU optimizations.

Additional to that a comment below.

On 17.08.2016 19:20, Denys Vlasenko wrote:
> Last year, a proposal was floated to avoid costly POPF.
> In particular, each spin_unlock_irqrestore() does it,
> a rather common operation.
>
> https://lkml.org/lkml/2015/4/21/290
>     [RFC PATCH] x86/asm/irq: Don't use POPF but STI
>
> * Andy Lutomirski <luto@kernel.org> wrote:
>> > Another different approach would be to formally state that
>> > pv_irq_ops.save_fl() needs to return all the flags, which would
>> > make local_irq_save() safe to use in this circumstance, but that
>> > makes a hotpath longer for the sake of a single boot time check.
>>
>> ...which reminds me:
>>
>> Why does native_restore_fl restore anything other than IF?  A branch
>> and sti should be considerably faster than popf.
>
>
> Ingo agreed:
> ====
> Yes, this has come up in the past, something like the patch below?
>
> Totally untested and not signed off yet: because we'd first have to
> make sure (via irq flags debugging) that it's not used in reverse, to
> re-disable interrupts:
>     local_irq_save(flags);
>     local_irq_enable();
>     ...
>     local_irq_restore(flags); /* effective local_irq_disable() */
> I don't think we have many (any?) such patterns left, but it has to be
> checked first. If we have such cases then we'll have to use different
> primitives there.
> ====
>
> Linus replied:
> =====
> "popf" is fast for the "no changes to IF" case, and is a smaller
> instruction anyway.
> ====
>
>
> This basically shot down the proposal.
>
> But in my measurements POPF is not fast even in the case where the restored
> flags are not changed at all:
>
>                 mov     $200*1000*1000, %eax
>                 pushf
>                 pop     %rbx
>                 .balign 64
> loop:           push    %rbx
>                 popf
>                 dec     %eax
>                 jnz     loop
>
> # perf stat -r20 ./popf_1g
>      4,929,012,093      cycles                    #    3.412 GHz                      ( +-  0.02% )
>        835,721,371      instructions              #    0.17  insn per cycle           ( +-  0.02% )
>        1.446185359 seconds time elapsed                                          ( +-  0.46% )
>
> If I replace POPF with a pop into an unused register, I get this:

You are comparing apples and bananas here.

The microcode path in the CPUs will probably throw away the read if
you don't actually use the register value.

Regards,
Christian.

>
> loop:           push    %rbx
>                 pop     %rcx
>                 dec     %eax
>                 jnz     loop
>
>        209,247,645      cycles                    #    3.209 GHz                      ( +-  0.11% )
>        801,898,016      instructions              #    3.83  insn per cycle           ( +-  0.00% )
>        0.066210725 seconds time elapsed                                          ( +-  0.59% )
>
>
> IOW, POPF takes at least 6 cycles.
>
>
> Linus does have a point that a "test+branch+STI" may end up not being a
> clear win, because of the branch.
>
> But the need to restore the IF flag exists; it is necessary not only for
> Linux but for any OS running on x86: they all have some sort of spinlock.
>
> The addition of a POPF instruction variant which looks only at the IF bit
> and changes only that bit in EFLAGS may be a good idea, for all OSes.
>
> I propose that we ask Intel / AMD to do that.
>
> Maybe by the "ignored prefix" trick which was used when LZCNT insn was 
> introduced
> as REPZ-prefixed BSR?
> Currently, REPZ POPF (f3 9d) insn does execute. Redefine this opcode as
> POPF_IF. Then the same kernel will work on old and new CPUs.
>
> CC'ing some @intel and @amd emails...


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 17:20 RFC: Petition Intel/AMD to add POPF_IF insn Denys Vlasenko
  2016-08-17 17:30 ` Christian König
@ 2016-08-17 17:32 ` Linus Torvalds
  2016-08-17 18:41   ` Denys Vlasenko
  1 sibling, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 17:32 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Ingo Molnar, Andy Lutomirski, Dan Williams, Alex Deucher,
	Rafael J. Wysocki, Christian König, Vinod Koul,
	Johannes Berg, Sara Sharon, Adrian Hunter,
	the arch/x86 maintainers, LKML

On Wed, Aug 17, 2016 at 10:20 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> Last year, a proposal was floated to avoid costly POPF.
> In particular, each spin_unlock_irqrestore() does it,
> a rather common operation.

Quite frankly, it takes so long for hw features to arrive that I don't think
it's worth it for something like this.

I'd rather see people play around with just the conditional "sti"
model instead and see how well that works. *Particularly* if we inline
that part of spin_unlock_irqrestore(), the branch prediction logic
may work very well. And even if not, it might work better than popf.
But somebody would need to run the tests on an actual load
(microbenchmarks wouldn't work, because they wouldn't show the
different branch prediction cases).

power has been using "just do IF in software", which is also an
alternative, but it's much more complicated (you take the interrupt,
notice that the sw disable flag is set, don't actually run the irq but
set a flag and disable irqs for _real_ and return. Then you replay the
irq when enabling the software irq flag again)
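
(A conceptual sketch of that software-IF scheme, with made-up names rather
than the real powerpc code, just to make the control flow concrete:)

/* Hypothetical illustration of "soft" interrupt disabling. */
static int soft_irqs_disabled;	/* per-CPU state in a real kernel */
static int irq_pending;

static void hard_irq_disable(void) { /* the real CLI equivalent */ }
static void handle_irq(void)       { /* run the actual handler */ }

static void soft_local_irq_disable(void)
{
	soft_irqs_disabled = 1;		/* just remember the intent, no CLI */
}

static void soft_local_irq_enable(void)
{
	soft_irqs_disabled = 0;
	if (irq_pending) {		/* replay what arrived meanwhile */
		irq_pending = 0;
		handle_irq();		/* (and re-enable irqs for real) */
	}
}

/* Low-level interrupt entry: defer if software-disabled. */
static void irq_entry(void)
{
	if (soft_irqs_disabled) {
		irq_pending = 1;	/* note it, hard-disable, return */
		hard_irq_disable();
		return;
	}
	handle_irq();
}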

               Linus


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 17:30 ` Christian König
@ 2016-08-17 18:34   ` Denys Vlasenko
  0 siblings, 0 replies; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-17 18:34 UTC (permalink / raw)
  To: Christian König, Ingo Molnar, Andy Lutomirski, Dan Williams,
	Alex Deucher, Rafael J. Wysocki, Vinod Koul, Johannes Berg,
	Sara Sharon, Adrian Hunter, Linus Torvalds
  Cc: x86, LKML

On 08/17/2016 07:30 PM, Christian König wrote:
>> But in my measurements POPF is not fast even in the case where restored
>> flags are not changed at all:
>>
>>                 mov     $200*1000*1000, %eax
>>                 pushf
>>                 pop     %rbx
>>                 .balign 64
>> loop:           push    %rbx
>>                 popf
>>                 dec     %eax
>>                 jnz     loop
>>
>> # perf stat -r20 ./popf_1g
>>      4,929,012,093      cycles                    #    3.412 GHz                      ( +-  0.02% )
>>        835,721,371      instructions              #    0.17  insn per cycle           ( +-  0.02% )
>>        1.446185359 seconds time elapsed                                          ( +-  0.46% )
>>
>> If I replace POPF with a pop into an unused register, I get this:
>
> You are comparing apples and bananas here.

Yes, I know. A pop into a register here is basically free.
I'd also add an STI and measure how long it takes to
enable interrupts, but unfortunately STI throws a #GP
at CPL 3.
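
(A later message in this thread works around this by raising the I/O
privilege level first. A minimal sketch, assuming a Linux x86 host and
root privileges:)

/* With iopl(3), STI/CLI no longer #GP at CPL 3 (needs CAP_SYS_RAWIO). */
#include <sys/io.h>
#include <err.h>

int main(void)
{
	if (iopl(3) != 0)
		err(1, "iopl");
	asm volatile("cli" ::: "memory");
	asm volatile("sti" ::: "memory");
	return 0;
}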


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 17:32 ` Linus Torvalds
@ 2016-08-17 18:41   ` Denys Vlasenko
       [not found]     ` <CA+55aFwmxwQBkyDjombS4cy1q=a_buhmDjDnaa8rdC8ZDaDYEA@mail.gmail.com>
  0 siblings, 1 reply; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-17 18:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Andy Lutomirski, Dan Williams, Alex Deucher,
	Rafael J. Wysocki, Christian König, Vinod Koul,
	Johannes Berg, Sara Sharon, Adrian Hunter,
	the arch/x86 maintainers, LKML



On 08/17/2016 07:32 PM, Linus Torvalds wrote:
> On Wed, Aug 17, 2016 at 10:20 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>> Last year, a proposal was floated to avoid costly POPF.
>> In particular, each spin_unlock_irqrestore() does it,
>> a rather common operation.
>
> Quite frankly, it takes so long for hw features to arrive that I don't think
> it's worth it for something like this.

True, it will take some ~5 years for new hardware to become
widely used.

OTOH 5 years will inevitably pass. Just like now 32-bit Linux userspace
is commonly compiled for the "i686" instruction set, because
enough time has inevitably passed that you can safely assume
most people run at least a Pentium Pro-level CPU, with CMOV, CMPXCHG,
etc.

If the new instruction is implemented with the "REPZ POPF" encoding,
no recompilation or alternatives will be needed for the new kernel
to run on an old CPU. On an old CPU, the entire EFLAGS will be restored;
on a new one, only EFLAGS.IF.


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
       [not found]     ` <CA+55aFwmxwQBkyDjombS4cy1q=a_buhmDjDnaa8rdC8ZDaDYEA@mail.gmail.com>
@ 2016-08-17 19:13       ` Andy Lutomirski
  2016-08-17 19:26         ` Denys Vlasenko
  2016-08-17 19:29         ` Linus Torvalds
  0 siblings, 2 replies; 25+ messages in thread
From: Andy Lutomirski @ 2016-08-17 19:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Denys Vlasenko, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

On Wed, Aug 17, 2016 at 12:01 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Aug 17, 2016 11:41 AM, "Denys Vlasenko" <dvlasenk@redhat.com> wrote:
>>
>> OTOH 5 years will inevitably pass.
>
> Yes. But in five years, maybe we'll have a popf that is faster anyway.
>
> I'd actually prefer that in the end. The problem with popf right now seems
> to be mainly that it's effectively serializing and does stupid things in
> microcode. It doesn't have to be that way. It could actually do much better,
> but it hasn't been a high enough priority for Intel.
>

It wouldn't surprise me if that were easier said than done.  popf
potentially changes AC, and AC affects address translation.  popf also
potentially changes IOPL, and I don't know whether Intel chips track
IOPL in a way that lets them find all the dependent instructions
without serializing.  But maybe their pipeline is fancy enough.

Personally, I still expect that a simple branch-and-sti is the way to
go.  It wouldn't shock me if even a mispredicted branch and STI is
faster than POPF.


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 19:13       ` Andy Lutomirski
@ 2016-08-17 19:26         ` Denys Vlasenko
  2016-08-17 19:32           ` Linus Torvalds
  2016-08-17 19:29         ` Linus Torvalds
  1 sibling, 1 reply; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-17 19:26 UTC (permalink / raw)
  To: Andy Lutomirski, Linus Torvalds
  Cc: Sara Sharon, Dan Williams, Christian König, Vinod Koul,
	Alex Deucher, Johannes Berg, Rafael J. Wysocki, Andy Lutomirski,
	the arch/x86 maintainers, Ingo Molnar, LKML, Adrian Hunter

On 08/17/2016 09:13 PM, Andy Lutomirski wrote:
> On Wed, Aug 17, 2016 at 12:01 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Aug 17, 2016 11:41 AM, "Denys Vlasenko" <dvlasenk@redhat.com> wrote:
>>>
>>> OTOH 5 years will inevitably pass.
>>
>> Yes. But in five years, maybe we'll have a popf that is faster anyway.
>>
>> I'd actually prefer that in the end. The problem with popf right now seems
>> to be mainly that it's effectively serializing and does stupid things in
>> microcode. It doesn't have to be that way. It could actually do much better,
>> but it hasn't been a high enough priority for Intel.
>>
>
> It wouldn't surprise me if that were easier said than done.  popf
> potentially changes AC, and AC affects address translation.  popf also
> potentially changes IOPL, and I don't know whether Intel chips track
> IOPL in a way that lets them find all the dependent instructions
> without serializing.  But maybe their pipeline is fancy enough.

Exactly. And more:

POPF potentially changes TF (and it works even in CPL3).
POPF changes DF - must serialize versus string insns.
POPF changes NT - must serialize versus IRET insns.
POPF changes VIF, from a different bit in the popped value,
and under rather complex conditions.

Intel's documentation has pseudo-code for each instruction.
For POPF, that pseudo-code is two pages long...


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 19:13       ` Andy Lutomirski
  2016-08-17 19:26         ` Denys Vlasenko
@ 2016-08-17 19:29         ` Linus Torvalds
  1 sibling, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 19:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

On Wed, Aug 17, 2016 at 12:13 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> It wouldn't surprise me if that were easier said than done.  popf
> potentially changes AC, and AC affects address translation.

Right. A lot of magical flags are in eflags, and popf *may* change them.

But that's why it's slow today. The popf microcode probably
unconditionally serializes things exactly because "things may change".
And the interrupt flag actually *is* pretty special too, in some
respects more so than AC (because it needs to serialize with any
pending interrupts).

And the microcode probably already has code that says "let's handle
the easy case quickly", where the easy case is "only the arithmetic
flags changed".

The arithmetic flags are special anyway, because they aren't actually
physically in the same register any more, but are separately tracked
and renamed etc.

But I'm sure Intel already treats IF specially in microcode, because
IF is really special in other ways (VIF handling in vm86 mode, but
also modern virtualization).

Yes, intel people tend to be afraid of the microcode stuff, and
generally not to touch it. But the good news about popf is that it isn't
a serializing instruction, so it really *could* be optimized pretty
aggressively. And it does have to check for pending interrupts (and
*clearing* IF in particular needs to make sure that there isn't some
pending interrupt that the CPU is about to react to).

So it's not trivial. But the "enable interrupts" case for popf is
actually easier for hardware than the disable case from a serializing
standpoint, and I suspect the ucode doesn't take advantage of that
right now, and it's all just fairly unoptimized microcode.

                Linus


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 19:26         ` Denys Vlasenko
@ 2016-08-17 19:32           ` Linus Torvalds
  2016-08-17 19:35             ` Denys Vlasenko
  2016-08-17 19:37             ` Linus Torvalds
  0 siblings, 2 replies; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 19:32 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

On Wed, Aug 17, 2016 at 12:26 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>
> Exactly. And more:

All of which is ENTIRELY IRRELEVANT.

The 2-page pseudo-code is about behavior. It's trivial to
short-circuit. There are obvious optimizations, which involve just
making the rare cases go slow and have a trivial "if those bits didn't
change, don't go through the expense of testing them".

The problem is that IF is almost certainly right now in that rare case
mask, and so popf is stupidly slow for IF.

That in no way means that it *has* to be slow for IF, and more
importantly, the other flags don't even come into play. Forget about
them.

               Linus


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 19:32           ` Linus Torvalds
@ 2016-08-17 19:35             ` Denys Vlasenko
  2016-08-17 19:54               ` Linus Torvalds
  2016-08-17 19:37             ` Linus Torvalds
  1 sibling, 1 reply; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-17 19:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter



On 08/17/2016 09:32 PM, Linus Torvalds wrote:
> On Wed, Aug 17, 2016 at 12:26 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>
>> Exactly. And more:
>
> All of which is ENTIRELY IRRELEVANT.
>
> The 2-page pseudo-code is about behavior. It's trivial to
> short-circuit. There are obvious optimizations, which involve just
> making the rare cases go slow and have a trivial "if those bits didn't
> change, don't go through the expense of testing them".
>
> The problem is that IF is almost certainly right now in that rare case
> mask, and so popf is stupidly slow for IF.

I ran the test, see the first email in the thread.

Experimentally, POPF is stupidly slow _always_. 6 cycles
even if none of the "scary" flags are changed.
Either:
* its microcode has no rare case mask
or
* its microcode is so slow that even the fast path is slow,
  and the slow path is worse


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 19:32           ` Linus Torvalds
  2016-08-17 19:35             ` Denys Vlasenko
@ 2016-08-17 19:37             ` Linus Torvalds
  2016-08-17 21:26               ` Linus Torvalds
  1 sibling, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 19:37 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

On Wed, Aug 17, 2016 at 12:32 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The 2-page pseudo-code is about behavior. It's trivial to
> short-circuit. There are obvious optimizations, which involve just
> making the rare cases go slow and have a trivial "if those bits didn't
> change, don't go through the expense of testing them".

That said, the first thing we should test is to just see how the
"let's just optimize this in software" thing works.

Replace the "popf" with "if (val & X86_EFLAGS_IF) local_irq_enable();"
and see how that works out. Play with inlining it or not, and see if
the branch predictor matters.

We may simply not need popf optimizations.

              Linus


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 19:35             ` Denys Vlasenko
@ 2016-08-17 19:54               ` Linus Torvalds
  0 siblings, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 19:54 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

On Wed, Aug 17, 2016 at 12:35 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>
> Experimentally, POPF is stupidly slow _always_. 6 cycles
> even if none of the "scary" flags are changed.

6 cycles is nothing.

That's basically the overhead of "oops, I need to use the microcode sequencer".

One issue is that the intel decoders (AMD too, for that matter) can
only generate a fairly small set of uops for any instruction. Some
instructions are really trivial to decode (popf definitely falls under
that heading), but are more than just a couple of uops, so you end up
having to use the uop sequencer logic.

According to Agner Fog's tables, there are one or two
micro-architectures that actually do the simple "popf" case with a
single-cycle throughput, but that's the very unusual case.

You can't even fit the "pop a value, see if only the arithmetic flags
changed, trap to microcode otherwise" logic into the three or four uops that
the "complex decoder" can generate directly.

And that "fall back to the uop sequencer engine" tends to just always
cause several cycles regardless. So yes, microcode tends to be slow
even for what would otherwise be trivial operations. You'd think Intel
could do as well as they do for the L0 uop cache, but afaik they
don't.

Anyway, six cycles is fast. I'd *love* for popf to actually be just 6
cycles when IF changes.  It's much much worse iirc (although honestly,
I haven't timed it in years - it's much easier to time just the
arithmetic flag changes).

It used to be more like a hundred cycles on Prescott.

                  Linus


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 19:37             ` Linus Torvalds
@ 2016-08-17 21:26               ` Linus Torvalds
  2016-08-17 21:35                 ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 21:26 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

[-- Attachment #1: Type: text/plain, Size: 1414 bytes --]

On Wed, Aug 17, 2016 at 12:37 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Replace the "popf" with "if (val & X86_EFLAGS_IF) local_irq_enable();"
> and see how that works out. Play with inlining it or not, and see if
> the branch predictor matters.

.. actually, thinking a bit more about it, I really don't think the
branch predictor will even matter.

We sure as hell shouldn't have nested irq-safe interrupts in paths
that matter from a performance angle, so the normal case for
spin_unlock_irqrestore() should be to enable interrupts again.

And if interrupts are disabled because the caller is actually in
interrupt context, I don't think the branch prediction is going to
matter, compared to the irq overhead.

So test this trivial patch. It's ENTIRELY UNTESTED. It may be complete
crap and not even compile. But I did test it on
kernel/locking/spinlock.c, and the generated code is beautiful:

  _raw_spin_unlock_irqrestore:
        testl   $512, %esi      #, flags
        movb    $0, (%rdi)      #, MEM[(volatile __u8 *)lock_2(D)]
        je      .L2
        sti
  .L2:
        ret

so maybe the silly popf has always just been stupid.

Of course, if somebody uses native_restore_fl() to actually *disable*
interrupts (when they weren't already disabled), then this untested
patch will just not work. But why would you do something so stupid?
Famous last words...

                 Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1127 bytes --]

 arch/x86/include/asm/irqflags.h | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index b77f5edb03b0..76c4ebfab0be 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -8,6 +8,16 @@
  * Interrupt control:
  */
 
+static inline void native_irq_disable(void)
+{
+	asm volatile("cli": : :"memory");
+}
+
+static inline void native_irq_enable(void)
+{
+	asm volatile("sti": : :"memory");
+}
+
 static inline unsigned long native_save_fl(void)
 {
 	unsigned long flags;
@@ -28,20 +38,8 @@ static inline unsigned long native_save_fl(void)
 
 static inline void native_restore_fl(unsigned long flags)
 {
-	asm volatile("push %0 ; popf"
-		     : /* no output */
-		     :"g" (flags)
-		     :"memory", "cc");
-}
-
-static inline void native_irq_disable(void)
-{
-	asm volatile("cli": : :"memory");
-}
-
-static inline void native_irq_enable(void)
-{
-	asm volatile("sti": : :"memory");
+	if (flags & X86_EFLAGS_IF)
+		native_irq_enable();
 }
 
 static inline void native_safe_halt(void)


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 21:26               ` Linus Torvalds
@ 2016-08-17 21:35                 ` Linus Torvalds
  2016-08-17 21:43                   ` Andy Lutomirski
  2016-08-18  9:21                   ` Denys Vlasenko
  0 siblings, 2 replies; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 21:35 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

On Wed, Aug 17, 2016 at 2:26 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Of course, if somebody uses native_restore_fl() to actually *disable*
> interrupts (when they weren't already disabled), then this untested
> patch will just not work. But why would you do something so stupid?
> Famous last words...

Looking around, the vsmp code actually uses "native_restore_fl()" to
not just set the interrupt flag, but AC as well.

And the PV spinlock case has that "push;popf" sequence encoded in an alternate.

So that trivial patch may (or may not - still not tested) work for
some quick testing, but needs more effort for any *real* use.

                  Linus


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 21:35                 ` Linus Torvalds
@ 2016-08-17 21:43                   ` Andy Lutomirski
  2016-08-17 21:48                     ` Linus Torvalds
  2016-08-18  9:21                   ` Denys Vlasenko
  1 sibling, 1 reply; 25+ messages in thread
From: Andy Lutomirski @ 2016-08-17 21:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Denys Vlasenko, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

On Wed, Aug 17, 2016 at 2:35 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Aug 17, 2016 at 2:26 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Of course, if somebody uses native_restore_fl() to actually *disable*
>> interrupts (when they weren't already disabled), then this untested
>> patch will just not work. But why would you do something so stupid?
>> Famous last words...
>
> Looking around, the vsmp code actually uses "native_restore_fl()" to
> not just set the interrupt flag, but AC as well.
>
> And the PV spinlock case has that "push;popf" sequence encoded in an alternate.
>
> So that trivial patch may (or may not - still not tested) work for
> some quick testing, but needs more effort for any *real* use.
>

It shouldn't be *too* bad, since xen_restore_fl only affects "IF".
And even if native_restore_fl needs to be able to turn IRQs off as
well as on, we can just do:

if (likely(flags & X86_EFLAGS_IF))
  sti();
else
  cli();

at some cost to code size but hopefully little to no runtime cost for
the sane cases.


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 21:43                   ` Andy Lutomirski
@ 2016-08-17 21:48                     ` Linus Torvalds
  2016-08-18 13:26                       ` Denys Vlasenko
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2016-08-17 21:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

On Wed, Aug 17, 2016 at 2:43 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> It shouldn't be *too* bad, since xen_restore_fl only affects "IF".
> And even if native_restore_fl needs to be able to turn IRQs off as
> well as on, we can just do:
>
> if (likely(flags & X86_EFLAGS_IF))
>   sti();
> else
>   cli();
>
> at some cost to code size but hopefully little to no runtime cost for
> the sane cases.

No, that would be horrible for the case where we had an irq spinlock
in an irq-region. Then we'd have a pointless "cli" there.

So I'd rather just make sure that only the spinlock code actually uses
native_restore_fl().

And yes, the patch works at least minimally, since I'm writing this
with a kernel compiled with that.

Of course, somebody really should do timings on modern CPU's (in cpl0,
comparing native_fl() that enables interrupts with a popf)

    Linus


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 21:35                 ` Linus Torvalds
  2016-08-17 21:43                   ` Andy Lutomirski
@ 2016-08-18  9:21                   ` Denys Vlasenko
  2016-08-18 12:18                     ` Denys Vlasenko
  1 sibling, 1 reply; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-18  9:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

On 08/17/2016 11:35 PM, Linus Torvalds wrote:
> On Wed, Aug 17, 2016 at 2:26 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Of course, if somebody uses native_restore_fl() to actually *disable*
>> interrupts (when they weren't already disabled), then this untested
>> patch will just not work. But why would you do something so stupid?
>> Famous last words...
>
> Looking around, the vsmp code actually uses "native_restore_fl()" to
> not just set the interrupt flag, but AC as well.
>
> And the PV spinlock case has that "push;popf" sequence encoded in an alternate.
>
> So that trivial patch may (or may not - still not tested) work for
> some quick testing, but needs more effort for any *real* use.

I'm going to test the attached patch.

The PV and debug maze is daunting... I discovered that Fedora kernels
are compiled with a set of .config options which results in a
spin_unlock_irqrestore which looks like this:

ffffffff818d0f40 <_raw_spin_unlock_irqrestore>:
ffffffff818d0f40:       e8 1b 31 00 00          callq  ffffffff818d4060 <__fentry__>    ffffffff818d0f41: R_X86_64_PC32 __fentry__-0x4
ffffffff818d0f45:       55                      push   %rbp
ffffffff818d0f46:       48 89 e5                mov    %rsp,%rbp
ffffffff818d0f49:       41 54                   push   %r12
ffffffff818d0f4b:       53                      push   %rbx
ffffffff818d0f4c:       48 8b 55 08             mov    0x8(%rbp),%rdx
ffffffff818d0f50:       49 89 fc                mov    %rdi,%r12
ffffffff818d0f53:       48 89 f3                mov    %rsi,%rbx
ffffffff818d0f56:       48 83 c7 18             add    $0x18,%rdi
ffffffff818d0f5a:       be 01 00 00 00          mov    $0x1,%esi
ffffffff818d0f5f:       e8 8c b8 83 ff          callq  ffffffff8110c7f0 <lock_release>  ffffffff818d0f60: R_X86_64_PC32 lock_release-0x4
ffffffff818d0f64:       4c 89 e7                mov    %r12,%rdi
ffffffff818d0f67:       e8 d4 fb 83 ff          callq  ffffffff81110b40 <do_raw_spin_unlock>    ffffffff818d0f68: R_X86_64_PC32 do_raw_spin_unlock-0x4
ffffffff818d0f6c:       f6 c7 02                test   $0x2,%bh
ffffffff818d0f6f:       74 1b                   je     ffffffff818d0f8c <_raw_spin_unlock_irqrestore+0x4c>
ffffffff818d0f71:       e8 aa 9d 83 ff          callq  ffffffff8110ad20 <trace_hardirqs_on>     ffffffff818d0f72: R_X86_64_PC32 trace_hardirqs_on-0x4
ffffffff818d0f76:       48 89 df                mov    %rbx,%rdi
ffffffff818d0f79:       ff 14 25 48 32 e2 81    callq  *0xffffffff81e23248      ffffffff818d0f7c: R_X86_64_32S  pv_irq_ops+0x8
ffffffff818d0f80:       65 ff 0d c9 c4 73 7e    decl   %gs:0x7e73c4c9(%rip)        # d450 <__preempt_count>     ffffffff818d0f83: R_X86_64_PC32 __preempt_count-0x4
ffffffff818d0f87:       5b                      pop    %rbx
ffffffff818d0f88:       41 5c                   pop    %r12
ffffffff818d0f8a:       5d                      pop    %rbp
ffffffff818d0f8b:       c3                      retq
ffffffff818d0f8c:       48 89 df                mov    %rbx,%rdi
ffffffff818d0f8f:       ff 14 25 48 32 e2 81    callq  *0xffffffff81e23248      ffffffff818d0f92: R_X86_64_32S  pv_irq_ops+0x8
ffffffff818d0f96:       e8 35 5b 83 ff          callq  ffffffff81106ad0 <trace_hardirqs_off>    ffffffff818d0f97: R_X86_64_PC32 trace_hardirqs_off-0x4
ffffffff818d0f9b:       eb e3                   jmp    ffffffff818d0f80 <_raw_spin_unlock_irqrestore+0x40>
ffffffff818d0f9d:       0f 1f 00                nopl   (%rax)

Gawd... really Fedora? All this is needed?

Testing with _this_ is not going to show any differences; I need to weed
all the cruft out and test a fast, non-debug .config.

Changed it like this:

+# CONFIG_UPROBES is not set
+# CONFIG_SCHED_OMIT_FRAME_POINTER is not set
+# CONFIG_HYPERVISOR_GUEST is not set
+# CONFIG_SYS_HYPERVISOR is not set
+# CONFIG_FRAME_POINTER is not set
+# CONFIG_KMEMCHECK is not set
+# CONFIG_DEBUG_LOCK_ALLOC is not set
+# CONFIG_PROVE_LOCKING is not set
+# CONFIG_LOCK_STAT is not set
+# CONFIG_PROVE_RCU is not set
+# CONFIG_LATENCYTOP is not set
+# CONFIG_FTRACE is not set
+# CONFIG_BINARY_PRINTF is not set

Looks better (it's with the patch already):

00000000000000f0 <_raw_spin_unlock_irqrestore>:
   f0:   53                      push   %rbx
   f1:   48 89 f3                mov    %rsi,%rbx
   f4:   e8 00 00 00 00          callq  f9 <_raw_spin_unlock_irqrestore+0x9>     f5: R_X86_64_PC32       do_raw_spin_unlock-0x4
   f9:   80 e7 02                and    $0x2,%bh
   fc:   74 01                   je     ff <_raw_spin_unlock_irqrestore+0xf>
   fe:   fb                      sti
   ff:   65 ff 0d 00 00 00 00    decl   %gs:0x0(%rip)        # 106 <_raw_spin_unlock_irqrestore+0x16>    102: R_X86_64_PC32      __preempt_count-0x4
  106:   5b                      pop    %rbx
  107:   c3                      retq

This also raises questions, such as "why is do_raw_spin_unlock() not inlined here?"

Anyway... on to more kernel builds and testing.
Please take a look at the below patch. If it's obviously buggy, let me know.


diff -urp linux-4.7.1.org/arch/x86/include/asm/irqflags.h linux-4.7.1.obj/arch/x86/include/asm/irqflags.h
--- linux-4.7.1.org/arch/x86/include/asm/irqflags.h	2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/include/asm/irqflags.h	2016-08-18 10:01:09.514219644 +0200
@@ -44,6 +44,16 @@ static inline void native_irq_enable(voi
  	asm volatile("sti": : :"memory");
  }
  
+/*
+ * This produces a "test; jz; sti" insn sequence.
+ * It's faster than "push reg; popf". If sti is skipped, it's much faster.
+ */
+static inline void native_cond_irq_enable(unsigned long flags)
+{
+	if (flags & X86_EFLAGS_IF)
+		native_irq_enable();
+}
+
  static inline void native_safe_halt(void)
  {
  	asm volatile("sti; hlt": : :"memory");
@@ -69,7 +79,8 @@ static inline notrace unsigned long arch
  
  static inline notrace void arch_local_irq_restore(unsigned long flags)
  {
-	native_restore_fl(flags);
+	if (flags & X86_EFLAGS_IF)
+		native_irq_enable();
  }
  
  static inline notrace void arch_local_irq_disable(void)
diff -urp linux-4.7.1.org/arch/x86/kernel/paravirt.c linux-4.7.1.obj/arch/x86/kernel/paravirt.c
--- linux-4.7.1.org/arch/x86/kernel/paravirt.c	2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/kernel/paravirt.c	2016-08-18 10:01:24.797155109 +0200
@@ -313,7 +313,7 @@ struct pv_time_ops pv_time_ops = {
  
  __visible struct pv_irq_ops pv_irq_ops = {
  	.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
-	.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
+	.restore_fl = __PV_IS_CALLEE_SAVE(native_cond_irq_enable),
  	.irq_disable = __PV_IS_CALLEE_SAVE(native_irq_disable),
  	.irq_enable = __PV_IS_CALLEE_SAVE(native_irq_enable),
  	.safe_halt = native_safe_halt,
diff -urp linux-4.7.1.org/arch/x86/kernel/paravirt_patch_32.c linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_32.c
--- linux-4.7.1.org/arch/x86/kernel/paravirt_patch_32.c	2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_32.c	2016-08-18 04:43:19.388102755 +0200
@@ -2,7 +2,7 @@
  
  DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
  DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
-DEF_NATIVE(pv_irq_ops, restore_fl, "push %eax; popf");
+DEF_NATIVE(pv_irq_ops, restore_fl, "testb $((1<<9)>>8),%ah; jz 1f; sti; 1: ;");
  DEF_NATIVE(pv_irq_ops, save_fl, "pushf; pop %eax");
  DEF_NATIVE(pv_cpu_ops, iret, "iret");
  DEF_NATIVE(pv_mmu_ops, read_cr2, "mov %cr2, %eax");
diff -urp linux-4.7.1.org/arch/x86/kernel/paravirt_patch_64.c linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_64.c
--- linux-4.7.1.org/arch/x86/kernel/paravirt_patch_64.c	2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_64.c	2016-08-18 04:42:56.364545509 +0200
@@ -4,7 +4,7 @@
  
  DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
  DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
-DEF_NATIVE(pv_irq_ops, restore_fl, "pushq %rdi; popfq");
+DEF_NATIVE(pv_irq_ops, restore_fl, "testw $(1<<9),%di; jz 1f; sti; 1: ;");
  DEF_NATIVE(pv_irq_ops, save_fl, "pushfq; popq %rax");
  DEF_NATIVE(pv_mmu_ops, read_cr2, "movq %cr2, %rax");
  DEF_NATIVE(pv_mmu_ops, read_cr3, "movq %cr3, %rax");


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-18  9:21                   ` Denys Vlasenko
@ 2016-08-18 12:18                     ` Denys Vlasenko
  2016-08-18 17:22                       ` Paolo Bonzini
  0 siblings, 1 reply; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-18 12:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

[-- Attachment #1: Type: text/plain, Size: 8089 bytes --]

On 08/18/2016 11:21 AM, Denys Vlasenko wrote:
>>> Of course, if somebody uses native_restore_fl() to actually *disable*
>>> interrupts (when they weren't already disabled), then this untested
>>> patch will just not work. But why would you do something so stupid?
>>> Famous last words...
>>
>> Looking around, the vsmp code actually uses "native_restore_fl()" to
>> not just set the interrupt flag, but AC as well.
>>
>> And the PV spinlock case has that "push;popf" sequence encoded in an alternate.
>>
>> So that trivial patch may (or may not - still not tested) work for
>> some quick testing, but needs more effort for any *real* use.
>
> I'm going to test the attached patch.
...
>
> +# CONFIG_UPROBES is not set
> +# CONFIG_SCHED_OMIT_FRAME_POINTER is not set
> +# CONFIG_HYPERVISOR_GUEST is not set
> +# CONFIG_SYS_HYPERVISOR is not set
> +# CONFIG_FRAME_POINTER is not set
> +# CONFIG_KMEMCHECK is not set
> +# CONFIG_DEBUG_LOCK_ALLOC is not set
> +# CONFIG_PROVE_LOCKING is not set
> +# CONFIG_LOCK_STAT is not set
> +# CONFIG_PROVE_RCU is not set
> +# CONFIG_LATENCYTOP is not set
> +# CONFIG_FTRACE is not set
> +# CONFIG_BINARY_PRINTF is not set

I also need !CONFIG_DEBUG_SPINLOCK; then the unpatched irqrestore is:

ffffffff817115a0 <_raw_spin_unlock_irqrestore>:
ffffffff817115a0:       c6 07 00                movb   $0x0,(%rdi)
ffffffff817115a3:       56                      push   %rsi
ffffffff817115a4:       9d                      popfq
ffffffff817115a5:       65 ff 0d e4 ad 8f 7e    decl   %gs:__preempt_count
ffffffff817115ac:       c3                      retq
ffffffff817115ad:       0f 1f 00                nopl   (%rax)

The patched one is:

ffffffff81711660 <_raw_spin_unlock_irqrestore>:
ffffffff81711660:       f7 c6 00 02 00 00       test   $0x200,%esi
ffffffff81711666:       c6 07 00                movb   $0x0,(%rdi)
ffffffff81711669:       74 01                   je     ffffffff8171166c <_raw_spin_unlock_irqrestore+0xc>
ffffffff8171166b:       fb                      sti
ffffffff8171166c:       65 ff 0d 1d ad 8f 7e    decl   %gs:__preempt_count
ffffffff81711673:       c3                      retq
ffffffff81711674:       66 90                   xchg   %ax,%ax
ffffffff81711676:       66 2e 0f 1f 84 00 00 00 00 00   nopw   %cs:0x0(%rax,%rax,1)


Ran the following twice on a quiesced machine:

taskset 1 perf stat -r60 perf bench sched messaging
taskset 1 perf stat -r60 perf bench sched pipe

Out of these four runs, both "perf bench sched pipe" runs show improvements:

-       2648.279829      task-clock (msec)         #    1.000 CPUs utilized            ( +-  0.24% )
+       2483.143469      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.31% )
-         2,000,002      context-switches          #    0.755 M/sec                    ( +-  0.00% )
+         2,000,013      context-switches          #    0.805 M/sec                    ( +-  0.00% )
-               547      page-faults               #    0.206 K/sec                    ( +-  0.04% )
+               546      page-faults               #    0.220 K/sec                    ( +-  0.04% )
-     8,723,284,926      cycles                    #    3.294 GHz                      ( +-  0.06% )
+     8,157,949,449      cycles                    #    3.285 GHz                      ( +-  0.07% )
-    12,286,937,344      instructions              #    1.41  insn per cycle           ( +-  0.03% )
+    12,255,616,405      instructions              #    1.50  insn per cycle           ( +-  0.05% )
-     2,588,839,023      branches                  #  977.555 M/sec                    ( +-  0.02% )
+     2,599,319,615      branches                  # 1046.786 M/sec                    ( +-  0.04% )
-         3,620,273      branch-misses             #    0.14% of all branches          ( +-  0.67% )
+         3,577,771      branch-misses             #    0.14% of all branches          ( +-  0.69% )
-       2.648799072 seconds time elapsed                                          ( +-  0.24% )
+       2.487452268 seconds time elapsed                                          ( +-  0.31% )

Good, we run more insns/cycle, as expected. However, there are a bit more branches.

But of the two "perf bench sched messaging" runs, one was slower on the patched kernel,
and perf shows why: more branches, and also a larger branch-miss percentage:

-        690.008697      task-clock (msec)         #    0.996 CPUs utilized            ( +-  0.45% )
+        699.526509      task-clock (msec)         #    0.994 CPUs utilized            ( +-  0.28% )
-            26,796      context-switches          #    0.039 M/sec                    ( +-  8.31% )
+            29,088      context-switches          #    0.042 M/sec                    ( +-  6.62% )
-            35,477      page-faults               #    0.051 M/sec                    ( +-  0.11% )
+            35,504      page-faults               #    0.051 M/sec                    ( +-  0.14% )
-     2,157,701,609      cycles                    #    3.127 GHz                      ( +-  0.35% )
+     2,143,781,407      cycles                    #    3.065 GHz                      ( +-  0.25% )
-     3,115,212,672      instructions              #    1.44  insn per cycle           ( +-  0.28% )
+     3,253,499,549      instructions              #    1.52  insn per cycle           ( +-  0.19% )
-       661,888,593      branches                  #  959.247 M/sec                    ( +-  0.36% )
+       707,862,655      branches                  # 1011.917 M/sec                    ( +-  0.20% )
-         2,793,203      branch-misses             #    0.42% of all branches          ( +-  1.04% )
+         3,453,397      branch-misses             #    0.49% of all branches          ( +-  0.32% )
-       0.693004918 seconds time elapsed                                          ( +-  0.45% )
+       0.703630988 seconds time elapsed                                          ( +-  0.27% )

This tipped the scales: despite the higher insns/cycle, the run time is worse.

The other "perf bench sched messaging" run was luckier:

-        706.944245      task-clock (msec)         #    0.995 CPUs utilized            ( +-  0.32% )
+        687.038856      task-clock (msec)         #    0.993 CPUs utilized            ( +-  0.31% )
-            23,489      context-switches          #    0.033 M/sec                    ( +-  7.02% )
+            26,644      context-switches          #    0.039 M/sec                    ( +-  7.46% )
-            35,360      page-faults               #    0.050 M/sec                    ( +-  0.12% )
+            35,417      page-faults               #    0.052 M/sec                    ( +-  0.13% )
-     2,183,639,816      cycles                    #    3.089 GHz                      ( +-  0.35% )
+     2,123,086,753      cycles                    #    3.090 GHz                      ( +-  0.27% )
-     3,131,362,238      instructions              #    1.43  insn per cycle           ( +-  0.19% )
+     3,236,613,433      instructions              #    1.52  insn per cycle           ( +-  0.19% )
-       667,874,319      branches                  #  944.734 M/sec                    ( +-  0.21% )
+       703,677,908      branches                  # 1024.219 M/sec                    ( +-  0.20% )
-         2,859,521      branch-misses             #    0.43% of all branches          ( +-  0.56% )
+         3,462,063      branch-misses             #    0.49% of all branches          ( +-  0.33% )
-       0.710738536 seconds time elapsed                                          ( +-  0.32% )
+       0.691908533 seconds time elapsed                                          ( +-  0.31% )

However, it still had more branches (~5% more) and a worse branch-miss percentage.

The patch seems to work. It also does not bloat the kernel:

    text	   data	    bss	    dec	    hex	filename
8199556	5026784	2924544	16150884 f67164	vmlinux
8199897	5026784	2924544	16151225 f672b9	vmlinux.patched

However, a "conditional CLI/STI from r/m" insn could be better still.

The patch is attached.

[-- Attachment #2: z.diff --]
[-- Type: text/x-patch, Size: 3076 bytes --]

diff -urp linux-4.7.1.org/arch/x86/include/asm/irqflags.h linux-4.7.1.obj/arch/x86/include/asm/irqflags.h
--- linux-4.7.1.org/arch/x86/include/asm/irqflags.h	2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/include/asm/irqflags.h	2016-08-18 10:01:09.514219644 +0200
@@ -44,6 +44,16 @@ static inline void native_irq_enable(voi
 	asm volatile("sti": : :"memory");
 }
 
+/*
+ * This produces a "test; jz; sti" insn sequence.
+ * It's faster than "push reg; popf". If sti is skipped, it's much faster.
+ */
+static inline void native_cond_irq_enable(unsigned long flags)
+{
+	if (flags & X86_EFLAGS_IF)
+		native_irq_enable();
+}
+
 static inline void native_safe_halt(void)
 {
 	asm volatile("sti; hlt": : :"memory");
@@ -69,7 +79,8 @@ static inline notrace unsigned long arch
 
 static inline notrace void arch_local_irq_restore(unsigned long flags)
 {
-	native_restore_fl(flags);
+	if (flags & X86_EFLAGS_IF)
+		native_irq_enable();
 }
 
 static inline notrace void arch_local_irq_disable(void)
diff -urp linux-4.7.1.org/arch/x86/kernel/paravirt.c linux-4.7.1.obj/arch/x86/kernel/paravirt.c
--- linux-4.7.1.org/arch/x86/kernel/paravirt.c	2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/kernel/paravirt.c	2016-08-18 10:01:24.797155109 +0200
@@ -313,7 +313,7 @@ struct pv_time_ops pv_time_ops = {
 
 __visible struct pv_irq_ops pv_irq_ops = {
 	.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
-	.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
+	.restore_fl = __PV_IS_CALLEE_SAVE(native_cond_irq_enable),
 	.irq_disable = __PV_IS_CALLEE_SAVE(native_irq_disable),
 	.irq_enable = __PV_IS_CALLEE_SAVE(native_irq_enable),
 	.safe_halt = native_safe_halt,
diff -urp linux-4.7.1.org/arch/x86/kernel/paravirt_patch_32.c linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_32.c
--- linux-4.7.1.org/arch/x86/kernel/paravirt_patch_32.c	2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_32.c	2016-08-18 04:43:19.388102755 +0200
@@ -2,7 +2,7 @@
 
 DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
 DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
-DEF_NATIVE(pv_irq_ops, restore_fl, "push %eax; popf");
+DEF_NATIVE(pv_irq_ops, restore_fl, "testb $((1<<9)>>8),%ah; jz 1f; sti; 1: ;");
 DEF_NATIVE(pv_irq_ops, save_fl, "pushf; pop %eax");
 DEF_NATIVE(pv_cpu_ops, iret, "iret");
 DEF_NATIVE(pv_mmu_ops, read_cr2, "mov %cr2, %eax");
diff -urp linux-4.7.1.org/arch/x86/kernel/paravirt_patch_64.c linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_64.c
--- linux-4.7.1.org/arch/x86/kernel/paravirt_patch_64.c	2016-08-16 09:35:15.000000000 +0200
+++ linux-4.7.1.obj/arch/x86/kernel/paravirt_patch_64.c	2016-08-18 04:42:56.364545509 +0200
@@ -4,7 +4,7 @@
 
 DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
 DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
-DEF_NATIVE(pv_irq_ops, restore_fl, "pushq %rdi; popfq");
+DEF_NATIVE(pv_irq_ops, restore_fl, "testw $(1<<9),%di; jz 1f; sti; 1: ;");
 DEF_NATIVE(pv_irq_ops, save_fl, "pushfq; popq %rax");
 DEF_NATIVE(pv_mmu_ops, read_cr2, "movq %cr2, %rax");
 DEF_NATIVE(pv_mmu_ops, read_cr3, "movq %cr3, %rax");


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-17 21:48                     ` Linus Torvalds
@ 2016-08-18 13:26                       ` Denys Vlasenko
  2016-08-18 17:24                         ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-18 13:26 UTC (permalink / raw)
  To: Linus Torvalds, Andy Lutomirski
  Cc: Sara Sharon, Dan Williams, Christian König, Vinod Koul,
	Alex Deucher, Johannes Berg, Rafael J. Wysocki, Andy Lutomirski,
	the arch/x86 maintainers, Ingo Molnar, LKML, Adrian Hunter

> Of course, somebody really should do timings on modern CPU's (in cpl0,
> comparing native_fl() that enables interrupts with a popf)

I didn't do CPL0 tests yet. I realized that cli/sti can be tested in userspace
if we set iopl(3) first.

Surprisingly, STI is slower than CLI. A loop with 27 CLIs and one STI
converges to about 0.5 insn/cycle:

# compile with: gcc -nostartfiles -nostdlib
_start:         .globl  _start
                 mov     $172, %eax #iopl
                 mov     $3, %edi
                 syscall
                 mov     $200*1000*1000, %eax
                 .balign 64
loop:
                 cli;cli;cli;cli
                 cli;cli;cli;cli
                 cli;cli;cli;cli
                 cli;cli;cli;cli

                 cli;cli;cli;cli
                 cli;cli;cli;cli
                 cli;cli;cli;sti
                 dec     %eax
                 jnz     loop

                 mov     $231, %eax #exit_group
                 syscall

perf stat:
      6,015,787,968      instructions              #    0.52  insn per cycle
        3.355474199 seconds time elapsed

With all CLIs replaced by STIs, it's ~0.25 insn/cycle:

      6,030,530,328      instructions              #    0.27  insn per cycle
        6.547200322 seconds time elapsed


A POPF which needs to enable interrupts is not measurably faster than
one which does not change IF:

Loop with:
   400158:	fa                   	cli
   400159:	53                   	push   %rbx  #saved eflags with if=1
   40015a:	9d                   	popfq
shows:
      8,908,857,324      instructions              #    0.11  insn per cycle           ( +-  0.00% )

Loop with:
   400140:	fb                   	sti
   400141:	53                   	push   %rbx
   400142:	9d                   	popfq
shows:
      8,920,243,701      instructions              #    0.10  insn per cycle           ( +-  0.01% )

Even loop with neither CLI nor STI, only with POPF:
   400140:	53                   	push   %rbx
   400141:	9d                   	popfq
shows:
      6,079,936,714      instructions              #    0.10  insn per cycle           ( +-  0.00% )

This is on a Skylake CPU.


The gist of it:
CLI is 2 cycles,
STI is 4 cycles,
POPF is 10 cycles,
seemingly regardless of the prior value of EFLAGS.IF.


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-18 12:18                     ` Denys Vlasenko
@ 2016-08-18 17:22                       ` Paolo Bonzini
  0 siblings, 0 replies; 25+ messages in thread
From: Paolo Bonzini @ 2016-08-18 17:22 UTC (permalink / raw)
  To: Denys Vlasenko, Linus Torvalds
  Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter



On 18/08/2016 14:18, Denys Vlasenko wrote:
> 
> -     2,588,839,023      branches                  #  977.555 M/sec    ( +-  0.02% )
> +     2,599,319,615      branches                  # 1046.786 M/sec    ( +-  0.04% )
> -         3,620,273      branch-misses             #    0.14% of all branches    ( +-  0.67% )
> +         3,577,771      branch-misses             #    0.14% of all branches    ( +-  0.69% )
> -       2.648799072 seconds time elapsed           ( +-  0.24% )
> +       2.487452268 seconds time elapsed           ( +-  0.31% )
> 
> Good, we run more insns/cycle, as expected. However, a bit more branches.

Can you see where the missed branches are?  Assuming branch misses are
the case where IF=0, perhaps there are a few places that can be changed
to spin_lock/unlock_irq or local_irq_disable/enable.
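
(For concreteness, a sketch of that kind of conversion at a call site which is
known to always run with interrupts enabled; the structure and function names
below are made up, and a real version needs <linux/spinlock.h> and
<linux/list.h>:)

struct my_dev {
        spinlock_t lock;
        struct list_head queue;
};

/* Before: caller context unknown, so EFLAGS must be saved and conditionally
 * restored on unlock - the POPF / branch+STI being discussed: */
static void my_dev_queue(struct my_dev *dev, struct list_head *item)
{
        unsigned long flags;

        spin_lock_irqsave(&dev->lock, flags);
        list_add_tail(item, &dev->queue);
        spin_unlock_irqrestore(&dev->lock, flags);
}

/* After: every caller is known to run with interrupts enabled, so the flags
 * word, the conditional re-enable and its branch all disappear: */
static void my_dev_queue_irqs_on(struct my_dev *dev, struct list_head *item)
{
        spin_lock_irq(&dev->lock);
        list_add_tail(item, &dev->queue);
        spin_unlock_irq(&dev->lock);
}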

Paolo


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-18 13:26                       ` Denys Vlasenko
@ 2016-08-18 17:24                         ` Linus Torvalds
  2016-08-18 17:47                           ` Denys Vlasenko
  2016-08-19 10:54                           ` Paolo Bonzini
  0 siblings, 2 replies; 25+ messages in thread
From: Linus Torvalds @ 2016-08-18 17:24 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

On Thu, Aug 18, 2016 at 6:26 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>
> I didn't do CPL0 tests yet. Realized that cli/sti can be tested in userspace
> if we set iopl(3) first.

Yes, but it might not be the same. So the timings could be very
different from a cpl0 case.

Also:

> Surprisingly, STI is slower than CLI. A loop with 27 CLI's and one STI
> converges to about ~0.5 insn/cycle:

You really, really should not test "sti" together with an immediately
following sti or cli.

The sti instruction has an architecturally defined magical
one-instruction window following it during which interrupts stay
disabled. I could easily see that resulting in strange special cases -
Intel actually at some point documented that a sequence of 'sti'
instructions is not going to keep interrupts disabled forever (there
was a question of what happens if you start out with interrupts
disabled, go to a 16-bit code segment that is filled entirely with
"sti" instructions so that the 16-bit EIP will wrap around and
continually do an infinite series of 'sti' - do interrupts ever get
enabled?)

I think Intel clarified that when you have a sequence of 'sti'
instructions, interrupts will get enabled after the second one, but
the point is that this is all "special" from a front-end angle. So
putting multiple 'sti' instructions in a bunch may be testing the
magical special case more than it would test anything *real*.

So at a minimum, make the sequence be "sti; nop" if you do it in a
loop. It may not change anything, but at least that way you'll know it
doesn't just test the magical case.

Realistically, it's better to instead test a *real* instruction
sequence, ie just compare something like

    pushf
    cli
    .. do a memory operation here or something half-way real ..
    pop
    sti

and

    pushf
    cli
    .. do the same half-way real memory op here ..
    popf

and see which one is faster in a loop.
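
One possible user-space harness for that comparison (a sketch only: the file
name, function names and iteration count are made up, it has to run as root so
that iopl(3) succeeds, and the actual timing is left to perf stat):

/* popf_vs_sti.c - compare "cli ... sti" against "cli ... popf".
 * Build: gcc -O2 -mno-red-zone popf_vs_sti.c -o popf_vs_sti
 * (-mno-red-zone because the inline asm pushes onto the stack).
 * Run as root:  perf stat ./popf_vs_sti        # sti variant
 *               perf stat ./popf_vs_sti popf   # popf variant
 */
#include <sys/io.h>

static volatile unsigned long dummy;    /* the "half-way real" memory op */

static inline unsigned long irq_save(void)
{
        unsigned long flags;

        /* same idea as the kernel's native_save_fl() + native_irq_disable() */
        asm volatile("pushf; pop %0; cli" : "=r" (flags) : : "memory");
        return flags;
}

static inline void restore_via_sti(unsigned long flags)
{
        (void)flags;                    /* IF is known to have been 1 here */
        asm volatile("sti" : : : "memory");
}

static inline void restore_via_popf(unsigned long flags)
{
        asm volatile("push %0; popf" : : "r" (flags) : "memory", "cc");
}

int main(int argc, char **argv)
{
        unsigned int i;

        (void)argv;
        if (iopl(3))                    /* allow cli/sti/popf(IF) at CPL3 */
                return 1;

        for (i = 0; i < 200u * 1000 * 1000; i++) {
                unsigned long flags = irq_save();

                dummy++;                /* the memory op between save and restore */
                if (argc > 1)
                        restore_via_popf(flags);
                else
                        restore_via_sti(flags);
        }
        return 0;
}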

That said, your numbers really aren't very convincing. If popf really
is just 10 cycles on modern Intel hardware, it's already fast enough
that I really don't think it matters. Especially with "sti" being ~4
cycles, and there being a question about branch overhead anyway. You
win some, you lose some, but on the whole it sounds like "leave it
alone" wins.

Now, I know for a fact that there have been other x86 microarchitectures
that sucked at "popf", but they may suck almost equally at "sti". So
this might well be worth testing out on something that isn't Skylake.

Modern Intel cores really are pretty good at even the slow operations.
Things used to be much, much worse in the bad old P4 days. I'm very
impressed with the big Intel cores.

                    Linus


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-18 17:24                         ` Linus Torvalds
@ 2016-08-18 17:47                           ` Denys Vlasenko
  2016-08-18 17:49                             ` Denys Vlasenko
  2016-08-19 10:54                           ` Paolo Bonzini
  1 sibling, 1 reply; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-18 17:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

On 08/18/2016 07:24 PM, Linus Torvalds wrote:
> That said, your numbers really aren't very convincing. If popf really
> is just 10 cycles on modern Intel hardware, it's already fast enough
> that I really don't think it matters.

It's 20 cycles. I was wrong in my previous email: I forgot that the
insn count also includes the "push %rbx" insns.

Since I already made a mistake, let me double-check.

200 million iterations of this loop execute in just under 17 seconds:

   400100:	b8 00 c2 eb 0b       	mov    $0xbebc200,%eax # 1000*1000*1000
   400105:	9c                   	pushfq
   400106:	5b                   	pop    %rbx
   400107:	90                   	nop
....
0000000000400140 <loop>:
   400140:	53                   	push   %rbx
   400141:	9d                   	popfq
   400142:	53                   	push   %rbx
   400143:	9d                   	popfq
   400144:	53                   	push   %rbx
   400145:	9d                   	popfq
   400146:	53                   	push   %rbx
   400147:	9d                   	popfq
   400148:	53                   	push   %rbx
   400149:	9d                   	popfq
   40014a:	53                   	push   %rbx
   40014b:	9d                   	popfq
   40014c:	53                   	push   %rbx
   40014d:	9d                   	popfq
   40014e:	53                   	push   %rbx
   40014f:	9d                   	popfq
   400150:	53                   	push   %rbx
   400151:	9d                   	popfq
   400152:	53                   	push   %rbx
   400153:	9d                   	popfq
   400154:	53                   	push   %rbx
   400155:	9d                   	popfq
   400156:	53                   	push   %rbx
   400157:	9d                   	popfq
   400158:	53                   	push   %rbx
   400159:	9d                   	popfq
   40015a:	53                   	push   %rbx
   40015b:	9d                   	popfq
   40015c:	ff c8                	dec    %eax
   40015e:	75 e0                	jne    400140 <loop>

The loop is exactly 32 bytes, aligned.
There are 14 POPFs. Other insns are very fast.

No perf, just "time taskset 1 ./test".
My CPU frequency hovers around 3500 MHz when loaded.

17 seconds is 17*3500 million cycles.
17*3500 million cycles / (200*14) million POPFs = 21.25 cycles per POPF.

Thus, one POPF in CPL3 is ~20 cycles on Skylake.


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-18 17:47                           ` Denys Vlasenko
@ 2016-08-18 17:49                             ` Denys Vlasenko
  0 siblings, 0 replies; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-18 17:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter



On 08/18/2016 07:47 PM, Denys Vlasenko wrote:
> On 08/18/2016 07:24 PM, Linus Torvalds wrote:
>> That said, your numbers really aren't very convincing. If popf really
>> is just 10 cycles on modern Intel hardware, it's already fast enough
>> that I really don't think it matters.
>
> It's 20 cycles. I was wrong in my previous email: I forgot that the
> insn count also includes the "push %rbx" insns.
>
> Since I already made a mistake, let me double-check.
>
> 200 million iterations of this loop execute in just under 17 seconds:
>
>   400100:    b8 00 c2 eb 0b           mov    $0xbebc200,%eax # 1000*1000*1000

Grr. It's 200*1000*1000, not one billion. Sorry.


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-18 17:24                         ` Linus Torvalds
  2016-08-18 17:47                           ` Denys Vlasenko
@ 2016-08-19 10:54                           ` Paolo Bonzini
  2016-08-31 11:12                             ` Denys Vlasenko
  1 sibling, 1 reply; 25+ messages in thread
From: Paolo Bonzini @ 2016-08-19 10:54 UTC (permalink / raw)
  To: Linus Torvalds, Denys Vlasenko
  Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter



On 18/08/2016 19:24, Linus Torvalds wrote:
>> > I didn't do CPL0 tests yet. Realized that cli/sti can be tested in userspace
>> > if we set iopl(3) first.
> Yes, but it might not be the same. So the timings could be very
> different from a cpl0 case.

FWIW I recently measured around 20 cycles for a popf as well on
Haswell-EP and CPL=0 (that was for commit f2485b3e0c6c, "KVM: x86: use
guest_exit_irqoff", 2016-07-01).

Paolo


* Re: RFC: Petition Intel/AMD to add POPF_IF insn
  2016-08-19 10:54                           ` Paolo Bonzini
@ 2016-08-31 11:12                             ` Denys Vlasenko
  0 siblings, 0 replies; 25+ messages in thread
From: Denys Vlasenko @ 2016-08-31 11:12 UTC (permalink / raw)
  To: Paolo Bonzini, Linus Torvalds
  Cc: Andy Lutomirski, Sara Sharon, Dan Williams, Christian König,
	Vinod Koul, Alex Deucher, Johannes Berg, Rafael J. Wysocki,
	Andy Lutomirski, the arch/x86 maintainers, Ingo Molnar, LKML,
	Adrian Hunter

On 08/19/2016 12:54 PM, Paolo Bonzini wrote:
> On 18/08/2016 19:24, Linus Torvalds wrote:
>>>> I didn't do CPL0 tests yet. Realized that cli/sti can be tested in userspace
>>>> if we set iopl(3) first.
>> Yes, but it might not be the same. So the timings could be very
>> different from a cpl0 case.
>
> FWIW I recently measured around 20 cycles for a popf as well on
> Haswell-EP and CPL=0 (that was for commit f2485b3e0c6c, "KVM: x86: use
> guest_exit_irqoff", 2016-07-01).

Thanks for confirmation.

I revisited benchmarking of the

	if (flags & X86_EFLAGS_IF)
		native_irq_enable();

patch. In "make -j20" kernel compiles on an 8-way (HT) CPU, it shows a ~5 second
improvement over a ~16 minute compile. That's a 0.5% speedup. It's OK, but not
something to be too excited about.

The disassembly of vmlinux shows that gcc generates these asm patterns:

80 e6 02                and    $0x2,%dh
74 01                   je     ffffffff810101ae <intel_pt_handle_vmx+0x3e>
fb                      sti

41 f6 86 91 00 00 00 02 testb  $0x2,0x91(%r14)
74 01                   je     ffffffff81013ce7 <math_error+0x77>
fb                      sti

f6 83 91 00 00 00 02    testb  $0x2,0x91(%rbx)
74 01                   je     ffffffff81013efa <do_int3+0xba>
fb                      sti

41 f7 c4 00 02 00 00    test   $0x200,%r12d
74 01                   je     ffffffff8101615d <oops_end+0x5d>
fb                      sti

Here we trade a 20-cycle POPF for either a 4-cycle STI or a branch (which is
either ~1 cycle if predicted, or ~20 cycles if mispredicted).

I still think a dedicated instruction for a conditional STI is worth asking for.

Along the lines of "If bit 9 in the r/m argument is set, then STI, else nothing".
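
In C-like pseudocode (a sketch of the proposed semantics only, not an existing
instruction), that would be:

        /* "POPF_IF src", where src is the r/m operand */
        if (src & (1 << 9))             /* bit 9 = IF */
                EFLAGS.IF = 1;          /* i.e. behave like STI */
        /* otherwise IF - and every other flag - is left untouched */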

What do people from CPU companies say?

