linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] x86: Optimize variable_test_bit()
@ 2015-05-01 15:16 Peter Zijlstra
  2015-05-01 16:03 ` Linus Torvalds
  2015-05-04 13:42 ` Peter Zijlstra
  0 siblings, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2015-05-01 15:16 UTC (permalink / raw)
  To: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linus Torvalds
  Cc: linux-kernel, Borislav Petkov, Jakub Jelinek

While looking at some asm I noticed we produce the most horrid code for
test_bit():

    1a5e:       49 0f a3 30             bt     %rsi,(%r8)
    1a62:       45 19 c0                sbb    %r8d,%r8d
    1a65:       45 85 c0                test   %r8d,%r8d
    1a68:       75 a6                   jne    1a10 <x86_schedule_events+0xc0>

Since test_bit() doesn't actually have any output variables, we can use
asm goto without having to add a memory clobber. This reduces the code
to something sensible:

    1a12:       49 0f a3 30             bt     %rsi,(%r8)
    1a16:       72 68                   jb     1a80 <x86_schedule_events+0x130>

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
PS. should we kill the memory clobber for __test_and_change_bit()? It
    seems inconsistent and out of place.

PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto;
     is this something we can get 'fixed' ?

 arch/x86/include/asm/bitops.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index cfe3b954d5e4..bcf4fa77c04f 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -313,6 +313,15 @@ static __always_inline int constant_test_bit(long nr, const volatile unsigned lo
 
 static inline int variable_test_bit(long nr, volatile const unsigned long *addr)
 {
+#ifdef CC_HAVE_ASM_GOTO
+	asm_volatile_goto ("bt %1, %0\n\t"
+			   "jc %l[cc_label]"
+			   : : "m" (*(unsigned long *)addr), "Ir" (nr)
+			   : : cc_label);
+	return 0;
+cc_label:
+	return 1;
+#else
 	int oldbit;
 
 	asm volatile("bt %2,%1\n\t"
@@ -321,6 +330,7 @@ static inline int variable_test_bit(long nr, volatile const unsigned long *addr)
 		     : "m" (*(unsigned long *)addr), "Ir" (nr));
 
 	return oldbit;
+#endif
 }
 
 #if 0 /* Fool kernel-doc since it doesn't do macros yet */

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH] x86: Optimize variable_test_bit()
  2015-05-01 15:16 [PATCH] x86: Optimize variable_test_bit() Peter Zijlstra
@ 2015-05-01 16:03 ` Linus Torvalds
  2015-05-01 16:16   ` Peter Zijlstra
                     ` (2 more replies)
  2015-05-04 13:42 ` Peter Zijlstra
  1 sibling, 3 replies; 31+ messages in thread
From: Linus Torvalds @ 2015-05-01 16:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Linux Kernel Mailing List, Borislav Petkov, Jakub Jelinek

On Fri, May 1, 2015 at 8:16 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Since test_bit() doesn't actually have any output variables, we can use
> asm goto without having to add a memory clobber. This reduces the code
> to something sensible:

Yes, looks good, except if we have anything that actually wants to use
the value rather than branch on it. But a quick grep seems to show
that the vast majority of them are all about just directly testing the
result.

It worries me a bit that gcc now cannot pick the likely branch any
more. It will always branch out for the bit being set. So code like
this:

    net/core/dev.c:         if
(likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))

wouldn't work, but almost all of those seem to be the constant case
that doesn't get to this anyway.

So on the whole, this would seem to be a win.

> PS. should we kill the memory clobber for __test_and_change_bit()? It
>     seems inconsistent and out of place.

We don't seem to have it for the clear case, so yeah..

> PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto;
>      is this something we can get 'fixed' ?

I suspect the problem is that now the particular register allocation
choices are basically not just around the asm, they'd affect the
target labels of the asm too.

I think that for the kernel, it would *generally* be ok to just say
that the outputs are only valid in the case the asm does *not* branch
out, assuming that the *clobbers* obviously clobber things regardless.
Keeping the register allocation for the asm itself still purely
"local" to the asm.

Something with a memory output we could just turn into a memory
clobber (so we could do the test-and-change bits today without using
any outputs - just mark memory clobbered).

Don't get me wrong - it would be even better if outputs would be valid
even for the labels the asm can jump to, but I can see that being
fundamentally hard. For example, two different asm goto's that share
one label, but that have different output register allocation. That
would be a nightmare (the compiler could basically have to duplicate
the label etc - although maybe you have to do that anyway).

And many of the places where I would personally like to use "asm goto"
are where the goto label is for an error case. Things like a failed
cmpxchg, or a failed user access etc. It would generally be ok if the
output values from the asm were "lost", because it's about the cleanup
(or trying again)..

So we might be ok in many cases with that kind of "weaker output",
where the output is only valid when the goto is *not* taken.

                              Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] x86: Optimize variable_test_bit()
  2015-05-01 16:03 ` Linus Torvalds
@ 2015-05-01 16:16   ` Peter Zijlstra
  2015-05-01 16:29     ` Peter Zijlstra
  2015-05-01 16:18   ` Peter Zijlstra
  2015-05-01 16:33   ` Jakub Jelinek
  2 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2015-05-01 16:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Linux Kernel Mailing List, Borislav Petkov, Jakub Jelinek

On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote:
> On Fri, May 1, 2015 at 8:16 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Since test_bit() doesn't actually have any output variables, we can use
> > asm goto without having to add a memory clobber. This reduces the code
> > to something sensible:
> 
> Yes, looks good, except if we have anything that actually wants to use
> the value rather than branch on it. But a quick grep seems to show
> that the vast majority of them are all about just directly testing the
> result.
> 
> It worries me a bit that gcc now cannot pick the likely branch any
> more. It will always branch out for the bit being set. So code like
> this:
> 
>     net/core/dev.c:         if
> (likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
> 
> wouldn't work, but almost all of those seem to be the constant case
> that doesn't get to this anyway.

Ah yes, that's another thing we've previously discussed with the GCC
people (IIRC). The GCC manual states you can use hot and cold attributes
on the labels (although when we tested that it didn't actually work, it
might now). But that's no good if the hint is one (or more) layer up
from the asm goto.

If would indeed be very good if the likely/unlikely thing would work as
expected.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] x86: Optimize variable_test_bit()
  2015-05-01 16:03 ` Linus Torvalds
  2015-05-01 16:16   ` Peter Zijlstra
@ 2015-05-01 16:18   ` Peter Zijlstra
  2015-05-01 16:33   ` Jakub Jelinek
  2 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2015-05-01 16:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Linux Kernel Mailing List, Borislav Petkov, Jakub Jelinek

On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote:
> On Fri, May 1, 2015 at 8:16 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto;
> >      is this something we can get 'fixed' ?
> 
> I suspect the problem is that now the particular register allocation
> choices are basically not just around the asm, they'd affect the
> target labels of the asm too.
> 
> I think that for the kernel, it would *generally* be ok to just say
> that the outputs are only valid in the case the asm does *not* branch
> out, assuming that the *clobbers* obviously clobber things regardless.
> Keeping the register allocation for the asm itself still purely
> "local" to the asm.
> 
> Something with a memory output we could just turn into a memory
> clobber (so we could do the test-and-change bits today without using
> any outputs - just mark memory clobbered).

The risk is of course that we'll cause too much stores and reloads
around them and regress instead of win.

A single variable clobber might be a solution here ?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] x86: Optimize variable_test_bit()
  2015-05-01 16:16   ` Peter Zijlstra
@ 2015-05-01 16:29     ` Peter Zijlstra
  0 siblings, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2015-05-01 16:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Linux Kernel Mailing List, Borislav Petkov, Jakub Jelinek

On Fri, May 01, 2015 at 06:16:54PM +0200, Peter Zijlstra wrote:
> On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote:
> > On Fri, May 1, 2015 at 8:16 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > Since test_bit() doesn't actually have any output variables, we can use
> > > asm goto without having to add a memory clobber. This reduces the code
> > > to something sensible:
> > 
> > Yes, looks good, except if we have anything that actually wants to use
> > the value rather than branch on it. But a quick grep seems to show
> > that the vast majority of them are all about just directly testing the
> > result.
> > 
> > It worries me a bit that gcc now cannot pick the likely branch any
> > more. It will always branch out for the bit being set. So code like
> > this:
> > 
> >     net/core/dev.c:         if
> > (likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
> > 
> > wouldn't work, but almost all of those seem to be the constant case
> > that doesn't get to this anyway.
> 
> Ah yes, that's another thing we've previously discussed with the GCC
> people (IIRC). The GCC manual states you can use hot and cold attributes
> on the labels (although when we tested that it didn't actually work, it
> might now). But that's no good if the hint is one (or more) layer up
> from the asm goto.
> 
> If would indeed be very good if the likely/unlikely thing would work as
> expected.

Ah, I see what you meant. Yes we're stuck with the 'jc', the compiler
cannot flip that into a jnc.

The best it can do is add unconditional jumps to re-arrange the blocks,
and that's somewhat ugly indeed, although better than nothing at all.
And from experiments back when we did the static_branch stuff that all
didn't work at all.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] x86: Optimize variable_test_bit()
  2015-05-01 16:03 ` Linus Torvalds
  2015-05-01 16:16   ` Peter Zijlstra
  2015-05-01 16:18   ` Peter Zijlstra
@ 2015-05-01 16:33   ` Jakub Jelinek
  2015-05-01 16:45     ` Linus Torvalds
                       ` (2 more replies)
  2 siblings, 3 replies; 31+ messages in thread
From: Jakub Jelinek @ 2015-05-01 16:33 UTC (permalink / raw)
  To: Linus Torvalds, Richard Henderson, Vladimir Makarov
  Cc: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Linux Kernel Mailing List, Borislav Petkov

On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote:
> > PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto;
> >      is this something we can get 'fixed' ?

CCing Richard as author of asm goto and Vlad as register allocator
maintainer.  There are a few enhancement requests to support this, like
http://gcc.gnu.org/PR59615 and http://gcc.gnu.org/PR52381 , but indeed the
reason why no outputs are allowed is the register allocation issue.
Don't know if LRA would be better suited to handle that case, but it would
indeed be pretty hard.

	Jakub

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] x86: Optimize variable_test_bit()
  2015-05-01 16:33   ` Jakub Jelinek
@ 2015-05-01 16:45     ` Linus Torvalds
  2015-05-01 16:46     ` Peter Zijlstra
  2015-05-01 19:02     ` Vladimir Makarov
  2 siblings, 0 replies; 31+ messages in thread
From: Linus Torvalds @ 2015-05-01 16:45 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Richard Henderson, Vladimir Makarov, Peter Zijlstra, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List,
	Borislav Petkov

On Fri, May 1, 2015 at 9:33 AM, Jakub Jelinek <jakub@redhat.com> wrote:
>
> CCing Richard as author of asm goto and Vlad as register allocator
> maintainer.  There are a few enhancement requests to support this, like
> http://gcc.gnu.org/PR59615 and http://gcc.gnu.org/PR52381 , but indeed the
> reason why no outputs are allowed is the register allocation issue.

So would it help if it was documented that output registers would only
be valid for the "fallthrough" case of an asm goto, and would be
considered undefined for any jump targets?

It wouldn't be the perfect situation, but it would be better than not
having any outputs at all, and as mentioned, it would probably be
sufficient for a fair number of cases.

                              Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] x86: Optimize variable_test_bit()
  2015-05-01 16:33   ` Jakub Jelinek
  2015-05-01 16:45     ` Linus Torvalds
@ 2015-05-01 16:46     ` Peter Zijlstra
  2015-05-01 17:17       ` Ingo Molnar
  2015-05-01 19:02     ` Vladimir Makarov
  2 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2015-05-01 16:46 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Linus Torvalds, Richard Henderson, Vladimir Makarov, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List,
	Borislav Petkov

On Fri, May 01, 2015 at 06:33:29PM +0200, Jakub Jelinek wrote:
> On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote:
> > > PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto;
> > >      is this something we can get 'fixed' ?
> 
> CCing Richard as author of asm goto and Vlad as register allocator
> maintainer.  There are a few enhancement requests to support this, like
> http://gcc.gnu.org/PR59615 and http://gcc.gnu.org/PR52381 , but indeed the
> reason why no outputs are allowed is the register allocation issue.
> Don't know if LRA would be better suited to handle that case, but it would
> indeed be pretty hard.

So it would b awesome if we could use these freshly modeled flags as output
for regular asm stmts; that would obviate much of the asm goto hackery we now
do/have and allow gcc to pick the right branch for likely/unlikely.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] x86: Optimize variable_test_bit()
  2015-05-01 16:46     ` Peter Zijlstra
@ 2015-05-01 17:17       ` Ingo Molnar
  0 siblings, 0 replies; 31+ messages in thread
From: Ingo Molnar @ 2015-05-01 17:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jakub Jelinek, Linus Torvalds, Richard Henderson,
	Vladimir Makarov, H. Peter Anvin, Thomas Gleixner,
	Linux Kernel Mailing List, Borislav Petkov


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, May 01, 2015 at 06:33:29PM +0200, Jakub Jelinek wrote:
> > On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote:
> > > > PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto;
> > > >      is this something we can get 'fixed' ?
> > 
> > CCing Richard as author of asm goto and Vlad as register allocator 
> > maintainer.  There are a few enhancement requests to support this, 
> > like http://gcc.gnu.org/PR59615 and http://gcc.gnu.org/PR52381 , 
> > but indeed the reason why no outputs are allowed is the register 
> > allocation issue. Don't know if LRA would be better suited to 
> > handle that case, but it would indeed be pretty hard.
> 
> So it would b awesome if we could use these freshly modeled flags as 
> output for regular asm stmts; that would obviate much of the asm 
> goto hackery we now do/have and allow gcc to pick the right branch 
> for likely/unlikely.

If I may hijack the discussion a bit: it would also be awesome if 
there was a GCC flag that would allow us to use __builtin_expect() 
hints even when automatic branch heuristics are disabled:

I.e. very similar to -fno-guess-branch-probability, just that explicit 
__builtin_expect() hints would not be ignored (like 
-fno-guess-branch-probability does it today).

We could use this to compress the kernel instruction cache footprint 
by about 5% on x86-64, while still having all the hand-made 
optimizations that __builtin_expect() allows us.

It would be a perfect solution if -fno-guess-branch-probability just 
stopped ignoring __builtin_expect().

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] x86: Optimize variable_test_bit()
  2015-05-01 16:33   ` Jakub Jelinek
  2015-05-01 16:45     ` Linus Torvalds
  2015-05-01 16:46     ` Peter Zijlstra
@ 2015-05-01 19:02     ` Vladimir Makarov
  2015-05-01 20:49       ` Linus Torvalds
  2015-05-02 12:43       ` [PATCH] x86: Optimize variable_test_bit() Peter Zijlstra
  2 siblings, 2 replies; 31+ messages in thread
From: Vladimir Makarov @ 2015-05-01 19:02 UTC (permalink / raw)
  To: Jakub Jelinek, Linus Torvalds, Richard Henderson
  Cc: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Linux Kernel Mailing List, Borislav Petkov



On 01/05/15 12:33 PM, Jakub Jelinek wrote:
> On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote:
>>> PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto;
>>>       is this something we can get 'fixed' ?
> CCing Richard as author of asm goto and Vlad as register allocator
> maintainer.  There are a few enhancement requests to support this, like
> http://gcc.gnu.org/PR59615 and http://gcc.gnu.org/PR52381 , but indeed the
> reason why no outputs are allowed is the register allocation issue.
> Don't know if LRA would be better suited to handle that case, but it would
> indeed be pretty hard.
>
>
   GCC RA is a major reason to prohibit output operands for asm goto.

   Reload pass was not designed to deal with output reloads for control
flow insns.  It is very hard to implement this feature there and
implement it in a reliable way.  Also nobody does any development for
reload for long time.  So I doubt that somebody would do this for reload.

   LRA is more suitable to implement the feature.  In general, even
outputs used on any branch can be permitted.  Although critical edges
can complicate the implementation as new BBs are created. But it is
doable too.

   The only problem is that asm goto semantics in this case should be
defineddepending on what local register allocator (reload or LRA) GCC
for given target use.

   Currently LRA is used by x86/x86-64, ARM, AARCH64, s390, and MIPS.
PPC, SH, and ARC are moving to LRA.  All other targets are still
reload based.

   So I could implement the output reloads in LRA, probably for the
next GCC release.  How to enable and mostly use it for multi-target
code like the kernel is another question.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] x86: Optimize variable_test_bit()
  2015-05-01 19:02     ` Vladimir Makarov
@ 2015-05-01 20:49       ` Linus Torvalds
  2015-05-01 22:22         ` Vladimir Makarov
  2015-05-02 12:39         ` Peter Zijlstra
  2015-05-02 12:43       ` [PATCH] x86: Optimize variable_test_bit() Peter Zijlstra
  1 sibling, 2 replies; 31+ messages in thread
From: Linus Torvalds @ 2015-05-01 20:49 UTC (permalink / raw)
  To: Vladimir Makarov
  Cc: Jakub Jelinek, Richard Henderson, Peter Zijlstra, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List,
	Borislav Petkov

On Fri, May 1, 2015 at 12:02 PM, Vladimir Makarov <vmakarov@redhat.com> wrote:
>
>   GCC RA is a major reason to prohibit output operands for asm goto.

Hmm.. Thinking some more about it, I think that what would actually
work really well at least for the kernel is:

(a) allow *memory* operands (ie "=m") as outputs and having them be
meaningful even at any output labels (obviously with the caveat that
the asm instructions that write to memory would have to happen before
the branch ;)

This covers the somewhat common case of having magic instructions that
result in conditions that can't be tested at a C level. Things like
"bit clear and test" on x86 (with or without the lock) .

 (b) allow other operands to be meaningful onlty for the fallthrough case.

>From a register allocation standpoint, these should be the easy cases.
(a) doesn't need any register allocation of the output (only on the
input to set up the effective address of the memory location), and (b)
would explicitly mean that an "asm goto" would leave any non-memory
outputs undefined in any of the goto cases, so from a RA standpoint it
ends up being equivalent to a non-goto asm..

Hmm?

So as an example of something that the kernel does and which wants to
have an output register. is to do a load from user space that can
fault. When it faults, we obviously simply don't *have* an actual
result, and we return an error. But for the successful fallthrough
case, we get a value in a register.

I'd love to be able to write it as (this is simplified, and doesn't
worry about all the different access sizes, or the "stac/clac"
sequence to enable user accesses on modern Intel CPU's):

        asm goto(
            "1:"
            "\tmovl %0,%1\n"
            _ASM_EXTABLE(1b,%l[error])
            : "=r" (val)
            : "m" (*userptr)
            : : error);

where that "_ASM_EXTABLE()" is our magic macro for generating an
exception entry for that instruction, so that if the load takes an
exception, it will instead to to the "error" label.

But if it goes to the error label, the "val" output register really
doesn't contain anything, so we wouldn't even *want* gcc to try to do
any register allocation for the "jump to label from assembly" case.

So at least for one of the major cases that I'd like to use "asm goto"
with an output, I actually don't *want* any register allocation for
anything but the fallthrough case. And I suspect that's a
not-too-uncommon pattern - it's probably often about error handling.

                        Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] x86: Optimize variable_test_bit()
  2015-05-01 20:49       ` Linus Torvalds
@ 2015-05-01 22:22         ` Vladimir Makarov
  2015-05-02 12:39         ` Peter Zijlstra
  1 sibling, 0 replies; 31+ messages in thread
From: Vladimir Makarov @ 2015-05-01 22:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jakub Jelinek, Richard Henderson, Peter Zijlstra, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List,
	Borislav Petkov



On 01/05/15 04:49 PM, Linus Torvalds wrote:
> On Fri, May 1, 2015 at 12:02 PM, Vladimir Makarov <vmakarov@redhat.com> wrote:
>>    GCC RA is a major reason to prohibit output operands for asm goto.
> Hmm.. Thinking some more about it, I think that what would actually
> work really well at least for the kernel is:
>
> (a) allow *memory* operands (ie "=m") as outputs and having them be
> meaningful even at any output labels (obviously with the caveat that
> the asm instructions that write to memory would have to happen before
> the branch ;)
>
> This covers the somewhat common case of having magic instructions that
> result in conditions that can't be tested at a C level. Things like
> "bit clear and test" on x86 (with or without the lock) .
>
>   (b) allow other operands to be meaningful onlty for the fallthrough case.
>
>  From a register allocation standpoint, these should be the easy cases.
> (a) doesn't need any register allocation of the output (only on the
> input to set up the effective address of the memory location), and (b)
> would explicitly mean that an "asm goto" would leave any non-memory
> outputs undefined in any of the goto cases, so from a RA standpoint it
> ends up being equivalent to a non-goto asm..
Thanks for explanation what you need in the most common case.

Big part of GCC RA (at least local register allocators -- reload pass 
and LRA) besides assigning hard registers to pseudos is to make 
transformations to satisfy insn constraints.  If there is not enough 
hard registers, a pseudo can be allocated to a stack slot and if insn 
using the pseudo needs a hard register, load or/and store should be 
generated before/after the insn.  And the problem for the old (reload 
pass) and new RA (LRA) is that they were not designed to put new insns 
after an insn changing control flow.  Assigning hard registers itself is 
not an issue for asm goto case.

If I understood you correctly, you assume that just permitting =m will 
make GCC generates the correct code. Unfortunately, it is more 
complicated.  The operand can be not a memory or memory not satisfying 
memory constraint 'm'.  So still insns for moving memory satisfying 'm' 
into output operand location might be necessary after the asm goto.

We could make asm goto semantics requiring that a user should provide 
memory for such output operand (e.g. a pointer dereferrencing in your 
case) and generate an error otherwise.  By the way the same could be 
done for output *register* operand.  And user to avoid the error should 
use a local register variable (a GCC extension) as an operand. But it 
might be a bad idea with code performance point of view.

Unfortunately, the operand can be substituted by an equiv. value during 
different transformations and even if an user think it will be a memory 
before RA, it might be wrong.  Although I believe there are some cases 
where we can be sure that it will be memory (e.g. dereferrencing pointer 
which is a function argument and is not used anywhere else in 
function).  Still it makes asm goto semantics complicated imho.

We could prevent equiv. substitution for output memory operand of asm 
goto through all the optimizations but it is probably even harder task 
than implementing output reloads in *reload* pass (it is 28-year old 
pass with so many changes during its life that practically nobody can 
understand it now well and change w/o introducing a new bug).  As for 
LRA, I wrote implementing output reloads is a double task.

> Hmm?
>
> So as an example of something that the kernel does and which wants to
> have an output register. is to do a load from user space that can
> fault. When it faults, we obviously simply don't *have* an actual
> result, and we return an error. But for the successful fallthrough
> case, we get a value in a register.
>
> I'd love to be able to write it as (this is simplified, and doesn't
> worry about all the different access sizes, or the "stac/clac"
> sequence to enable user accesses on modern Intel CPU's):
>
>          asm goto(
>              "1:"
>              "\tmovl %0,%1\n"
>              _ASM_EXTABLE(1b,%l[error])
>              : "=r" (val)
>              : "m" (*userptr)
>              : : error);
>
> where that "_ASM_EXTABLE()" is our magic macro for generating an
> exception entry for that instruction, so that if the load takes an
> exception, it will instead to to the "error" label.
>
> But if it goes to the error label, the "val" output register really
> doesn't contain anything, so we wouldn't even *want* gcc to try to do
> any register allocation for the "jump to label from assembly" case.
>
> So at least for one of the major cases that I'd like to use "asm goto"
> with an output, I actually don't *want* any register allocation for
> anything but the fallthrough case. And I suspect that's a
> not-too-uncommon pattern - it's probably often about error handling.
>
>
As I wrote already if we implement output reloads after the control flow 
insn, it does not matter what operand constraint should be (memory or 
register).  Implementing it only for fall-through case simplify the task 
but not so much.  For LRA it is doable and I can do this, for reload 
pass it is very hard (requirement only memory operand can simplify the 
implementation in reload although I am not sure about it).

But may be somebody will agree to do it for reload, sorry only not me -- 
i can not think about this without flinching.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] x86: Optimize variable_test_bit()
  2015-05-01 20:49       ` Linus Torvalds
  2015-05-01 22:22         ` Vladimir Makarov
@ 2015-05-02 12:39         ` Peter Zijlstra
  2015-05-04 15:37           ` Richard Henderson
  2015-05-04 19:33           ` [RFC] Design for flag bit outputs from asms Richard Henderson
  1 sibling, 2 replies; 31+ messages in thread
From: Peter Zijlstra @ 2015-05-02 12:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vladimir Makarov, Jakub Jelinek, Richard Henderson, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List,
	Borislav Petkov

On Fri, May 01, 2015 at 01:49:52PM -0700, Linus Torvalds wrote:
> On Fri, May 1, 2015 at 12:02 PM, Vladimir Makarov <vmakarov@redhat.com> wrote:
> >
> >   GCC RA is a major reason to prohibit output operands for asm goto.
> 
> Hmm.. Thinking some more about it, I think that what would actually
> work really well at least for the kernel is:
> 
> (a) allow *memory* operands (ie "=m") as outputs and having them be
> meaningful even at any output labels (obviously with the caveat that
> the asm instructions that write to memory would have to happen before
> the branch ;)
> 
> This covers the somewhat common case of having magic instructions that
> result in conditions that can't be tested at a C level. Things like
> "bit clear and test" on x86 (with or without the lock) .

Would not something like:

static inline bool __test_and_clear_bit(long nr, volatile unsigned long *addr)
{
	bool oldbit;

	asm volatile ("btr %2, %1"
		      : "CF" (oldbit), "+m" (*addr)
		      : "Ir" (nr));

	return oldbit;
}

Be the far better solution for this? Bug 59615 comment 7 states that
they actually modeled the flags in the .md file, so the above should be
possible to implement.

Now GCC can decide to use "sbb %0, %0" to convert CF into a register
value or use "jnc" / "jc" for branches, depending on what
__test_and_clear_bit() was used for.

We don't have to (ab)use asm goto for these things anymore; furthermore
I think the above will naturally work with our __builtin_expect() hints,
whereas the asm goto stuff has a hard time with that (afaik).

That's not to say output operants for asm goto would not still be useful
for other things (like your EXTABLE example).

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] x86: Optimize variable_test_bit()
  2015-05-01 19:02     ` Vladimir Makarov
  2015-05-01 20:49       ` Linus Torvalds
@ 2015-05-02 12:43       ` Peter Zijlstra
  2015-05-04 18:07         ` Vladimir Makarov
  1 sibling, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2015-05-02 12:43 UTC (permalink / raw)
  To: Vladimir Makarov
  Cc: Jakub Jelinek, Linus Torvalds, Richard Henderson, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List,
	Borislav Petkov

On Fri, May 01, 2015 at 03:02:24PM -0400, Vladimir Makarov wrote:
>   Currently LRA is used by x86/x86-64, ARM, AARCH64, s390, and MIPS.
> PPC, SH, and ARC are moving to LRA.  All other targets are still
> reload based.
> 
>   So I could implement the output reloads in LRA, probably for the
> next GCC release.  How to enable and mostly use it for multi-target
> code like the kernel is another question.

Pretty much all inline asm is in per arch code; so one arch having
different asm features than another should not be a problem at all.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] x86: Optimize variable_test_bit()
  2015-05-01 15:16 [PATCH] x86: Optimize variable_test_bit() Peter Zijlstra
  2015-05-01 16:03 ` Linus Torvalds
@ 2015-05-04 13:42 ` Peter Zijlstra
  1 sibling, 0 replies; 31+ messages in thread
From: Peter Zijlstra @ 2015-05-04 13:42 UTC (permalink / raw)
  To: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linus Torvalds
  Cc: linux-kernel, Borislav Petkov, Jakub Jelinek

On Fri, May 01, 2015 at 05:16:30PM +0200, Peter Zijlstra wrote:

> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index cfe3b954d5e4..bcf4fa77c04f 100644
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -313,6 +313,15 @@ static __always_inline int constant_test_bit(long nr, const volatile unsigned lo
>  
>  static inline int variable_test_bit(long nr, volatile const unsigned long *addr)
>  {
> +#ifdef CC_HAVE_ASM_GOTO
> +	asm_volatile_goto ("bt %1, %0\n\t"
> +			   "jc %l[cc_label]"
> +			   : : "m" (*(unsigned long *)addr), "Ir" (nr)
> +			   : : cc_label);
> +	return 0;
> +cc_label:
> +	return 1;
> +#else

I figured I'd try both jc and jnc versions:

   text           data     bss     dec            hex   filename
12203914        1738112 1081344 15023370         e53d0a defconfig-build/vmlinux
12204228        1738112 1081344 15023684         e53e44 defconfig-build/vmlinux-jc
12203240        1738112 1081344 15022696         e53a68 defconfig-build/vmlinux-jnc

Clearly I picked the wrong one :-) It also shows there is real value in
exposing this decision to GCC.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] x86: Optimize variable_test_bit()
  2015-05-02 12:39         ` Peter Zijlstra
@ 2015-05-04 15:37           ` Richard Henderson
  2015-05-04 19:33           ` [RFC] Design for flag bit outputs from asms Richard Henderson
  1 sibling, 0 replies; 31+ messages in thread
From: Richard Henderson @ 2015-05-04 15:37 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Vladimir Makarov, Jakub Jelinek, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov

On 05/02/2015 05:39 AM, Peter Zijlstra wrote:
> On Fri, May 01, 2015 at 01:49:52PM -0700, Linus Torvalds wrote:
>> On Fri, May 1, 2015 at 12:02 PM, Vladimir Makarov <vmakarov@redhat.com> wrote:
>>>
>>>   GCC RA is a major reason to prohibit output operands for asm goto.
>>
>> Hmm.. Thinking some more about it, I think that what would actually
>> work really well at least for the kernel is:
>>
>> (a) allow *memory* operands (ie "=m") as outputs and having them be
>> meaningful even at any output labels (obviously with the caveat that
>> the asm instructions that write to memory would have to happen before
>> the branch ;)
>>
>> This covers the somewhat common case of having magic instructions that
>> result in conditions that can't be tested at a C level. Things like
>> "bit clear and test" on x86 (with or without the lock) .
> 
> Would not something like:
> 
> static inline bool __test_and_clear_bit(long nr, volatile unsigned long *addr)
> {
> 	bool oldbit;
> 
> 	asm volatile ("btr %2, %1"
> 		      : "CF" (oldbit), "+m" (*addr)
> 		      : "Ir" (nr));
> 
> 	return oldbit;
> }
> 
> Be the far better solution for this? Bug 59615 comment 7 states that
> they actually modeled the flags in the .md file, so the above should be
> possible to implement.
> 
> Now GCC can decide to use "sbb %0, %0" to convert CF into a register
> value or use "jnc" / "jc" for branches, depending on what
> __test_and_clear_bit() was used for.
> 
> We don't have to (ab)use asm goto for these things anymore; furthermore
> I think the above will naturally work with our __builtin_expect() hints,
> whereas the asm goto stuff has a hard time with that (afaik).
> 
> That's not to say output operants for asm goto would not still be useful
> for other things (like your EXTABLE example).
> 

I agree that being able to model flags outputs, and thus minimize the amount of
code actually within the asm, is superior to the complexity of asm goto.



r~

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] x86: Optimize variable_test_bit()
  2015-05-02 12:43       ` [PATCH] x86: Optimize variable_test_bit() Peter Zijlstra
@ 2015-05-04 18:07         ` Vladimir Makarov
  2015-05-04 20:14           ` H. Peter Anvin
  0 siblings, 1 reply; 31+ messages in thread
From: Vladimir Makarov @ 2015-05-04 18:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jakub Jelinek, Linus Torvalds, Richard Henderson, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List,
	Borislav Petkov



On 02/05/15 08:43 AM, Peter Zijlstra wrote:
> On Fri, May 01, 2015 at 03:02:24PM -0400, Vladimir Makarov wrote:
>>    Currently LRA is used by x86/x86-64, ARM, AARCH64, s390, and MIPS.
>> PPC, SH, and ARC are moving to LRA.  All other targets are still
>> reload based.
>>
>>    So I could implement the output reloads in LRA, probably for the
>> next GCC release.  How to enable and mostly use it for multi-target
>> code like the kernel is another question.
> Pretty much all inline asm is in per arch code; so one arch having
> different asm features than another should not be a problem at all.
Ok, then. I'll try to implement output operands for asm-goto in LRA for 
the next GCC release.

Of course, if nobody objects to changing asm goto semantics from

  An 'asm goto' statement cannot have outputs ...

to

  An 'asm goto' statement cannot have outputs on some targets ...



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [RFC] Design for flag bit outputs from asms
  2015-05-02 12:39         ` Peter Zijlstra
  2015-05-04 15:37           ` Richard Henderson
@ 2015-05-04 19:33           ` Richard Henderson
  2015-05-04 20:14             ` H. Peter Anvin
                               ` (2 more replies)
  1 sibling, 3 replies; 31+ messages in thread
From: Richard Henderson @ 2015-05-04 19:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Vladimir Makarov, Jakub Jelinek, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List,
	Borislav Petkov, gcc

On 05/02/2015 05:39 AM, Peter Zijlstra wrote:
> static inline bool __test_and_clear_bit(long nr, volatile unsigned long *addr)
> {
> 	bool oldbit;
> 
> 	asm volatile ("btr %2, %1"
> 		      : "CF" (oldbit), "+m" (*addr)
> 		      : "Ir" (nr));
> 
> 	return oldbit;
> }
> 
> Be the far better solution for this? Bug 59615 comment 7 states that
> they actually modeled the flags in the .md file, so the above should be
> possible to implement.
> 
> Now GCC can decide to use "sbb %0, %0" to convert CF into a register
> value or use "jnc" / "jc" for branches, depending on what
> __test_and_clear_bit() was used for.
> 
> We don't have to (ab)use asm goto for these things anymore; furthermore
> I think the above will naturally work with our __builtin_expect() hints,
> whereas the asm goto stuff has a hard time with that (afaik).
> 
> That's not to say output operants for asm goto would not still be useful
> for other things (like your EXTABLE example).
> 

(0) The C level output variable should be an integral type, from bool on up.

The flags are a scarse resource, easily clobbered.  We cannot allow user code
to keep data in the flags.  While x86 does have lahf/sahf, they don't exactly
perform well.  And other targets like arm don't even have that bad option.

Therefore, the language level semantics are that the output is a boolean store
into the variable with a condition specified by a magic constraint.

That said, just like the compiler should be able to optimize

        void bar(int y)
        {
          int x = (y <= 0);
          if (x) foo();
        }

such that we only use a single compare against y, the expectation is that
within a similarly constrained context the compiler will not require two tests
for these boolean outputs.

Therefore:

(1) Each target defines a set of constraint strings,

   E.g. for x86, wherein we're almost out of constraint letters,

     ja   aux carry flag
     jc   carry flag
     jo   overflow flag
     jp   parity flag
     js   sign flag
     jz   zero flag

   E.g. for arm/aarch64 (using "j" here, but other possibilities exist):

     jn   negative flag
     jc   carry flag
     jz   zero flag
     jv   overflow flag

   E.g. for s390x (I've thought less about what's useful here)

     j<m>  where m is a hex digit, and is the mask of CC values
           for which the condition is true; exactly corresponding
           to the M1 field in the branch on condition instruction.

(2) A new target hook post-processes the asm_insn, looking for the
    new constraint strings.  The hook expands the condition prescribed
    by the string, adjusting the asm_insn as required.

  E.g.

    bool x, y, z;
    asm ("xyzzy" : "=jc"(x), "=jp"(y), "=jo"(z) : : );

  originally

    (parallel [
            (set (reg:QI 83 [ x ])
                (asm_operands/v:QI ("xyzzy") ("=jc") 0 []
                     []
                     [] z.c:4))
            (set (reg:QI 84 [ y ])
                (asm_operands/v:QI ("xyzzy") ("=jp") 1 []
                     []
                     [] z.c:4))
            (set (reg:QI 85 [ z ])
                (asm_operands/v:QI ("xyzzy") ("=jo") 2 []
                     []
                     [] z.c:4))
            (clobber (reg:QI 18 fpsr))
            (clobber (reg:QI 17 flags))
        ])

  becomes

    (parallel [
            (set (reg:CC 17 flags)
                (asm_operands/v:CC ("xyzzy") ("=j_") 0 []
                     []
                     [] z.c:4))
            (clobber (reg:QI 18 fpsr))
        ])
    (set (reg:QI 83 [ x ])
         (ne:QI (reg:CCC 17 flags) (const_int 0)))
    (set (reg:QI 84 [ y ])
         (ne:QI (reg:CCP 17 flags) (const_int 0)))
    (set (reg:QI 85 [ z ])
         (ne:QI (reg:CCO 17 flags) (const_int 0)))

  which ought to assemble to something like

    xyzzy
    setc  %dl
    setp  %cl
    seto  %r15l

  Note that rtl level data flow is preserved via the flags hard register,
  and the lifetime of flags would not extended any further than we would
  for a normal cstore pattern.

  Note that the output constraints are adjusted to a single internal "=j_"
  which would match the flags register in any mode.  We can collapse
  several output flags to a single set of the flags hard register.

(3) Note that ppc is both easier and more complicated.

  There we have 8 4-bit registers, although most of the integer
  non-comparisons only write to CR0.  And the vector non-comparisons
  only write to CR1, though of course that's of less interest in the
  context of kernel code.

  For the purposes of cr0, the same scheme could certainly work, although
  the hook would not insert a hard register use, but rather a pseudo to
  be allocated to cr0 (constaint "x").

  That said, it's my understanding that "dot insns", setting cr0 are
  expensive in current processor generations.  There's also a lot less
  of the x86-style "operate and set a flag based on something useful".


Can anyone think of any drawbacks, pitfalls, or portability issues to less
popular targets that I havn't considered?



r~

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Design for flag bit outputs from asms
  2015-05-04 19:33           ` [RFC] Design for flag bit outputs from asms Richard Henderson
@ 2015-05-04 20:14             ` H. Peter Anvin
  2015-05-04 20:27               ` H. Peter Anvin
                                 ` (2 more replies)
  2015-05-05  9:01             ` Gabriel Paubert
  2015-05-05 13:50             ` Segher Boessenkool
  2 siblings, 3 replies; 31+ messages in thread
From: H. Peter Anvin @ 2015-05-04 20:14 UTC (permalink / raw)
  To: Richard Henderson, Peter Zijlstra
  Cc: Linus Torvalds, Vladimir Makarov, Jakub Jelinek, Ingo Molnar,
	Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, gcc

On 05/04/2015 12:33 PM, Richard Henderson wrote:
> 
> (0) The C level output variable should be an integral type, from bool on up.
> 
> The flags are a scarse resource, easily clobbered.  We cannot allow user code
> to keep data in the flags.  While x86 does have lahf/sahf, they don't exactly
> perform well.  And other targets like arm don't even have that bad option.
> 
> Therefore, the language level semantics are that the output is a boolean store
> into the variable with a condition specified by a magic constraint.
> 
> That said, just like the compiler should be able to optimize
> 
>         void bar(int y)
>         {
>           int x = (y <= 0);
>           if (x) foo();
>         }
> 
> such that we only use a single compare against y, the expectation is that
> within a similarly constrained context the compiler will not require two tests
> for these boolean outputs.
> 
> Therefore:
> 
> (1) Each target defines a set of constraint strings,
> 
>    E.g. for x86, wherein we're almost out of constraint letters,
> 
>      ja   aux carry flag
>      jc   carry flag
>      jo   overflow flag
>      jp   parity flag
>      js   sign flag
>      jz   zero flag
> 

I would argue that for x86 what you actually want is to model the
*conditions* that are available on the flags, not the flags themselves.
 There are 16 such conditions, 8 if we discard the inversions.

It is notable that the auxiliary carry flag has no Jcc/SETcc/CMOVcc
instructions; it is only ever consumed by the DAA/DAS instructions which
makes it pointless to try to model it in a compiler any more than, say, IF.

> (2) A new target hook post-processes the asm_insn, looking for the
>     new constraint strings.  The hook expands the condition prescribed
>     by the string, adjusting the asm_insn as required.
> 
>   E.g.
> 
>     bool x, y, z;
>     asm ("xyzzy" : "=jc"(x), "=jp"(y), "=jo"(z) : : );

Other than that, this is exactly what would be wonderful to see.

	-hpa


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] x86: Optimize variable_test_bit()
  2015-05-04 18:07         ` Vladimir Makarov
@ 2015-05-04 20:14           ` H. Peter Anvin
  0 siblings, 0 replies; 31+ messages in thread
From: H. Peter Anvin @ 2015-05-04 20:14 UTC (permalink / raw)
  To: Vladimir Makarov, Peter Zijlstra
  Cc: Jakub Jelinek, Linus Torvalds, Richard Henderson, Ingo Molnar,
	Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov

On 05/04/2015 11:07 AM, Vladimir Makarov wrote:
>>>
>>>    So I could implement the output reloads in LRA, probably for the
>>> next GCC release.  How to enable and mostly use it for multi-target
>>> code like the kernel is another question.
>> Pretty much all inline asm is in per arch code; so one arch having
>> different asm features than another should not be a problem at all.
> Ok, then. I'll try to implement output operands for asm-goto in LRA for
> the next GCC release.
> 
> Of course, if nobody objects to changing asm goto semantics from
> 
>  An 'asm goto' statement cannot have outputs ...
> 
> to
> 
>  An 'asm goto' statement cannot have outputs on some targets ...
> 

A gradual implementation should be fine.

	-hpa



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Design for flag bit outputs from asms
  2015-05-04 20:14             ` H. Peter Anvin
@ 2015-05-04 20:27               ` H. Peter Anvin
  2015-05-04 20:33               ` Richard Henderson
  2015-05-04 20:35               ` Linus Torvalds
  2 siblings, 0 replies; 31+ messages in thread
From: H. Peter Anvin @ 2015-05-04 20:27 UTC (permalink / raw)
  To: Richard Henderson, Peter Zijlstra
  Cc: Linus Torvalds, Vladimir Makarov, Jakub Jelinek, Ingo Molnar,
	Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, gcc

On 05/04/2015 01:14 PM, H. Peter Anvin wrote:
>>
>> Therefore:
>>
>> (1) Each target defines a set of constraint strings,
>>
>>    E.g. for x86, wherein we're almost out of constraint letters,
>>
>>      ja   aux carry flag
>>      jc   carry flag
>>      jo   overflow flag
>>      jp   parity flag
>>      js   sign flag
>>      jz   zero flag
>>
> 
> I would argue that for x86 what you actually want is to model the
> *conditions* that are available on the flags, not the flags themselves.
>  There are 16 such conditions, 8 if we discard the inversions.
> 
> It is notable that the auxiliary carry flag has no Jcc/SETcc/CMOVcc
> instructions; it is only ever consumed by the DAA/DAS instructions which
> makes it pointless to try to model it in a compiler any more than, say, IF.
> 

OK, let me qualify that.  This is only necessary if it is impractical
for gcc to optimize boolean combinations of flags.  If such
optimizations are available then it doesn't matter and is probably
needlessly complex.  For example:

char foo(void)
{
	bool zf, sf, of;

	asm("xyzzy" : "=jz" (zf), "=js" (sf), "=jo" (of));

	return zf || (sf != of);
}

... should compile to ...

	xyzzy
	setng %al
	ret

	-hpa



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Design for flag bit outputs from asms
  2015-05-04 20:14             ` H. Peter Anvin
  2015-05-04 20:27               ` H. Peter Anvin
@ 2015-05-04 20:33               ` Richard Henderson
  2015-05-04 20:45                 ` Linus Torvalds
  2015-05-04 20:35               ` Linus Torvalds
  2 siblings, 1 reply; 31+ messages in thread
From: Richard Henderson @ 2015-05-04 20:33 UTC (permalink / raw)
  To: H. Peter Anvin, Peter Zijlstra
  Cc: Linus Torvalds, Vladimir Makarov, Jakub Jelinek, Ingo Molnar,
	Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, gcc

On 05/04/2015 01:14 PM, H. Peter Anvin wrote:
> On 05/04/2015 12:33 PM, Richard Henderson wrote:
>>
>> (0) The C level output variable should be an integral type, from bool on up.
>>
>> The flags are a scarse resource, easily clobbered.  We cannot allow user code
>> to keep data in the flags.  While x86 does have lahf/sahf, they don't exactly
>> perform well.  And other targets like arm don't even have that bad option.
>>
>> Therefore, the language level semantics are that the output is a boolean store
>> into the variable with a condition specified by a magic constraint.
>>
>> That said, just like the compiler should be able to optimize
>>
>>         void bar(int y)
>>         {
>>           int x = (y <= 0);
>>           if (x) foo();
>>         }
>>
>> such that we only use a single compare against y, the expectation is that
>> within a similarly constrained context the compiler will not require two tests
>> for these boolean outputs.
>>
>> Therefore:
>>
>> (1) Each target defines a set of constraint strings,
>>
>>    E.g. for x86, wherein we're almost out of constraint letters,
>>
>>      ja   aux carry flag
>>      jc   carry flag
>>      jo   overflow flag
>>      jp   parity flag
>>      js   sign flag
>>      jz   zero flag
>>
> 
> I would argue that for x86 what you actually want is to model the
> *conditions* that are available on the flags, not the flags themselves.
>  There are 16 such conditions, 8 if we discard the inversions.

A fair point.  Though honestly, I was hoping that this feature would mostly be
used for conditions that are "weird" -- that is, not normally describable by
arithmetic at all.  Otherwise, why are you using inline asm for it?

> It is notable that the auxiliary carry flag has no Jcc/SETcc/CMOVcc
> instructions; it is only ever consumed by the DAA/DAS instructions which
> makes it pointless to try to model it in a compiler any more than, say, IF.

Oh yeah.  Consider that dropped.


r~

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Design for flag bit outputs from asms
  2015-05-04 20:14             ` H. Peter Anvin
  2015-05-04 20:27               ` H. Peter Anvin
  2015-05-04 20:33               ` Richard Henderson
@ 2015-05-04 20:35               ` Linus Torvalds
  2015-05-04 20:42                 ` H. Peter Anvin
  2 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2015-05-04 20:35 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Richard Henderson, Peter Zijlstra, Vladimir Makarov,
	Jakub Jelinek, Ingo Molnar, Thomas Gleixner,
	Linux Kernel Mailing List, Borislav Petkov, gcc

On Mon, May 4, 2015 at 1:14 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> I would argue that for x86 what you actually want is to model the
> *conditions* that are available on the flags, not the flags themselves.

Yes. Otherwise it would be a nightmare to try to describe simple
conditions like "le", which a rather complicated combination of three
of the actual flag bits:

    ((SF ^^ OF) || ZF) = 1

which would just be ridiculously painful for (a) the user to describe
and (b) fior the compiler to recognize once described.

Now, I do admit that most of the cases where you'd use inline asm with
condition codes would probably fall into just simple "test ZF or CF".
But I could certainly imagine other cases.

                       Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Design for flag bit outputs from asms
  2015-05-04 20:35               ` Linus Torvalds
@ 2015-05-04 20:42                 ` H. Peter Anvin
  0 siblings, 0 replies; 31+ messages in thread
From: H. Peter Anvin @ 2015-05-04 20:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Richard Henderson, Peter Zijlstra, Vladimir Makarov,
	Jakub Jelinek, Ingo Molnar, Thomas Gleixner,
	Linux Kernel Mailing List, Borislav Petkov, gcc

On 05/04/2015 01:35 PM, Linus Torvalds wrote:
> On Mon, May 4, 2015 at 1:14 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> I would argue that for x86 what you actually want is to model the
>> *conditions* that are available on the flags, not the flags themselves.
> 
> Yes. Otherwise it would be a nightmare to try to describe simple
> conditions like "le", which a rather complicated combination of three
> of the actual flag bits:
> 
>     ((SF ^^ OF) || ZF) = 1
> 
> which would just be ridiculously painful for (a) the user to describe
> and (b) fior the compiler to recognize once described.
> 
> Now, I do admit that most of the cases where you'd use inline asm with
> condition codes would probably fall into just simple "test ZF or CF".
> But I could certainly imagine other cases.
> 

Yes, although once again I'm more than happy to let gcc do the boolean
optimizations if it already has logic to do so (which it might have/want
for its own reasons.)

	-hpa



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Design for flag bit outputs from asms
  2015-05-04 20:33               ` Richard Henderson
@ 2015-05-04 20:45                 ` Linus Torvalds
  2015-05-04 20:57                   ` Richard Henderson
  0 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2015-05-04 20:45 UTC (permalink / raw)
  To: Richard Henderson
  Cc: H. Peter Anvin, Peter Zijlstra, Vladimir Makarov, Jakub Jelinek,
	Ingo Molnar, Thomas Gleixner, Linux Kernel Mailing List,
	Borislav Petkov, gcc

On Mon, May 4, 2015 at 1:33 PM, Richard Henderson <rth@redhat.com> wrote:
>
> A fair point.  Though honestly, I was hoping that this feature would mostly be
> used for conditions that are "weird" -- that is, not normally describable by
> arithmetic at all.  Otherwise, why are you using inline asm for it?

I could easily imagine using some of the combinations for atomic operations.

For example, doing a "lock decl", and wanting to see if the result is
negative or zero. Sure, it would be possible to set *two* booleans (ZF
and SF), but there's a contiional for "BE"..

                    Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Design for flag bit outputs from asms
  2015-05-04 20:45                 ` Linus Torvalds
@ 2015-05-04 20:57                   ` Richard Henderson
  2015-05-04 21:23                     ` H. Peter Anvin
  0 siblings, 1 reply; 31+ messages in thread
From: Richard Henderson @ 2015-05-04 20:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Peter Zijlstra, Vladimir Makarov, Jakub Jelinek,
	Ingo Molnar, Thomas Gleixner, Linux Kernel Mailing List,
	Borislav Petkov, gcc

On 05/04/2015 01:45 PM, Linus Torvalds wrote:
> On Mon, May 4, 2015 at 1:33 PM, Richard Henderson <rth@redhat.com> wrote:
>>
>> A fair point.  Though honestly, I was hoping that this feature would mostly be
>> used for conditions that are "weird" -- that is, not normally describable by
>> arithmetic at all.  Otherwise, why are you using inline asm for it?
> 
> I could easily imagine using some of the combinations for atomic operations.
> 
> For example, doing a "lock decl", and wanting to see if the result is
> negative or zero. Sure, it would be possible to set *two* booleans (ZF
> and SF), but there's a contiional for "BE"..

Sure.

I'd be more inclined to support these compound conditionals directly, rather
than try to get the compiler to recognize them after the fact.

Indeed, I believe we have a near complete set of them in the x86 backend
already.  It'd just be a matter of selecting the spellings for the constraints.


r~

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Design for flag bit outputs from asms
  2015-05-04 20:57                   ` Richard Henderson
@ 2015-05-04 21:23                     ` H. Peter Anvin
  0 siblings, 0 replies; 31+ messages in thread
From: H. Peter Anvin @ 2015-05-04 21:23 UTC (permalink / raw)
  To: Richard Henderson, Linus Torvalds
  Cc: Peter Zijlstra, Vladimir Makarov, Jakub Jelinek, Ingo Molnar,
	Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, gcc

On 05/04/2015 01:57 PM, Richard Henderson wrote:
> 
> Sure.
> 
> I'd be more inclined to support these compound conditionals directly, rather
> than try to get the compiler to recognize them after the fact.
> 
> Indeed, I believe we have a near complete set of them in the x86 backend
> already.  It'd just be a matter of selecting the spellings for the constraints.
> 

Whichever works for you.

The full set of conditions, mnemonics, and a bitmask with the bits in
the order from MSB to LSB (OF,SF,ZF,PF,CF) which is probably the sanest
way to model these for the purpose of boolean optimization.

Opcode	Mnemonics	Condition		Bitmask
0	o		OF			0xffff0000
1	no		!OF			0x0000ffff
2	b/c/nae		CF			0xaaaaaaaa
3	ae/nb/nc	!CF			0x55555555
4	e/z		ZF			0xf0f0f0f0
5	ne/nz		!ZF			0x0f0f0f0f
6	na		CF || ZF		0xfafafafa
7	a		!CF && !ZF		0x05050505
8	s		SF			0xff00ff00
9	ns		!SF			0x00ff00ff
A	p/pe		PF			0xcccccccc
B	np/po		!PF			0x33333333
C	l/nge		SF != OF		0x00ffff00
D	ge/nl		SF == OF		0xff0000ff
E	le/ng		ZF || (SF != OF)	0xf0fffff0
F	g/nle		!ZF && (SF == OF)	0x0f00000f

	-hpa


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Design for flag bit outputs from asms
  2015-05-04 19:33           ` [RFC] Design for flag bit outputs from asms Richard Henderson
  2015-05-04 20:14             ` H. Peter Anvin
@ 2015-05-05  9:01             ` Gabriel Paubert
  2015-05-05 13:50             ` Segher Boessenkool
  2 siblings, 0 replies; 31+ messages in thread
From: Gabriel Paubert @ 2015-05-05  9:01 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Peter Zijlstra, Linus Torvalds, Vladimir Makarov, Jakub Jelinek,
	Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Linux Kernel Mailing List, Borislav Petkov, gcc

On Mon, May 04, 2015 at 12:33:38PM -0700, Richard Henderson wrote:
[snipped]
> (3) Note that ppc is both easier and more complicated.
> 
>   There we have 8 4-bit registers, although most of the integer
>   non-comparisons only write to CR0.  And the vector non-comparisons
>   only write to CR1, though of course that's of less interest in the
>   context of kernel code.

Actually vector (Altivec) write to CR6. Standard FPU optionally write to
CR1, but the written value does not exactly depend on the result of the last
instruction; it is an instead an accrued exception status.

> 
>   For the purposes of cr0, the same scheme could certainly work, although
>   the hook would not insert a hard register use, but rather a pseudo to
>   be allocated to cr0 (constaint "x").

Yes, but we might also want to leave the choice of a cr register to the compiler.

> 
>   That said, it's my understanding that "dot insns", setting cr0 are
>   expensive in current processor generations.  

Not that much if I understand properly power7.md and power8.md: 
no (P7) or one (P8) additional clock for common instructions 
(add/sub and logical), but nothing else, so they are likely a win. 

Shift/rotate/sign extensions seem to have more decoding restrictions: 
the recording ("dot") forms are "cracked" and use 2 integer units.

>   There's also a lot less
>   of the x86-style "operate and set a flag based on something useful".
> 

But there is at least an important one, which I occasionally wished I had: 
the conditional stores. 

The overflow bit might also be useful, not really 
for the kernel, but for applications (and mfxer is slow).

    Regards,
    Gabriel

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Design for flag bit outputs from asms
  2015-05-04 19:33           ` [RFC] Design for flag bit outputs from asms Richard Henderson
  2015-05-04 20:14             ` H. Peter Anvin
  2015-05-05  9:01             ` Gabriel Paubert
@ 2015-05-05 13:50             ` Segher Boessenkool
  2015-05-05 15:37               ` Linus Torvalds
  2 siblings, 1 reply; 31+ messages in thread
From: Segher Boessenkool @ 2015-05-05 13:50 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Peter Zijlstra, Linus Torvalds, Vladimir Makarov, Jakub Jelinek,
	Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Linux Kernel Mailing List, Borislav Petkov, gcc

On Mon, May 04, 2015 at 12:33:38PM -0700, Richard Henderson wrote:
> (1) Each target defines a set of constraint strings,

> (2) A new target hook post-processes the asm_insn, looking for the
>     new constraint strings.  The hook expands the condition prescribed
>     by the string, adjusting the asm_insn as required.

Since it is pre-processed, there is no real reason to overlap this with
the constraints namespace; we could have e.g. "=@[xy]" (and "@[xy]" for
inputs) mean the target needs to do some "xy" transform here.

>   Note that the output constraints are adjusted to a single internal "=j_"
>   which would match the flags register in any mode.  We can collapse
>   several output flags to a single set of the flags hard register.

Many targets would use an already existing contraint that describes the
flags.  Targets that need a fixed register could just insert the hard
register here as far as I see?  (I'm assuming this happens at expand time).

> (3) Note that ppc is both easier and more complicated.
> 
>   There we have 8 4-bit registers, although most of the integer
>   non-comparisons only write to CR0.  And the vector non-comparisons
>   only write to CR1, though of course that's of less interest in the
>   context of kernel code.
> 
>   For the purposes of cr0, the same scheme could certainly work, although
>   the hook would not insert a hard register use, but rather a pseudo to
>   be allocated to cr0 (constaint "x").

And "y" for "any CR field".

>   That said, it's my understanding that "dot insns", setting cr0 are
>   expensive in current processor generations.

They are not.  (Cell BE is not "current" :-) )

PowerPC also has some other bits (the carry bit for example, CA) that
could be usefully exposed via this mechanism.

> Can anyone think of any drawbacks, pitfalls, or portability issues to less
> popular targets that I havn't considered?

I don't like co-opting the constraint names for this; other than that, it
looks quite good :-)


Segher

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Design for flag bit outputs from asms
  2015-05-05 13:50             ` Segher Boessenkool
@ 2015-05-05 15:37               ` Linus Torvalds
  2015-05-05 16:10                 ` Segher Boessenkool
  0 siblings, 1 reply; 31+ messages in thread
From: Linus Torvalds @ 2015-05-05 15:37 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Richard Henderson, Peter Zijlstra, Vladimir Makarov,
	Jakub Jelinek, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Linux Kernel Mailing List, Borislav Petkov, gcc

On Tue, May 5, 2015 at 6:50 AM, Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> Since it is pre-processed, there is no real reason to overlap this with
> the constraints namespace; we could have e.g. "=@[xy]" (and "@[xy]" for
> inputs) mean the target needs to do some "xy" transform here.

In fact, standing out visually would be just a good thing, since it's
pretty special even from a usage standpoint.

And are you actually planning to have flags as inputs? Because *that*
sounds like a bad idea. It's pretty hard to turn a boolean into a flag
value, while pretty much any archiecture has an operation like "setcc"
to go the other way. And I don't think your machine descriptions have
anything to "generate flags". You'd have to add fragile and complex
machinery for something it is unlikely anybody ever wants.

Flag *outputs* people definitely want. Flag inputs? Yeah, I can
absolutely see the carry flag being useful for multi-precision
arithmetic, but it's *so* hard to guarantee that it still is live,
that in practice the compiler would likely have to re-generate it from
a value anyway, so ...

So I'd go for output-only, and make the syntax be something very
visually unambiguous. That "=@[xy]" format looks fine, where "xy"
would be very architecture-dependent.

Or make it even *more* specific by using "CC" for condition codes, and
make the syntax "=@CC[xy]", in case you ever want to use the "@"
marker for any other kind of magic constraint.

                         Linus

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC] Design for flag bit outputs from asms
  2015-05-05 15:37               ` Linus Torvalds
@ 2015-05-05 16:10                 ` Segher Boessenkool
  0 siblings, 0 replies; 31+ messages in thread
From: Segher Boessenkool @ 2015-05-05 16:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Richard Henderson, Peter Zijlstra, Vladimir Makarov,
	Jakub Jelinek, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Linux Kernel Mailing List, Borislav Petkov, gcc

On Tue, May 05, 2015 at 08:37:01AM -0700, Linus Torvalds wrote:
> On Tue, May 5, 2015 at 6:50 AM, Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> >
> > Since it is pre-processed, there is no real reason to overlap this with
> > the constraints namespace; we could have e.g. "=@[xy]" (and "@[xy]" for
> > inputs) mean the target needs to do some "xy" transform here.
> 
> In fact, standing out visually would be just a good thing, since it's
> pretty special even from a usage standpoint.
> 
> And are you actually planning to have flags as inputs? Because *that*
> sounds like a bad idea. It's pretty hard to turn a boolean into a flag
> value, while pretty much any archiecture has an operation like "setcc"
> to go the other way. And I don't think your machine descriptions have
> anything to "generate flags". You'd have to add fragile and complex
> machinery for something it is unlikely anybody ever wants.

It isn't hard (or expensive) to turn integers into flags, on many
targets.  It is nice to allow this at least in the generic part of
the code -- what targets do in their target hook is up to them.

It isn't fragile or complex.  Not useful on some archs, yes I certainly
believe that.  But the lovely thing about Richard's proposal is that it
actually is a very simple addition to what the compiler already does,
there are no hard new optimisations needed, it's just a bit of munging
to allow the user to write an asm with condition code in/outs.  Allowing
inputs is just another bool argument to the target hook.  I'd rather
have this more orthogonal than more specialised; it can be used for much
more than just condition codes.  It's not like the "more general" syntax
would be a burden, as far as I see.


Segher

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2015-05-05 18:28 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-01 15:16 [PATCH] x86: Optimize variable_test_bit() Peter Zijlstra
2015-05-01 16:03 ` Linus Torvalds
2015-05-01 16:16   ` Peter Zijlstra
2015-05-01 16:29     ` Peter Zijlstra
2015-05-01 16:18   ` Peter Zijlstra
2015-05-01 16:33   ` Jakub Jelinek
2015-05-01 16:45     ` Linus Torvalds
2015-05-01 16:46     ` Peter Zijlstra
2015-05-01 17:17       ` Ingo Molnar
2015-05-01 19:02     ` Vladimir Makarov
2015-05-01 20:49       ` Linus Torvalds
2015-05-01 22:22         ` Vladimir Makarov
2015-05-02 12:39         ` Peter Zijlstra
2015-05-04 15:37           ` Richard Henderson
2015-05-04 19:33           ` [RFC] Design for flag bit outputs from asms Richard Henderson
2015-05-04 20:14             ` H. Peter Anvin
2015-05-04 20:27               ` H. Peter Anvin
2015-05-04 20:33               ` Richard Henderson
2015-05-04 20:45                 ` Linus Torvalds
2015-05-04 20:57                   ` Richard Henderson
2015-05-04 21:23                     ` H. Peter Anvin
2015-05-04 20:35               ` Linus Torvalds
2015-05-04 20:42                 ` H. Peter Anvin
2015-05-05  9:01             ` Gabriel Paubert
2015-05-05 13:50             ` Segher Boessenkool
2015-05-05 15:37               ` Linus Torvalds
2015-05-05 16:10                 ` Segher Boessenkool
2015-05-02 12:43       ` [PATCH] x86: Optimize variable_test_bit() Peter Zijlstra
2015-05-04 18:07         ` Vladimir Makarov
2015-05-04 20:14           ` H. Peter Anvin
2015-05-04 13:42 ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).