* [PATCH] x86: Optimize variable_test_bit() @ 2015-05-01 15:16 Peter Zijlstra 2015-05-01 16:03 ` Linus Torvalds 2015-05-04 13:42 ` Peter Zijlstra 0 siblings, 2 replies; 31+ messages in thread From: Peter Zijlstra @ 2015-05-01 15:16 UTC (permalink / raw) To: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linus Torvalds Cc: linux-kernel, Borislav Petkov, Jakub Jelinek While looking at some asm I noticed we produce the most horrid code for test_bit(): 1a5e: 49 0f a3 30 bt %rsi,(%r8) 1a62: 45 19 c0 sbb %r8d,%r8d 1a65: 45 85 c0 test %r8d,%r8d 1a68: 75 a6 jne 1a10 <x86_schedule_events+0xc0> Since test_bit() doesn't actually have any output variables, we can use asm goto without having to add a memory clobber. This reduces the code to something sensible: 1a12: 49 0f a3 30 bt %rsi,(%r8) 1a16: 72 68 jb 1a80 <x86_schedule_events+0x130> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- PS. should we kill the memory clobber for __test_and_change_bit()? It seems inconsistent and out of place. PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto; is this something we can get 'fixed' ? arch/x86/include/asm/bitops.h | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h index cfe3b954d5e4..bcf4fa77c04f 100644 --- a/arch/x86/include/asm/bitops.h +++ b/arch/x86/include/asm/bitops.h @@ -313,6 +313,15 @@ static __always_inline int constant_test_bit(long nr, const volatile unsigned lo static inline int variable_test_bit(long nr, volatile const unsigned long *addr) { +#ifdef CC_HAVE_ASM_GOTO + asm_volatile_goto ("bt %1, %0\n\t" + "jc %l[cc_label]" + : : "m" (*(unsigned long *)addr), "Ir" (nr) + : : cc_label); + return 0; +cc_label: + return 1; +#else int oldbit; asm volatile("bt %2,%1\n\t" @@ -321,6 +330,7 @@ static inline int variable_test_bit(long nr, volatile const unsigned long *addr) : "m" (*(unsigned long *)addr), "Ir" (nr)); return oldbit; +#endif } #if 0 /* Fool kernel-doc since it doesn't do macros yet */ ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH] x86: Optimize variable_test_bit() 2015-05-01 15:16 [PATCH] x86: Optimize variable_test_bit() Peter Zijlstra @ 2015-05-01 16:03 ` Linus Torvalds 2015-05-01 16:16 ` Peter Zijlstra ` (2 more replies) 2015-05-04 13:42 ` Peter Zijlstra 1 sibling, 3 replies; 31+ messages in thread From: Linus Torvalds @ 2015-05-01 16:03 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, Jakub Jelinek On Fri, May 1, 2015 at 8:16 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > Since test_bit() doesn't actually have any output variables, we can use > asm goto without having to add a memory clobber. This reduces the code > to something sensible: Yes, looks good, except if we have anything that actually wants to use the value rather than branch on it. But a quick grep seems to show that the vast majority of them are all about just directly testing the result. It worries me a bit that gcc now cannot pick the likely branch any more. It will always branch out for the bit being set. So code like this: net/core/dev.c: if (likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) wouldn't work, but almost all of those seem to be the constant case that doesn't get to this anyway. So on the whole, this would seem to be a win. > PS. should we kill the memory clobber for __test_and_change_bit()? It > seems inconsistent and out of place. We don't seem to have it for the clear case, so yeah.. > PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto; > is this something we can get 'fixed' ? I suspect the problem is that now the particular register allocation choices are basically not just around the asm, they'd affect the target labels of the asm too. I think that for the kernel, it would *generally* be ok to just say that the outputs are only valid in the case the asm does *not* branch out, assuming that the *clobbers* obviously clobber things regardless. Keeping the register allocation for the asm itself still purely "local" to the asm. Something with a memory output we could just turn into a memory clobber (so we could do the test-and-change bits today without using any outputs - just mark memory clobbered). Don't get me wrong - it would be even better if outputs would be valid even for the labels the asm can jump to, but I can see that being fundamentally hard. For example, two different asm goto's that share one label, but that have different output register allocation. That would be a nightmare (the compiler could basically have to duplicate the label etc - although maybe you have to do that anyway). And many of the places where I would personally like to use "asm goto" are where the goto label is for an error case. Things like a failed cmpxchg, or a failed user access etc. It would generally be ok if the output values from the asm were "lost", because it's about the cleanup (or trying again).. So we might be ok in many cases with that kind of "weaker output", where the output is only valid when the goto is *not* taken. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] x86: Optimize variable_test_bit() 2015-05-01 16:03 ` Linus Torvalds @ 2015-05-01 16:16 ` Peter Zijlstra 2015-05-01 16:29 ` Peter Zijlstra 2015-05-01 16:18 ` Peter Zijlstra 2015-05-01 16:33 ` Jakub Jelinek 2 siblings, 1 reply; 31+ messages in thread From: Peter Zijlstra @ 2015-05-01 16:16 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, Jakub Jelinek On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote: > On Fri, May 1, 2015 at 8:16 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > > > Since test_bit() doesn't actually have any output variables, we can use > > asm goto without having to add a memory clobber. This reduces the code > > to something sensible: > > Yes, looks good, except if we have anything that actually wants to use > the value rather than branch on it. But a quick grep seems to show > that the vast majority of them are all about just directly testing the > result. > > It worries me a bit that gcc now cannot pick the likely branch any > more. It will always branch out for the bit being set. So code like > this: > > net/core/dev.c: if > (likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) > > wouldn't work, but almost all of those seem to be the constant case > that doesn't get to this anyway. Ah yes, that's another thing we've previously discussed with the GCC people (IIRC). The GCC manual states you can use hot and cold attributes on the labels (although when we tested that it didn't actually work, it might now). But that's no good if the hint is one (or more) layer up from the asm goto. If would indeed be very good if the likely/unlikely thing would work as expected. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] x86: Optimize variable_test_bit() 2015-05-01 16:16 ` Peter Zijlstra @ 2015-05-01 16:29 ` Peter Zijlstra 0 siblings, 0 replies; 31+ messages in thread From: Peter Zijlstra @ 2015-05-01 16:29 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, Jakub Jelinek On Fri, May 01, 2015 at 06:16:54PM +0200, Peter Zijlstra wrote: > On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote: > > On Fri, May 1, 2015 at 8:16 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > Since test_bit() doesn't actually have any output variables, we can use > > > asm goto without having to add a memory clobber. This reduces the code > > > to something sensible: > > > > Yes, looks good, except if we have anything that actually wants to use > > the value rather than branch on it. But a quick grep seems to show > > that the vast majority of them are all about just directly testing the > > result. > > > > It worries me a bit that gcc now cannot pick the likely branch any > > more. It will always branch out for the bit being set. So code like > > this: > > > > net/core/dev.c: if > > (likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) > > > > wouldn't work, but almost all of those seem to be the constant case > > that doesn't get to this anyway. > > Ah yes, that's another thing we've previously discussed with the GCC > people (IIRC). The GCC manual states you can use hot and cold attributes > on the labels (although when we tested that it didn't actually work, it > might now). But that's no good if the hint is one (or more) layer up > from the asm goto. > > If would indeed be very good if the likely/unlikely thing would work as > expected. Ah, I see what you meant. Yes we're stuck with the 'jc', the compiler cannot flip that into a jnc. The best it can do is add unconditional jumps to re-arrange the blocks, and that's somewhat ugly indeed, although better than nothing at all. And from experiments back when we did the static_branch stuff that all didn't work at all. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] x86: Optimize variable_test_bit() 2015-05-01 16:03 ` Linus Torvalds 2015-05-01 16:16 ` Peter Zijlstra @ 2015-05-01 16:18 ` Peter Zijlstra 2015-05-01 16:33 ` Jakub Jelinek 2 siblings, 0 replies; 31+ messages in thread From: Peter Zijlstra @ 2015-05-01 16:18 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, Jakub Jelinek On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote: > On Fri, May 1, 2015 at 8:16 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto; > > is this something we can get 'fixed' ? > > I suspect the problem is that now the particular register allocation > choices are basically not just around the asm, they'd affect the > target labels of the asm too. > > I think that for the kernel, it would *generally* be ok to just say > that the outputs are only valid in the case the asm does *not* branch > out, assuming that the *clobbers* obviously clobber things regardless. > Keeping the register allocation for the asm itself still purely > "local" to the asm. > > Something with a memory output we could just turn into a memory > clobber (so we could do the test-and-change bits today without using > any outputs - just mark memory clobbered). The risk is of course that we'll cause too much stores and reloads around them and regress instead of win. A single variable clobber might be a solution here ? ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] x86: Optimize variable_test_bit() 2015-05-01 16:03 ` Linus Torvalds 2015-05-01 16:16 ` Peter Zijlstra 2015-05-01 16:18 ` Peter Zijlstra @ 2015-05-01 16:33 ` Jakub Jelinek 2015-05-01 16:45 ` Linus Torvalds ` (2 more replies) 2 siblings, 3 replies; 31+ messages in thread From: Jakub Jelinek @ 2015-05-01 16:33 UTC (permalink / raw) To: Linus Torvalds, Richard Henderson, Vladimir Makarov Cc: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote: > > PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto; > > is this something we can get 'fixed' ? CCing Richard as author of asm goto and Vlad as register allocator maintainer. There are a few enhancement requests to support this, like http://gcc.gnu.org/PR59615 and http://gcc.gnu.org/PR52381 , but indeed the reason why no outputs are allowed is the register allocation issue. Don't know if LRA would be better suited to handle that case, but it would indeed be pretty hard. Jakub ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] x86: Optimize variable_test_bit() 2015-05-01 16:33 ` Jakub Jelinek @ 2015-05-01 16:45 ` Linus Torvalds 2015-05-01 16:46 ` Peter Zijlstra 2015-05-01 19:02 ` Vladimir Makarov 2 siblings, 0 replies; 31+ messages in thread From: Linus Torvalds @ 2015-05-01 16:45 UTC (permalink / raw) To: Jakub Jelinek Cc: Richard Henderson, Vladimir Makarov, Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov On Fri, May 1, 2015 at 9:33 AM, Jakub Jelinek <jakub@redhat.com> wrote: > > CCing Richard as author of asm goto and Vlad as register allocator > maintainer. There are a few enhancement requests to support this, like > http://gcc.gnu.org/PR59615 and http://gcc.gnu.org/PR52381 , but indeed the > reason why no outputs are allowed is the register allocation issue. So would it help if it was documented that output registers would only be valid for the "fallthrough" case of an asm goto, and would be considered undefined for any jump targets? It wouldn't be the perfect situation, but it would be better than not having any outputs at all, and as mentioned, it would probably be sufficient for a fair number of cases. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] x86: Optimize variable_test_bit() 2015-05-01 16:33 ` Jakub Jelinek 2015-05-01 16:45 ` Linus Torvalds @ 2015-05-01 16:46 ` Peter Zijlstra 2015-05-01 17:17 ` Ingo Molnar 2015-05-01 19:02 ` Vladimir Makarov 2 siblings, 1 reply; 31+ messages in thread From: Peter Zijlstra @ 2015-05-01 16:46 UTC (permalink / raw) To: Jakub Jelinek Cc: Linus Torvalds, Richard Henderson, Vladimir Makarov, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov On Fri, May 01, 2015 at 06:33:29PM +0200, Jakub Jelinek wrote: > On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote: > > > PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto; > > > is this something we can get 'fixed' ? > > CCing Richard as author of asm goto and Vlad as register allocator > maintainer. There are a few enhancement requests to support this, like > http://gcc.gnu.org/PR59615 and http://gcc.gnu.org/PR52381 , but indeed the > reason why no outputs are allowed is the register allocation issue. > Don't know if LRA would be better suited to handle that case, but it would > indeed be pretty hard. So it would b awesome if we could use these freshly modeled flags as output for regular asm stmts; that would obviate much of the asm goto hackery we now do/have and allow gcc to pick the right branch for likely/unlikely. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] x86: Optimize variable_test_bit() 2015-05-01 16:46 ` Peter Zijlstra @ 2015-05-01 17:17 ` Ingo Molnar 0 siblings, 0 replies; 31+ messages in thread From: Ingo Molnar @ 2015-05-01 17:17 UTC (permalink / raw) To: Peter Zijlstra Cc: Jakub Jelinek, Linus Torvalds, Richard Henderson, Vladimir Makarov, H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov * Peter Zijlstra <peterz@infradead.org> wrote: > On Fri, May 01, 2015 at 06:33:29PM +0200, Jakub Jelinek wrote: > > On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote: > > > > PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto; > > > > is this something we can get 'fixed' ? > > > > CCing Richard as author of asm goto and Vlad as register allocator > > maintainer. There are a few enhancement requests to support this, > > like http://gcc.gnu.org/PR59615 and http://gcc.gnu.org/PR52381 , > > but indeed the reason why no outputs are allowed is the register > > allocation issue. Don't know if LRA would be better suited to > > handle that case, but it would indeed be pretty hard. > > So it would b awesome if we could use these freshly modeled flags as > output for regular asm stmts; that would obviate much of the asm > goto hackery we now do/have and allow gcc to pick the right branch > for likely/unlikely. If I may hijack the discussion a bit: it would also be awesome if there was a GCC flag that would allow us to use __builtin_expect() hints even when automatic branch heuristics are disabled: I.e. very similar to -fno-guess-branch-probability, just that explicit __builtin_expect() hints would not be ignored (like -fno-guess-branch-probability does it today). We could use this to compress the kernel instruction cache footprint by about 5% on x86-64, while still having all the hand-made optimizations that __builtin_expect() allows us. It would be a perfect solution if -fno-guess-branch-probability just stopped ignoring __builtin_expect(). Thanks, Ingo ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] x86: Optimize variable_test_bit() 2015-05-01 16:33 ` Jakub Jelinek 2015-05-01 16:45 ` Linus Torvalds 2015-05-01 16:46 ` Peter Zijlstra @ 2015-05-01 19:02 ` Vladimir Makarov 2015-05-01 20:49 ` Linus Torvalds 2015-05-02 12:43 ` [PATCH] x86: Optimize variable_test_bit() Peter Zijlstra 2 siblings, 2 replies; 31+ messages in thread From: Vladimir Makarov @ 2015-05-01 19:02 UTC (permalink / raw) To: Jakub Jelinek, Linus Torvalds, Richard Henderson Cc: Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov On 01/05/15 12:33 PM, Jakub Jelinek wrote: > On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote: >>> PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto; >>> is this something we can get 'fixed' ? > CCing Richard as author of asm goto and Vlad as register allocator > maintainer. There are a few enhancement requests to support this, like > http://gcc.gnu.org/PR59615 and http://gcc.gnu.org/PR52381 , but indeed the > reason why no outputs are allowed is the register allocation issue. > Don't know if LRA would be better suited to handle that case, but it would > indeed be pretty hard. > > GCC RA is a major reason to prohibit output operands for asm goto. Reload pass was not designed to deal with output reloads for control flow insns. It is very hard to implement this feature there and implement it in a reliable way. Also nobody does any development for reload for long time. So I doubt that somebody would do this for reload. LRA is more suitable to implement the feature. In general, even outputs used on any branch can be permitted. Although critical edges can complicate the implementation as new BBs are created. But it is doable too. The only problem is that asm goto semantics in this case should be defineddepending on what local register allocator (reload or LRA) GCC for given target use. Currently LRA is used by x86/x86-64, ARM, AARCH64, s390, and MIPS. PPC, SH, and ARC are moving to LRA. All other targets are still reload based. So I could implement the output reloads in LRA, probably for the next GCC release. How to enable and mostly use it for multi-target code like the kernel is another question. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] x86: Optimize variable_test_bit() 2015-05-01 19:02 ` Vladimir Makarov @ 2015-05-01 20:49 ` Linus Torvalds 2015-05-01 22:22 ` Vladimir Makarov 2015-05-02 12:39 ` Peter Zijlstra 2015-05-02 12:43 ` [PATCH] x86: Optimize variable_test_bit() Peter Zijlstra 1 sibling, 2 replies; 31+ messages in thread From: Linus Torvalds @ 2015-05-01 20:49 UTC (permalink / raw) To: Vladimir Makarov Cc: Jakub Jelinek, Richard Henderson, Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov On Fri, May 1, 2015 at 12:02 PM, Vladimir Makarov <vmakarov@redhat.com> wrote: > > GCC RA is a major reason to prohibit output operands for asm goto. Hmm.. Thinking some more about it, I think that what would actually work really well at least for the kernel is: (a) allow *memory* operands (ie "=m") as outputs and having them be meaningful even at any output labels (obviously with the caveat that the asm instructions that write to memory would have to happen before the branch ;) This covers the somewhat common case of having magic instructions that result in conditions that can't be tested at a C level. Things like "bit clear and test" on x86 (with or without the lock) . (b) allow other operands to be meaningful onlty for the fallthrough case. >From a register allocation standpoint, these should be the easy cases. (a) doesn't need any register allocation of the output (only on the input to set up the effective address of the memory location), and (b) would explicitly mean that an "asm goto" would leave any non-memory outputs undefined in any of the goto cases, so from a RA standpoint it ends up being equivalent to a non-goto asm.. Hmm? So as an example of something that the kernel does and which wants to have an output register. is to do a load from user space that can fault. When it faults, we obviously simply don't *have* an actual result, and we return an error. But for the successful fallthrough case, we get a value in a register. I'd love to be able to write it as (this is simplified, and doesn't worry about all the different access sizes, or the "stac/clac" sequence to enable user accesses on modern Intel CPU's): asm goto( "1:" "\tmovl %0,%1\n" _ASM_EXTABLE(1b,%l[error]) : "=r" (val) : "m" (*userptr) : : error); where that "_ASM_EXTABLE()" is our magic macro for generating an exception entry for that instruction, so that if the load takes an exception, it will instead to to the "error" label. But if it goes to the error label, the "val" output register really doesn't contain anything, so we wouldn't even *want* gcc to try to do any register allocation for the "jump to label from assembly" case. So at least for one of the major cases that I'd like to use "asm goto" with an output, I actually don't *want* any register allocation for anything but the fallthrough case. And I suspect that's a not-too-uncommon pattern - it's probably often about error handling. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] x86: Optimize variable_test_bit() 2015-05-01 20:49 ` Linus Torvalds @ 2015-05-01 22:22 ` Vladimir Makarov 2015-05-02 12:39 ` Peter Zijlstra 1 sibling, 0 replies; 31+ messages in thread From: Vladimir Makarov @ 2015-05-01 22:22 UTC (permalink / raw) To: Linus Torvalds Cc: Jakub Jelinek, Richard Henderson, Peter Zijlstra, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov On 01/05/15 04:49 PM, Linus Torvalds wrote: > On Fri, May 1, 2015 at 12:02 PM, Vladimir Makarov <vmakarov@redhat.com> wrote: >> GCC RA is a major reason to prohibit output operands for asm goto. > Hmm.. Thinking some more about it, I think that what would actually > work really well at least for the kernel is: > > (a) allow *memory* operands (ie "=m") as outputs and having them be > meaningful even at any output labels (obviously with the caveat that > the asm instructions that write to memory would have to happen before > the branch ;) > > This covers the somewhat common case of having magic instructions that > result in conditions that can't be tested at a C level. Things like > "bit clear and test" on x86 (with or without the lock) . > > (b) allow other operands to be meaningful onlty for the fallthrough case. > > From a register allocation standpoint, these should be the easy cases. > (a) doesn't need any register allocation of the output (only on the > input to set up the effective address of the memory location), and (b) > would explicitly mean that an "asm goto" would leave any non-memory > outputs undefined in any of the goto cases, so from a RA standpoint it > ends up being equivalent to a non-goto asm.. Thanks for explanation what you need in the most common case. Big part of GCC RA (at least local register allocators -- reload pass and LRA) besides assigning hard registers to pseudos is to make transformations to satisfy insn constraints. If there is not enough hard registers, a pseudo can be allocated to a stack slot and if insn using the pseudo needs a hard register, load or/and store should be generated before/after the insn. And the problem for the old (reload pass) and new RA (LRA) is that they were not designed to put new insns after an insn changing control flow. Assigning hard registers itself is not an issue for asm goto case. If I understood you correctly, you assume that just permitting =m will make GCC generates the correct code. Unfortunately, it is more complicated. The operand can be not a memory or memory not satisfying memory constraint 'm'. So still insns for moving memory satisfying 'm' into output operand location might be necessary after the asm goto. We could make asm goto semantics requiring that a user should provide memory for such output operand (e.g. a pointer dereferrencing in your case) and generate an error otherwise. By the way the same could be done for output *register* operand. And user to avoid the error should use a local register variable (a GCC extension) as an operand. But it might be a bad idea with code performance point of view. Unfortunately, the operand can be substituted by an equiv. value during different transformations and even if an user think it will be a memory before RA, it might be wrong. Although I believe there are some cases where we can be sure that it will be memory (e.g. dereferrencing pointer which is a function argument and is not used anywhere else in function). Still it makes asm goto semantics complicated imho. We could prevent equiv. substitution for output memory operand of asm goto through all the optimizations but it is probably even harder task than implementing output reloads in *reload* pass (it is 28-year old pass with so many changes during its life that practically nobody can understand it now well and change w/o introducing a new bug). As for LRA, I wrote implementing output reloads is a double task. > Hmm? > > So as an example of something that the kernel does and which wants to > have an output register. is to do a load from user space that can > fault. When it faults, we obviously simply don't *have* an actual > result, and we return an error. But for the successful fallthrough > case, we get a value in a register. > > I'd love to be able to write it as (this is simplified, and doesn't > worry about all the different access sizes, or the "stac/clac" > sequence to enable user accesses on modern Intel CPU's): > > asm goto( > "1:" > "\tmovl %0,%1\n" > _ASM_EXTABLE(1b,%l[error]) > : "=r" (val) > : "m" (*userptr) > : : error); > > where that "_ASM_EXTABLE()" is our magic macro for generating an > exception entry for that instruction, so that if the load takes an > exception, it will instead to to the "error" label. > > But if it goes to the error label, the "val" output register really > doesn't contain anything, so we wouldn't even *want* gcc to try to do > any register allocation for the "jump to label from assembly" case. > > So at least for one of the major cases that I'd like to use "asm goto" > with an output, I actually don't *want* any register allocation for > anything but the fallthrough case. And I suspect that's a > not-too-uncommon pattern - it's probably often about error handling. > > As I wrote already if we implement output reloads after the control flow insn, it does not matter what operand constraint should be (memory or register). Implementing it only for fall-through case simplify the task but not so much. For LRA it is doable and I can do this, for reload pass it is very hard (requirement only memory operand can simplify the implementation in reload although I am not sure about it). But may be somebody will agree to do it for reload, sorry only not me -- i can not think about this without flinching. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] x86: Optimize variable_test_bit() 2015-05-01 20:49 ` Linus Torvalds 2015-05-01 22:22 ` Vladimir Makarov @ 2015-05-02 12:39 ` Peter Zijlstra 2015-05-04 15:37 ` Richard Henderson 2015-05-04 19:33 ` [RFC] Design for flag bit outputs from asms Richard Henderson 1 sibling, 2 replies; 31+ messages in thread From: Peter Zijlstra @ 2015-05-02 12:39 UTC (permalink / raw) To: Linus Torvalds Cc: Vladimir Makarov, Jakub Jelinek, Richard Henderson, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov On Fri, May 01, 2015 at 01:49:52PM -0700, Linus Torvalds wrote: > On Fri, May 1, 2015 at 12:02 PM, Vladimir Makarov <vmakarov@redhat.com> wrote: > > > > GCC RA is a major reason to prohibit output operands for asm goto. > > Hmm.. Thinking some more about it, I think that what would actually > work really well at least for the kernel is: > > (a) allow *memory* operands (ie "=m") as outputs and having them be > meaningful even at any output labels (obviously with the caveat that > the asm instructions that write to memory would have to happen before > the branch ;) > > This covers the somewhat common case of having magic instructions that > result in conditions that can't be tested at a C level. Things like > "bit clear and test" on x86 (with or without the lock) . Would not something like: static inline bool __test_and_clear_bit(long nr, volatile unsigned long *addr) { bool oldbit; asm volatile ("btr %2, %1" : "CF" (oldbit), "+m" (*addr) : "Ir" (nr)); return oldbit; } Be the far better solution for this? Bug 59615 comment 7 states that they actually modeled the flags in the .md file, so the above should be possible to implement. Now GCC can decide to use "sbb %0, %0" to convert CF into a register value or use "jnc" / "jc" for branches, depending on what __test_and_clear_bit() was used for. We don't have to (ab)use asm goto for these things anymore; furthermore I think the above will naturally work with our __builtin_expect() hints, whereas the asm goto stuff has a hard time with that (afaik). That's not to say output operants for asm goto would not still be useful for other things (like your EXTABLE example). ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] x86: Optimize variable_test_bit() 2015-05-02 12:39 ` Peter Zijlstra @ 2015-05-04 15:37 ` Richard Henderson 2015-05-04 19:33 ` [RFC] Design for flag bit outputs from asms Richard Henderson 1 sibling, 0 replies; 31+ messages in thread From: Richard Henderson @ 2015-05-04 15:37 UTC (permalink / raw) To: Peter Zijlstra, Linus Torvalds Cc: Vladimir Makarov, Jakub Jelinek, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov On 05/02/2015 05:39 AM, Peter Zijlstra wrote: > On Fri, May 01, 2015 at 01:49:52PM -0700, Linus Torvalds wrote: >> On Fri, May 1, 2015 at 12:02 PM, Vladimir Makarov <vmakarov@redhat.com> wrote: >>> >>> GCC RA is a major reason to prohibit output operands for asm goto. >> >> Hmm.. Thinking some more about it, I think that what would actually >> work really well at least for the kernel is: >> >> (a) allow *memory* operands (ie "=m") as outputs and having them be >> meaningful even at any output labels (obviously with the caveat that >> the asm instructions that write to memory would have to happen before >> the branch ;) >> >> This covers the somewhat common case of having magic instructions that >> result in conditions that can't be tested at a C level. Things like >> "bit clear and test" on x86 (with or without the lock) . > > Would not something like: > > static inline bool __test_and_clear_bit(long nr, volatile unsigned long *addr) > { > bool oldbit; > > asm volatile ("btr %2, %1" > : "CF" (oldbit), "+m" (*addr) > : "Ir" (nr)); > > return oldbit; > } > > Be the far better solution for this? Bug 59615 comment 7 states that > they actually modeled the flags in the .md file, so the above should be > possible to implement. > > Now GCC can decide to use "sbb %0, %0" to convert CF into a register > value or use "jnc" / "jc" for branches, depending on what > __test_and_clear_bit() was used for. > > We don't have to (ab)use asm goto for these things anymore; furthermore > I think the above will naturally work with our __builtin_expect() hints, > whereas the asm goto stuff has a hard time with that (afaik). > > That's not to say output operants for asm goto would not still be useful > for other things (like your EXTABLE example). > I agree that being able to model flags outputs, and thus minimize the amount of code actually within the asm, is superior to the complexity of asm goto. r~ ^ permalink raw reply [flat|nested] 31+ messages in thread
* [RFC] Design for flag bit outputs from asms 2015-05-02 12:39 ` Peter Zijlstra 2015-05-04 15:37 ` Richard Henderson @ 2015-05-04 19:33 ` Richard Henderson 2015-05-04 20:14 ` H. Peter Anvin ` (2 more replies) 1 sibling, 3 replies; 31+ messages in thread From: Richard Henderson @ 2015-05-04 19:33 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Vladimir Makarov, Jakub Jelinek, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, gcc On 05/02/2015 05:39 AM, Peter Zijlstra wrote: > static inline bool __test_and_clear_bit(long nr, volatile unsigned long *addr) > { > bool oldbit; > > asm volatile ("btr %2, %1" > : "CF" (oldbit), "+m" (*addr) > : "Ir" (nr)); > > return oldbit; > } > > Be the far better solution for this? Bug 59615 comment 7 states that > they actually modeled the flags in the .md file, so the above should be > possible to implement. > > Now GCC can decide to use "sbb %0, %0" to convert CF into a register > value or use "jnc" / "jc" for branches, depending on what > __test_and_clear_bit() was used for. > > We don't have to (ab)use asm goto for these things anymore; furthermore > I think the above will naturally work with our __builtin_expect() hints, > whereas the asm goto stuff has a hard time with that (afaik). > > That's not to say output operants for asm goto would not still be useful > for other things (like your EXTABLE example). > (0) The C level output variable should be an integral type, from bool on up. The flags are a scarse resource, easily clobbered. We cannot allow user code to keep data in the flags. While x86 does have lahf/sahf, they don't exactly perform well. And other targets like arm don't even have that bad option. Therefore, the language level semantics are that the output is a boolean store into the variable with a condition specified by a magic constraint. That said, just like the compiler should be able to optimize void bar(int y) { int x = (y <= 0); if (x) foo(); } such that we only use a single compare against y, the expectation is that within a similarly constrained context the compiler will not require two tests for these boolean outputs. Therefore: (1) Each target defines a set of constraint strings, E.g. for x86, wherein we're almost out of constraint letters, ja aux carry flag jc carry flag jo overflow flag jp parity flag js sign flag jz zero flag E.g. for arm/aarch64 (using "j" here, but other possibilities exist): jn negative flag jc carry flag jz zero flag jv overflow flag E.g. for s390x (I've thought less about what's useful here) j<m> where m is a hex digit, and is the mask of CC values for which the condition is true; exactly corresponding to the M1 field in the branch on condition instruction. (2) A new target hook post-processes the asm_insn, looking for the new constraint strings. The hook expands the condition prescribed by the string, adjusting the asm_insn as required. E.g. bool x, y, z; asm ("xyzzy" : "=jc"(x), "=jp"(y), "=jo"(z) : : ); originally (parallel [ (set (reg:QI 83 [ x ]) (asm_operands/v:QI ("xyzzy") ("=jc") 0 [] [] [] z.c:4)) (set (reg:QI 84 [ y ]) (asm_operands/v:QI ("xyzzy") ("=jp") 1 [] [] [] z.c:4)) (set (reg:QI 85 [ z ]) (asm_operands/v:QI ("xyzzy") ("=jo") 2 [] [] [] z.c:4)) (clobber (reg:QI 18 fpsr)) (clobber (reg:QI 17 flags)) ]) becomes (parallel [ (set (reg:CC 17 flags) (asm_operands/v:CC ("xyzzy") ("=j_") 0 [] [] [] z.c:4)) (clobber (reg:QI 18 fpsr)) ]) (set (reg:QI 83 [ x ]) (ne:QI (reg:CCC 17 flags) (const_int 0))) (set (reg:QI 84 [ y ]) (ne:QI (reg:CCP 17 flags) (const_int 0))) (set (reg:QI 85 [ z ]) (ne:QI (reg:CCO 17 flags) (const_int 0))) which ought to assemble to something like xyzzy setc %dl setp %cl seto %r15l Note that rtl level data flow is preserved via the flags hard register, and the lifetime of flags would not extended any further than we would for a normal cstore pattern. Note that the output constraints are adjusted to a single internal "=j_" which would match the flags register in any mode. We can collapse several output flags to a single set of the flags hard register. (3) Note that ppc is both easier and more complicated. There we have 8 4-bit registers, although most of the integer non-comparisons only write to CR0. And the vector non-comparisons only write to CR1, though of course that's of less interest in the context of kernel code. For the purposes of cr0, the same scheme could certainly work, although the hook would not insert a hard register use, but rather a pseudo to be allocated to cr0 (constaint "x"). That said, it's my understanding that "dot insns", setting cr0 are expensive in current processor generations. There's also a lot less of the x86-style "operate and set a flag based on something useful". Can anyone think of any drawbacks, pitfalls, or portability issues to less popular targets that I havn't considered? r~ ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Design for flag bit outputs from asms 2015-05-04 19:33 ` [RFC] Design for flag bit outputs from asms Richard Henderson @ 2015-05-04 20:14 ` H. Peter Anvin 2015-05-04 20:27 ` H. Peter Anvin ` (2 more replies) 2015-05-05 9:01 ` Gabriel Paubert 2015-05-05 13:50 ` Segher Boessenkool 2 siblings, 3 replies; 31+ messages in thread From: H. Peter Anvin @ 2015-05-04 20:14 UTC (permalink / raw) To: Richard Henderson, Peter Zijlstra Cc: Linus Torvalds, Vladimir Makarov, Jakub Jelinek, Ingo Molnar, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, gcc On 05/04/2015 12:33 PM, Richard Henderson wrote: > > (0) The C level output variable should be an integral type, from bool on up. > > The flags are a scarse resource, easily clobbered. We cannot allow user code > to keep data in the flags. While x86 does have lahf/sahf, they don't exactly > perform well. And other targets like arm don't even have that bad option. > > Therefore, the language level semantics are that the output is a boolean store > into the variable with a condition specified by a magic constraint. > > That said, just like the compiler should be able to optimize > > void bar(int y) > { > int x = (y <= 0); > if (x) foo(); > } > > such that we only use a single compare against y, the expectation is that > within a similarly constrained context the compiler will not require two tests > for these boolean outputs. > > Therefore: > > (1) Each target defines a set of constraint strings, > > E.g. for x86, wherein we're almost out of constraint letters, > > ja aux carry flag > jc carry flag > jo overflow flag > jp parity flag > js sign flag > jz zero flag > I would argue that for x86 what you actually want is to model the *conditions* that are available on the flags, not the flags themselves. There are 16 such conditions, 8 if we discard the inversions. It is notable that the auxiliary carry flag has no Jcc/SETcc/CMOVcc instructions; it is only ever consumed by the DAA/DAS instructions which makes it pointless to try to model it in a compiler any more than, say, IF. > (2) A new target hook post-processes the asm_insn, looking for the > new constraint strings. The hook expands the condition prescribed > by the string, adjusting the asm_insn as required. > > E.g. > > bool x, y, z; > asm ("xyzzy" : "=jc"(x), "=jp"(y), "=jo"(z) : : ); Other than that, this is exactly what would be wonderful to see. -hpa ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Design for flag bit outputs from asms 2015-05-04 20:14 ` H. Peter Anvin @ 2015-05-04 20:27 ` H. Peter Anvin 2015-05-04 20:33 ` Richard Henderson 2015-05-04 20:35 ` Linus Torvalds 2 siblings, 0 replies; 31+ messages in thread From: H. Peter Anvin @ 2015-05-04 20:27 UTC (permalink / raw) To: Richard Henderson, Peter Zijlstra Cc: Linus Torvalds, Vladimir Makarov, Jakub Jelinek, Ingo Molnar, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, gcc On 05/04/2015 01:14 PM, H. Peter Anvin wrote: >> >> Therefore: >> >> (1) Each target defines a set of constraint strings, >> >> E.g. for x86, wherein we're almost out of constraint letters, >> >> ja aux carry flag >> jc carry flag >> jo overflow flag >> jp parity flag >> js sign flag >> jz zero flag >> > > I would argue that for x86 what you actually want is to model the > *conditions* that are available on the flags, not the flags themselves. > There are 16 such conditions, 8 if we discard the inversions. > > It is notable that the auxiliary carry flag has no Jcc/SETcc/CMOVcc > instructions; it is only ever consumed by the DAA/DAS instructions which > makes it pointless to try to model it in a compiler any more than, say, IF. > OK, let me qualify that. This is only necessary if it is impractical for gcc to optimize boolean combinations of flags. If such optimizations are available then it doesn't matter and is probably needlessly complex. For example: char foo(void) { bool zf, sf, of; asm("xyzzy" : "=jz" (zf), "=js" (sf), "=jo" (of)); return zf || (sf != of); } ... should compile to ... xyzzy setng %al ret -hpa ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Design for flag bit outputs from asms 2015-05-04 20:14 ` H. Peter Anvin 2015-05-04 20:27 ` H. Peter Anvin @ 2015-05-04 20:33 ` Richard Henderson 2015-05-04 20:45 ` Linus Torvalds 2015-05-04 20:35 ` Linus Torvalds 2 siblings, 1 reply; 31+ messages in thread From: Richard Henderson @ 2015-05-04 20:33 UTC (permalink / raw) To: H. Peter Anvin, Peter Zijlstra Cc: Linus Torvalds, Vladimir Makarov, Jakub Jelinek, Ingo Molnar, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, gcc On 05/04/2015 01:14 PM, H. Peter Anvin wrote: > On 05/04/2015 12:33 PM, Richard Henderson wrote: >> >> (0) The C level output variable should be an integral type, from bool on up. >> >> The flags are a scarse resource, easily clobbered. We cannot allow user code >> to keep data in the flags. While x86 does have lahf/sahf, they don't exactly >> perform well. And other targets like arm don't even have that bad option. >> >> Therefore, the language level semantics are that the output is a boolean store >> into the variable with a condition specified by a magic constraint. >> >> That said, just like the compiler should be able to optimize >> >> void bar(int y) >> { >> int x = (y <= 0); >> if (x) foo(); >> } >> >> such that we only use a single compare against y, the expectation is that >> within a similarly constrained context the compiler will not require two tests >> for these boolean outputs. >> >> Therefore: >> >> (1) Each target defines a set of constraint strings, >> >> E.g. for x86, wherein we're almost out of constraint letters, >> >> ja aux carry flag >> jc carry flag >> jo overflow flag >> jp parity flag >> js sign flag >> jz zero flag >> > > I would argue that for x86 what you actually want is to model the > *conditions* that are available on the flags, not the flags themselves. > There are 16 such conditions, 8 if we discard the inversions. A fair point. Though honestly, I was hoping that this feature would mostly be used for conditions that are "weird" -- that is, not normally describable by arithmetic at all. Otherwise, why are you using inline asm for it? > It is notable that the auxiliary carry flag has no Jcc/SETcc/CMOVcc > instructions; it is only ever consumed by the DAA/DAS instructions which > makes it pointless to try to model it in a compiler any more than, say, IF. Oh yeah. Consider that dropped. r~ ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Design for flag bit outputs from asms 2015-05-04 20:33 ` Richard Henderson @ 2015-05-04 20:45 ` Linus Torvalds 2015-05-04 20:57 ` Richard Henderson 0 siblings, 1 reply; 31+ messages in thread From: Linus Torvalds @ 2015-05-04 20:45 UTC (permalink / raw) To: Richard Henderson Cc: H. Peter Anvin, Peter Zijlstra, Vladimir Makarov, Jakub Jelinek, Ingo Molnar, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, gcc On Mon, May 4, 2015 at 1:33 PM, Richard Henderson <rth@redhat.com> wrote: > > A fair point. Though honestly, I was hoping that this feature would mostly be > used for conditions that are "weird" -- that is, not normally describable by > arithmetic at all. Otherwise, why are you using inline asm for it? I could easily imagine using some of the combinations for atomic operations. For example, doing a "lock decl", and wanting to see if the result is negative or zero. Sure, it would be possible to set *two* booleans (ZF and SF), but there's a contiional for "BE".. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Design for flag bit outputs from asms 2015-05-04 20:45 ` Linus Torvalds @ 2015-05-04 20:57 ` Richard Henderson 2015-05-04 21:23 ` H. Peter Anvin 0 siblings, 1 reply; 31+ messages in thread From: Richard Henderson @ 2015-05-04 20:57 UTC (permalink / raw) To: Linus Torvalds Cc: H. Peter Anvin, Peter Zijlstra, Vladimir Makarov, Jakub Jelinek, Ingo Molnar, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, gcc On 05/04/2015 01:45 PM, Linus Torvalds wrote: > On Mon, May 4, 2015 at 1:33 PM, Richard Henderson <rth@redhat.com> wrote: >> >> A fair point. Though honestly, I was hoping that this feature would mostly be >> used for conditions that are "weird" -- that is, not normally describable by >> arithmetic at all. Otherwise, why are you using inline asm for it? > > I could easily imagine using some of the combinations for atomic operations. > > For example, doing a "lock decl", and wanting to see if the result is > negative or zero. Sure, it would be possible to set *two* booleans (ZF > and SF), but there's a contiional for "BE".. Sure. I'd be more inclined to support these compound conditionals directly, rather than try to get the compiler to recognize them after the fact. Indeed, I believe we have a near complete set of them in the x86 backend already. It'd just be a matter of selecting the spellings for the constraints. r~ ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Design for flag bit outputs from asms 2015-05-04 20:57 ` Richard Henderson @ 2015-05-04 21:23 ` H. Peter Anvin 0 siblings, 0 replies; 31+ messages in thread From: H. Peter Anvin @ 2015-05-04 21:23 UTC (permalink / raw) To: Richard Henderson, Linus Torvalds Cc: Peter Zijlstra, Vladimir Makarov, Jakub Jelinek, Ingo Molnar, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, gcc On 05/04/2015 01:57 PM, Richard Henderson wrote: > > Sure. > > I'd be more inclined to support these compound conditionals directly, rather > than try to get the compiler to recognize them after the fact. > > Indeed, I believe we have a near complete set of them in the x86 backend > already. It'd just be a matter of selecting the spellings for the constraints. > Whichever works for you. The full set of conditions, mnemonics, and a bitmask with the bits in the order from MSB to LSB (OF,SF,ZF,PF,CF) which is probably the sanest way to model these for the purpose of boolean optimization. Opcode Mnemonics Condition Bitmask 0 o OF 0xffff0000 1 no !OF 0x0000ffff 2 b/c/nae CF 0xaaaaaaaa 3 ae/nb/nc !CF 0x55555555 4 e/z ZF 0xf0f0f0f0 5 ne/nz !ZF 0x0f0f0f0f 6 na CF || ZF 0xfafafafa 7 a !CF && !ZF 0x05050505 8 s SF 0xff00ff00 9 ns !SF 0x00ff00ff A p/pe PF 0xcccccccc B np/po !PF 0x33333333 C l/nge SF != OF 0x00ffff00 D ge/nl SF == OF 0xff0000ff E le/ng ZF || (SF != OF) 0xf0fffff0 F g/nle !ZF && (SF == OF) 0x0f00000f -hpa ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Design for flag bit outputs from asms 2015-05-04 20:14 ` H. Peter Anvin 2015-05-04 20:27 ` H. Peter Anvin 2015-05-04 20:33 ` Richard Henderson @ 2015-05-04 20:35 ` Linus Torvalds 2015-05-04 20:42 ` H. Peter Anvin 2 siblings, 1 reply; 31+ messages in thread From: Linus Torvalds @ 2015-05-04 20:35 UTC (permalink / raw) To: H. Peter Anvin Cc: Richard Henderson, Peter Zijlstra, Vladimir Makarov, Jakub Jelinek, Ingo Molnar, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, gcc On Mon, May 4, 2015 at 1:14 PM, H. Peter Anvin <hpa@zytor.com> wrote: > > I would argue that for x86 what you actually want is to model the > *conditions* that are available on the flags, not the flags themselves. Yes. Otherwise it would be a nightmare to try to describe simple conditions like "le", which a rather complicated combination of three of the actual flag bits: ((SF ^^ OF) || ZF) = 1 which would just be ridiculously painful for (a) the user to describe and (b) fior the compiler to recognize once described. Now, I do admit that most of the cases where you'd use inline asm with condition codes would probably fall into just simple "test ZF or CF". But I could certainly imagine other cases. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Design for flag bit outputs from asms 2015-05-04 20:35 ` Linus Torvalds @ 2015-05-04 20:42 ` H. Peter Anvin 0 siblings, 0 replies; 31+ messages in thread From: H. Peter Anvin @ 2015-05-04 20:42 UTC (permalink / raw) To: Linus Torvalds Cc: Richard Henderson, Peter Zijlstra, Vladimir Makarov, Jakub Jelinek, Ingo Molnar, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, gcc On 05/04/2015 01:35 PM, Linus Torvalds wrote: > On Mon, May 4, 2015 at 1:14 PM, H. Peter Anvin <hpa@zytor.com> wrote: >> >> I would argue that for x86 what you actually want is to model the >> *conditions* that are available on the flags, not the flags themselves. > > Yes. Otherwise it would be a nightmare to try to describe simple > conditions like "le", which a rather complicated combination of three > of the actual flag bits: > > ((SF ^^ OF) || ZF) = 1 > > which would just be ridiculously painful for (a) the user to describe > and (b) fior the compiler to recognize once described. > > Now, I do admit that most of the cases where you'd use inline asm with > condition codes would probably fall into just simple "test ZF or CF". > But I could certainly imagine other cases. > Yes, although once again I'm more than happy to let gcc do the boolean optimizations if it already has logic to do so (which it might have/want for its own reasons.) -hpa ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Design for flag bit outputs from asms 2015-05-04 19:33 ` [RFC] Design for flag bit outputs from asms Richard Henderson 2015-05-04 20:14 ` H. Peter Anvin @ 2015-05-05 9:01 ` Gabriel Paubert 2015-05-05 13:50 ` Segher Boessenkool 2 siblings, 0 replies; 31+ messages in thread From: Gabriel Paubert @ 2015-05-05 9:01 UTC (permalink / raw) To: Richard Henderson Cc: Peter Zijlstra, Linus Torvalds, Vladimir Makarov, Jakub Jelinek, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, gcc On Mon, May 04, 2015 at 12:33:38PM -0700, Richard Henderson wrote: [snipped] > (3) Note that ppc is both easier and more complicated. > > There we have 8 4-bit registers, although most of the integer > non-comparisons only write to CR0. And the vector non-comparisons > only write to CR1, though of course that's of less interest in the > context of kernel code. Actually vector (Altivec) write to CR6. Standard FPU optionally write to CR1, but the written value does not exactly depend on the result of the last instruction; it is an instead an accrued exception status. > > For the purposes of cr0, the same scheme could certainly work, although > the hook would not insert a hard register use, but rather a pseudo to > be allocated to cr0 (constaint "x"). Yes, but we might also want to leave the choice of a cr register to the compiler. > > That said, it's my understanding that "dot insns", setting cr0 are > expensive in current processor generations. Not that much if I understand properly power7.md and power8.md: no (P7) or one (P8) additional clock for common instructions (add/sub and logical), but nothing else, so they are likely a win. Shift/rotate/sign extensions seem to have more decoding restrictions: the recording ("dot") forms are "cracked" and use 2 integer units. > There's also a lot less > of the x86-style "operate and set a flag based on something useful". > But there is at least an important one, which I occasionally wished I had: the conditional stores. The overflow bit might also be useful, not really for the kernel, but for applications (and mfxer is slow). Regards, Gabriel ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Design for flag bit outputs from asms 2015-05-04 19:33 ` [RFC] Design for flag bit outputs from asms Richard Henderson 2015-05-04 20:14 ` H. Peter Anvin 2015-05-05 9:01 ` Gabriel Paubert @ 2015-05-05 13:50 ` Segher Boessenkool 2015-05-05 15:37 ` Linus Torvalds 2 siblings, 1 reply; 31+ messages in thread From: Segher Boessenkool @ 2015-05-05 13:50 UTC (permalink / raw) To: Richard Henderson Cc: Peter Zijlstra, Linus Torvalds, Vladimir Makarov, Jakub Jelinek, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, gcc On Mon, May 04, 2015 at 12:33:38PM -0700, Richard Henderson wrote: > (1) Each target defines a set of constraint strings, > (2) A new target hook post-processes the asm_insn, looking for the > new constraint strings. The hook expands the condition prescribed > by the string, adjusting the asm_insn as required. Since it is pre-processed, there is no real reason to overlap this with the constraints namespace; we could have e.g. "=@[xy]" (and "@[xy]" for inputs) mean the target needs to do some "xy" transform here. > Note that the output constraints are adjusted to a single internal "=j_" > which would match the flags register in any mode. We can collapse > several output flags to a single set of the flags hard register. Many targets would use an already existing contraint that describes the flags. Targets that need a fixed register could just insert the hard register here as far as I see? (I'm assuming this happens at expand time). > (3) Note that ppc is both easier and more complicated. > > There we have 8 4-bit registers, although most of the integer > non-comparisons only write to CR0. And the vector non-comparisons > only write to CR1, though of course that's of less interest in the > context of kernel code. > > For the purposes of cr0, the same scheme could certainly work, although > the hook would not insert a hard register use, but rather a pseudo to > be allocated to cr0 (constaint "x"). And "y" for "any CR field". > That said, it's my understanding that "dot insns", setting cr0 are > expensive in current processor generations. They are not. (Cell BE is not "current" :-) ) PowerPC also has some other bits (the carry bit for example, CA) that could be usefully exposed via this mechanism. > Can anyone think of any drawbacks, pitfalls, or portability issues to less > popular targets that I havn't considered? I don't like co-opting the constraint names for this; other than that, it looks quite good :-) Segher ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Design for flag bit outputs from asms 2015-05-05 13:50 ` Segher Boessenkool @ 2015-05-05 15:37 ` Linus Torvalds 2015-05-05 16:10 ` Segher Boessenkool 0 siblings, 1 reply; 31+ messages in thread From: Linus Torvalds @ 2015-05-05 15:37 UTC (permalink / raw) To: Segher Boessenkool Cc: Richard Henderson, Peter Zijlstra, Vladimir Makarov, Jakub Jelinek, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, gcc On Tue, May 5, 2015 at 6:50 AM, Segher Boessenkool <segher@kernel.crashing.org> wrote: > > Since it is pre-processed, there is no real reason to overlap this with > the constraints namespace; we could have e.g. "=@[xy]" (and "@[xy]" for > inputs) mean the target needs to do some "xy" transform here. In fact, standing out visually would be just a good thing, since it's pretty special even from a usage standpoint. And are you actually planning to have flags as inputs? Because *that* sounds like a bad idea. It's pretty hard to turn a boolean into a flag value, while pretty much any archiecture has an operation like "setcc" to go the other way. And I don't think your machine descriptions have anything to "generate flags". You'd have to add fragile and complex machinery for something it is unlikely anybody ever wants. Flag *outputs* people definitely want. Flag inputs? Yeah, I can absolutely see the carry flag being useful for multi-precision arithmetic, but it's *so* hard to guarantee that it still is live, that in practice the compiler would likely have to re-generate it from a value anyway, so ... So I'd go for output-only, and make the syntax be something very visually unambiguous. That "=@[xy]" format looks fine, where "xy" would be very architecture-dependent. Or make it even *more* specific by using "CC" for condition codes, and make the syntax "=@CC[xy]", in case you ever want to use the "@" marker for any other kind of magic constraint. Linus ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC] Design for flag bit outputs from asms 2015-05-05 15:37 ` Linus Torvalds @ 2015-05-05 16:10 ` Segher Boessenkool 0 siblings, 0 replies; 31+ messages in thread From: Segher Boessenkool @ 2015-05-05 16:10 UTC (permalink / raw) To: Linus Torvalds Cc: Richard Henderson, Peter Zijlstra, Vladimir Makarov, Jakub Jelinek, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov, gcc On Tue, May 05, 2015 at 08:37:01AM -0700, Linus Torvalds wrote: > On Tue, May 5, 2015 at 6:50 AM, Segher Boessenkool > <segher@kernel.crashing.org> wrote: > > > > Since it is pre-processed, there is no real reason to overlap this with > > the constraints namespace; we could have e.g. "=@[xy]" (and "@[xy]" for > > inputs) mean the target needs to do some "xy" transform here. > > In fact, standing out visually would be just a good thing, since it's > pretty special even from a usage standpoint. > > And are you actually planning to have flags as inputs? Because *that* > sounds like a bad idea. It's pretty hard to turn a boolean into a flag > value, while pretty much any archiecture has an operation like "setcc" > to go the other way. And I don't think your machine descriptions have > anything to "generate flags". You'd have to add fragile and complex > machinery for something it is unlikely anybody ever wants. It isn't hard (or expensive) to turn integers into flags, on many targets. It is nice to allow this at least in the generic part of the code -- what targets do in their target hook is up to them. It isn't fragile or complex. Not useful on some archs, yes I certainly believe that. But the lovely thing about Richard's proposal is that it actually is a very simple addition to what the compiler already does, there are no hard new optimisations needed, it's just a bit of munging to allow the user to write an asm with condition code in/outs. Allowing inputs is just another bool argument to the target hook. I'd rather have this more orthogonal than more specialised; it can be used for much more than just condition codes. It's not like the "more general" syntax would be a burden, as far as I see. Segher ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] x86: Optimize variable_test_bit() 2015-05-01 19:02 ` Vladimir Makarov 2015-05-01 20:49 ` Linus Torvalds @ 2015-05-02 12:43 ` Peter Zijlstra 2015-05-04 18:07 ` Vladimir Makarov 1 sibling, 1 reply; 31+ messages in thread From: Peter Zijlstra @ 2015-05-02 12:43 UTC (permalink / raw) To: Vladimir Makarov Cc: Jakub Jelinek, Linus Torvalds, Richard Henderson, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov On Fri, May 01, 2015 at 03:02:24PM -0400, Vladimir Makarov wrote: > Currently LRA is used by x86/x86-64, ARM, AARCH64, s390, and MIPS. > PPC, SH, and ARC are moving to LRA. All other targets are still > reload based. > > So I could implement the output reloads in LRA, probably for the > next GCC release. How to enable and mostly use it for multi-target > code like the kernel is another question. Pretty much all inline asm is in per arch code; so one arch having different asm features than another should not be a problem at all. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] x86: Optimize variable_test_bit() 2015-05-02 12:43 ` [PATCH] x86: Optimize variable_test_bit() Peter Zijlstra @ 2015-05-04 18:07 ` Vladimir Makarov 2015-05-04 20:14 ` H. Peter Anvin 0 siblings, 1 reply; 31+ messages in thread From: Vladimir Makarov @ 2015-05-04 18:07 UTC (permalink / raw) To: Peter Zijlstra Cc: Jakub Jelinek, Linus Torvalds, Richard Henderson, Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov On 02/05/15 08:43 AM, Peter Zijlstra wrote: > On Fri, May 01, 2015 at 03:02:24PM -0400, Vladimir Makarov wrote: >> Currently LRA is used by x86/x86-64, ARM, AARCH64, s390, and MIPS. >> PPC, SH, and ARC are moving to LRA. All other targets are still >> reload based. >> >> So I could implement the output reloads in LRA, probably for the >> next GCC release. How to enable and mostly use it for multi-target >> code like the kernel is another question. > Pretty much all inline asm is in per arch code; so one arch having > different asm features than another should not be a problem at all. Ok, then. I'll try to implement output operands for asm-goto in LRA for the next GCC release. Of course, if nobody objects to changing asm goto semantics from An 'asm goto' statement cannot have outputs ... to An 'asm goto' statement cannot have outputs on some targets ... ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] x86: Optimize variable_test_bit() 2015-05-04 18:07 ` Vladimir Makarov @ 2015-05-04 20:14 ` H. Peter Anvin 0 siblings, 0 replies; 31+ messages in thread From: H. Peter Anvin @ 2015-05-04 20:14 UTC (permalink / raw) To: Vladimir Makarov, Peter Zijlstra Cc: Jakub Jelinek, Linus Torvalds, Richard Henderson, Ingo Molnar, Thomas Gleixner, Linux Kernel Mailing List, Borislav Petkov On 05/04/2015 11:07 AM, Vladimir Makarov wrote: >>> >>> So I could implement the output reloads in LRA, probably for the >>> next GCC release. How to enable and mostly use it for multi-target >>> code like the kernel is another question. >> Pretty much all inline asm is in per arch code; so one arch having >> different asm features than another should not be a problem at all. > Ok, then. I'll try to implement output operands for asm-goto in LRA for > the next GCC release. > > Of course, if nobody objects to changing asm goto semantics from > > An 'asm goto' statement cannot have outputs ... > > to > > An 'asm goto' statement cannot have outputs on some targets ... > A gradual implementation should be fine. -hpa ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] x86: Optimize variable_test_bit() 2015-05-01 15:16 [PATCH] x86: Optimize variable_test_bit() Peter Zijlstra 2015-05-01 16:03 ` Linus Torvalds @ 2015-05-04 13:42 ` Peter Zijlstra 1 sibling, 0 replies; 31+ messages in thread From: Peter Zijlstra @ 2015-05-04 13:42 UTC (permalink / raw) To: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Linus Torvalds Cc: linux-kernel, Borislav Petkov, Jakub Jelinek On Fri, May 01, 2015 at 05:16:30PM +0200, Peter Zijlstra wrote: > diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h > index cfe3b954d5e4..bcf4fa77c04f 100644 > --- a/arch/x86/include/asm/bitops.h > +++ b/arch/x86/include/asm/bitops.h > @@ -313,6 +313,15 @@ static __always_inline int constant_test_bit(long nr, const volatile unsigned lo > > static inline int variable_test_bit(long nr, volatile const unsigned long *addr) > { > +#ifdef CC_HAVE_ASM_GOTO > + asm_volatile_goto ("bt %1, %0\n\t" > + "jc %l[cc_label]" > + : : "m" (*(unsigned long *)addr), "Ir" (nr) > + : : cc_label); > + return 0; > +cc_label: > + return 1; > +#else I figured I'd try both jc and jnc versions: text data bss dec hex filename 12203914 1738112 1081344 15023370 e53d0a defconfig-build/vmlinux 12204228 1738112 1081344 15023684 e53e44 defconfig-build/vmlinux-jc 12203240 1738112 1081344 15022696 e53a68 defconfig-build/vmlinux-jnc Clearly I picked the wrong one :-) It also shows there is real value in exposing this decision to GCC. ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2015-05-05 18:28 UTC | newest] Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-05-01 15:16 [PATCH] x86: Optimize variable_test_bit() Peter Zijlstra 2015-05-01 16:03 ` Linus Torvalds 2015-05-01 16:16 ` Peter Zijlstra 2015-05-01 16:29 ` Peter Zijlstra 2015-05-01 16:18 ` Peter Zijlstra 2015-05-01 16:33 ` Jakub Jelinek 2015-05-01 16:45 ` Linus Torvalds 2015-05-01 16:46 ` Peter Zijlstra 2015-05-01 17:17 ` Ingo Molnar 2015-05-01 19:02 ` Vladimir Makarov 2015-05-01 20:49 ` Linus Torvalds 2015-05-01 22:22 ` Vladimir Makarov 2015-05-02 12:39 ` Peter Zijlstra 2015-05-04 15:37 ` Richard Henderson 2015-05-04 19:33 ` [RFC] Design for flag bit outputs from asms Richard Henderson 2015-05-04 20:14 ` H. Peter Anvin 2015-05-04 20:27 ` H. Peter Anvin 2015-05-04 20:33 ` Richard Henderson 2015-05-04 20:45 ` Linus Torvalds 2015-05-04 20:57 ` Richard Henderson 2015-05-04 21:23 ` H. Peter Anvin 2015-05-04 20:35 ` Linus Torvalds 2015-05-04 20:42 ` H. Peter Anvin 2015-05-05 9:01 ` Gabriel Paubert 2015-05-05 13:50 ` Segher Boessenkool 2015-05-05 15:37 ` Linus Torvalds 2015-05-05 16:10 ` Segher Boessenkool 2015-05-02 12:43 ` [PATCH] x86: Optimize variable_test_bit() Peter Zijlstra 2015-05-04 18:07 ` Vladimir Makarov 2015-05-04 20:14 ` H. Peter Anvin 2015-05-04 13:42 ` Peter Zijlstra
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).