* [PATCH] x86 rwsem optimization extreme
@ 2010-02-17 21:58 Zachary Amsden
  2010-02-17 22:10 ` Linus Torvalds
  0 siblings, 1 reply; 12+ messages in thread
From: Zachary Amsden @ 2010-02-17 21:58 UTC (permalink / raw)
  To: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Linus Torvalds
  Cc: x86, Avi Kivity, Zachary Amsden

The x86 instruction set provides the ability to add an additional
bit into addition or subtraction by using the carry flag.
It also provides instructions to directly set or clear the
carry flag.  By forcibly setting the carry flag, we can then
represent one particular 64-bit constant, namely

   0xffffffff + 1 = 0x100000000

using only 32-bit values.  In particular we can optimize the rwsem
write lock release by noting it is of exactly this form.
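
(As a standalone illustration, not part of the patch: the identity can
be checked from user space.  A minimal sketch, assuming gcc-style
inline asm on x86-64; the asm mirrors the stc/adc pair used below.)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t count = 0;		/* stand-in for sem->count */
	uint64_t bias = 0xffffffffULL;

	/* stc forces CF=1, so adc adds bias + CF = 0x100000000 */
	asm volatile("stc; adc %1, %0"
		     : "+m" (count)
		     : "r" (bias)
		     : "cc");

	printf("count = %#llx\n", (unsigned long long)count);
	return count == 0x100000000ULL ? 0 : 1;
}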

The old instruction sequence:

0000000000000073 <downgrade_write>:
  73:	55                   	push   %rbp
  74:	48 ba 00 00 00 00 01 	mov    $0x100000000,%rdx
  7b:	00 00 00
  7e:	48 89 f8             	mov    %rdi,%rax
  81:	48 89 e5             	mov    %rsp,%rbp
  84:	f0 48 01 10          	lock add %rdx,(%rax)
  88:	79 05                	jns    8f <downgrade_write+0x1c>
  8a:	e8 00 00 00 00       	callq  8f <downgrade_write+0x1c>
  8f:	c9                   	leaveq
  90:	c3                   	retq

The new instruction sequence:

0000000000000073 <downgrade_write>:
  73:	55                   	push   %rbp
  74:	ba ff ff ff ff       	mov    $0xffffffff,%edx
  79:	48 89 f8             	mov    %rdi,%rax
  7c:	48 89 e5             	mov    %rsp,%rbp
  7f:	f9                   	stc
  80:	f0 48 11 10          	lock adc %rdx,(%rax)
  84:	79 05                	jns    8b <downgrade_write+0x18>
  86:	e8 00 00 00 00       	callq  8b <downgrade_write+0x18>
  8b:	c9                   	leaveq
  8c:	c3                   	retq

Thus we can save a huge amount of space: namely, the four extra
bytes that the 64-bit constant and its REX prefix require over a
32-bit constant load plus a forced carry.
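
Worked out from the two listings:

  old: mov $0x100000000,%rdx   10 bytes (REX.W prefix + imm64)
  new: mov $0xffffffff,%edx     5 bytes (imm32)
       stc                      1 byte
                               --------
  net saving                    4 bytes (0x91 - 0x8d; the lock add
                                         and lock adc are 4 bytes each)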

Measured performance impact on Xeon cores is nil; 10^7 loops of
either sequence produce no noticeable cycle count difference, with
random variation favoring neither.

Update: measured performance impact on AMD Turion core is also nil.

Signed-off-by: Zachary Amsden <zamsden@redhat.com>
---
 arch/x86/include/asm/asm.h   |    1 +
 arch/x86/include/asm/rwsem.h |   23 ++++++++++++++++++-----
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index b3ed1e1..3744038 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -25,6 +25,7 @@
 #define _ASM_INC	__ASM_SIZE(inc)
 #define _ASM_DEC	__ASM_SIZE(dec)
 #define _ASM_ADD	__ASM_SIZE(add)
+#define _ASM_ADC	__ASM_SIZE(adc)
 #define _ASM_SUB	__ASM_SIZE(sub)
 #define _ASM_XADD	__ASM_SIZE(xadd)
 
diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h
index 606ede1..147adaf 100644
--- a/arch/x86/include/asm/rwsem.h
+++ b/arch/x86/include/asm/rwsem.h
@@ -233,18 +233,31 @@ static inline void __up_write(struct rw_semaphore *sem)
 static inline void __downgrade_write(struct rw_semaphore *sem)
 {
 	asm volatile("# beginning __downgrade_write\n\t"
+#ifdef CONFIG_X86_64
+#if RWSEM_WAITING_BIAS != -0x100000000
+# error "This code assumes RWSEM_WAITING_BIAS == -2^32"
+#endif
+		     "  stc\n\t"
+		     LOCK_PREFIX _ASM_ADC "%2,(%1)\n\t"
+		     /* transitions 0xZZZZZZZZ00000001 -> 0xYYYYYYYY00000001 */
+		     "  jns       1f\n\t"
+		     "  call call_rwsem_downgrade_wake\n"
+		     "1:\n\t"
+		     "# ending __downgrade_write\n"
+		     : "+m" (sem->count)
+		     : "a" (sem), "r" (-RWSEM_WAITING_BIAS-1)
+		     : "memory", "cc");
+#else
 		     LOCK_PREFIX _ASM_ADD "%2,(%1)\n\t"
-		     /*
-		      * transitions 0xZZZZ0001 -> 0xYYYY0001 (i386)
-		      *     0xZZZZZZZZ00000001 -> 0xYYYYYYYY00000001 (x86_64)
-		      */
+		     /* transitions 0xZZZZ0001 -> 0xYYYY0001 */
 		     "  jns       1f\n\t"
 		     "  call call_rwsem_downgrade_wake\n"
 		     "1:\n\t"
 		     "# ending __downgrade_write\n"
 		     : "+m" (sem->count)
-		     : "a" (sem), "er" (-RWSEM_WAITING_BIAS)
+		     : "a" (sem), "i" (-RWSEM_WAITING_BIAS)
 		     : "memory", "cc");
+#endif
 }
 
 /*
-- 
1.6.6



* Re: [PATCH] x86 rwsem optimization extreme
  2010-02-17 21:58 [PATCH] x86 rwsem optimization extreme Zachary Amsden
@ 2010-02-17 22:10 ` Linus Torvalds
  2010-02-17 22:29   ` H. Peter Anvin
  2010-02-17 23:29   ` H. Peter Anvin
  0 siblings, 2 replies; 12+ messages in thread
From: Linus Torvalds @ 2010-02-17 22:10 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Avi Kivity



On Wed, 17 Feb 2010, Zachary Amsden wrote:
>
> The x86 instruction set provides the ability to add an additional
> bit into addition or subtraction by using the carry flag.
> It also provides instructions to directly set or clear the
> carry flag.  By forcibly setting the carry flag, we can then
> represent one particular 64-bit constant, namely
> 
>    0xffffffff + 1 = 0x100000000
> 
> using only 32-bit values.  In particular we can optimize the rwsem
> write lock release by noting it is of exactly this form.

Don't do this.

Just shift the constants down by two, and suddenly you don't need any 
clever tricks, because all the constants fit in 32 bits anyway, 
regardless of sign issues.

So just change the 

	# define RWSEM_ACTIVE_MASK              0xffffffffL

line into

	# define RWSEM_ACTIVE_MASK              0x3fffffffL

and you're done.
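
(Editorial sketch of what that would look like, assuming the header's
existing layout; only the mask value changes, the rest is derived:)

/* sketch: x86_64 constants with the active mask shifted down two bits */
#define RWSEM_ACTIVE_MASK	0x3fffffffL
#define RWSEM_ACTIVE_BIAS	0x00000001L
#define RWSEM_WAITING_BIAS	(-RWSEM_ACTIVE_MASK-1)	/* -0x40000000 */
#define RWSEM_ACTIVE_WRITE_BIAS	(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)

/*
 * -RWSEM_WAITING_BIAS == 0x40000000 now fits in a sign-extended
 * 32-bit immediate, so the existing lock add and "er" constraint
 * work unchanged on 64-bit.
 */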

The cost of 'adc' may happen to be identical in this case, but I suspect 
you didn't test on UP, where the 'lock' prefix goes away. An unlocked 
'add' tends to be faster than an unlocked 'adc'.

(It's possible that some micro-architectures don't care, since it's a 
memory op, and they can see that 'C' is set. But it's a fragile assumption 
that it would always be ok).

			Linus


* Re: [PATCH] x86 rwsem optimization extreme
  2010-02-17 22:10 ` Linus Torvalds
@ 2010-02-17 22:29   ` H. Peter Anvin
  2010-02-17 23:29   ` H. Peter Anvin
  1 sibling, 0 replies; 12+ messages in thread
From: H. Peter Anvin @ 2010-02-17 22:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zachary Amsden, linux-kernel, Thomas Gleixner, Ingo Molnar, x86,
	Avi Kivity

On 02/17/2010 02:10 PM, Linus Torvalds wrote:
> 
> 
> On Wed, 17 Feb 2010, Zachary Amsden wrote:
>>
>> The x86 instruction set provides the ability to add an additional
>> bit into addition or subtraction by using the carry flag.
>> It also provides instructions to directly set or clear the
>> carry flag.  By forcibly setting the carry flag, we can then
>> represent one particular 64-bit constant, namely
>>
>>    0xffffffff + 1 = 0x100000000
>>
>> using only 32-bit values.  In particular we can optimize the rwsem
>> write lock release by noting it is of exactly this form.
> 
> Don't do this.
> 
> Just shift the constants down by two, and suddenly you don't need any 
> clever tricks, because all the constants fit in 32 bits anyway, 
> regardless of sign issues.
> 

Why bother at all?  I thought it mattered when I saw __downgrade_write()
as an inline, but in fact it is only ever used inside the
downgrade_write() out-of-line function, so we're talking about saving
*five bytes* across the whole kernel in the best case.  I vote for
leaving it the way it is and keeping the very slight extra
readability.  There is no point in moving bits around, either.

	-hpa


* Re: [PATCH] x86 rwsem optimization extreme
  2010-02-17 22:10 ` Linus Torvalds
  2010-02-17 22:29   ` H. Peter Anvin
@ 2010-02-17 23:29   ` H. Peter Anvin
  2010-02-18  1:03     ` Zachary Amsden
  2010-02-18  1:53     ` Linus Torvalds
  1 sibling, 2 replies; 12+ messages in thread
From: H. Peter Anvin @ 2010-02-17 23:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zachary Amsden, linux-kernel, Thomas Gleixner, Ingo Molnar, x86,
	Avi Kivity

On 02/17/2010 02:10 PM, Linus Torvalds wrote:
> 
> The cost of 'adc' may happen to be identical in this case, but I suspect 
> you didn't test on UP, where the 'lock' prefix goes away. An unlocked 
> 'add' tends to be faster than an unlocked 'adc'.
> 
> (It's possible that some micro-architectures don't care, since it's a 
> memory op, and they can see that 'C' is set. But it's a fragile assumption 
> that it would always be ok).
> 

FWIW, I don't know of any microarchitecture where adc is slower than
add, *as long as* the setup time for the CF flag is already used up.
However, as I already commented, I don't think this is worth it.  This
inline appears to only be instantiated once, and as such, it saves a
whopping six bytes across the entire kernel.

	-hpa



* Re: [PATCH] x86 rwsem optimization extreme
  2010-02-17 23:29   ` H. Peter Anvin
@ 2010-02-18  1:03     ` Zachary Amsden
  2010-02-18  1:53     ` Linus Torvalds
  1 sibling, 0 replies; 12+ messages in thread
From: Zachary Amsden @ 2010-02-18  1:03 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, linux-kernel, Thomas Gleixner, Ingo Molnar, x86,
	Avi Kivity

>
> On 02/17/2010 02:10 PM, Linus Torvalds wrote:
>    
>> The cost of 'adc' may happen to be identical in this case, but I suspect
>> you didn't test on UP, where the 'lock' prefix goes away. An unlocked
>> 'add' tends to be faster than an unlocked 'adc'.
>>
>> (It's possible that some micro-architectures don't care, since it's a
>> memory op, and they can see that 'C' is set. But it's a fragile assumption
>> that it would always be ok).
>>
>>      
> FWIW, I don't know of any microarchitecture where adc is slower than
> add, *as long as* the setup time for the CF flag is already used up.
> However, as I already commented, I don't think this is worth it.  This
> inline appears to only be instantiated once, and as such, it saves a
> whopping six bytes across the entire kernel.
>
>    

Without the locks,

stc; adc %rdx, (%rax)

vs.

add %rdx, (%rax)

Shows no statistical difference on Intel.
On AMD, the first form is about twice as expensive.
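
(For reference, a minimal user-space harness of the sort that could
produce such a comparison -- an editorial sketch, not the code actually
used; there is no serializing fence around rdtsc, so treat the counts
as rough:)

#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	uint64_t mem = 0, t0, t1;
	uint64_t bias64 = 0x100000000ULL;
	uint64_t bias32 = 0xffffffffULL;
	long i, loops = 10000000;	/* 10^7, as in the patch note */

	t0 = rdtsc();
	for (i = 0; i < loops; i++)
		asm volatile("add %1, %0"
			     : "+m" (mem) : "r" (bias64) : "cc");
	t1 = rdtsc();
	printf("add:     %llu cycles\n", (unsigned long long)(t1 - t0));

	mem = 0;
	t0 = rdtsc();
	for (i = 0; i < loops; i++)
		asm volatile("stc; adc %1, %0"
			     : "+m" (mem) : "r" (bias32) : "cc");
	t1 = rdtsc();
	printf("stc+adc: %llu cycles\n", (unsigned long long)(t1 - t0));
	return 0;
}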

Of course this is all completely useless, but it wouldn't be if the
locks were inlined (which is actually an askable question now).  There
was just so much awesomeness going on with the 64-bit rwsem constructs
that I felt I had to add even more awesomeness to the plate.  For some
definition of awesomeness.

Zach


* Re: [PATCH] x86 rwsem optimization extreme
  2010-02-17 23:29   ` H. Peter Anvin
  2010-02-18  1:03     ` Zachary Amsden
@ 2010-02-18  1:53     ` Linus Torvalds
  2010-02-18  1:59       ` H. Peter Anvin
  1 sibling, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2010-02-18  1:53 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Zachary Amsden, linux-kernel, Thomas Gleixner, Ingo Molnar, x86,
	Avi Kivity



On Wed, 17 Feb 2010, H. Peter Anvin wrote:
> 
> FWIW, I don't know of any microarchitecture where adc is slower than
> add, *as long as* the setup time for the CF flag is already used up.

Oh, I think there are lots.

Look at just about any x86 latency/throughput table, and you'll see:

 - adc latencies are typically much higher than a single cycle

   But you are right that this is likely not an issue on any out-of-order 
   chip, since the 'stc' will schedule perfectly.

 - but adc _throughput_ is also typically much higher, which indicates 
   that even if you do flag renaming, the 'adc' quite likely only 
   schedules in a single ALU unit.

For example, on a Pentium, adc/sbb can only go in the U pipe, and I think 
the same is true of 'stc'. Now, nobody likely cares about Pentiums any 
more, but the point is, 'adc' does often have constraints that a regular 
'add' does not, and there's an example where a 'stc+adc' pair would at the 
very least have to be scheduled with an instruction in between.

		Linus


* Re: [PATCH] x86 rwsem optimization extreme
  2010-02-18  1:53     ` Linus Torvalds
@ 2010-02-18  1:59       ` H. Peter Anvin
  2010-02-18  4:25         ` Zachary Amsden
  0 siblings, 1 reply; 12+ messages in thread
From: H. Peter Anvin @ 2010-02-18  1:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Zachary Amsden, linux-kernel, Thomas Gleixner, Ingo Molnar, x86,
	Avi Kivity

On 02/17/2010 05:53 PM, Linus Torvalds wrote:
>>
>> FWIW, I don't know of any microarchitecture where adc is slower than
>> add, *as long as* the setup time for the CF flag is already used up.
> 
> Oh, I think there are lots.
> 
> Look at just about any x86 latency/throughput table, and you'll see:
> 
>  - adc latencies are typically much higher than a single cycle
> 
>    But you are right that this is likely not an issue on any out-of-order 
>    chip, since the 'stc' will schedule perfectly.
> 

STC actually tends to schedule poorly, since it has a partial register
stall.  In-order or out-of-order doesn't really matter, though; what
matters is that the scoreboarding used for the flags has to settle, or
you will take a huge hit.

>  - but adc _throughput_ is also typically much higher, which indicates 
>    that even if you do flag renaming, the 'adc' quite likely only 
>    schedules in a single ALU unit.
> 
> For example, on a Pentium, adc/sbb can only go in the U pipe, and I think 
> the same is true of 'stc'. Now, nobody likely cares about Pentiums any 
> more, but the point is, 'adc' does often have constraints that a regular 
> 'add' does not, and there's an example of a 'stc+adc' pair would at the 
> very least have to be scheduled with an instruction in between.

No doubt.  I doubt it much matters in this context, but either way I
think the patch is probably a bad idea... for much the same reason as
my incl hack was - since the code isn't actually inlined, saving a
handful of bytes is not the right tradeoff.

	-hpa



* Re: [PATCH] x86 rwsem optimization extreme
  2010-02-18  1:59       ` H. Peter Anvin
@ 2010-02-18  4:25         ` Zachary Amsden
  2010-02-18  8:12           ` Andi Kleen
  0 siblings, 1 reply; 12+ messages in thread
From: Zachary Amsden @ 2010-02-18  4:25 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, linux-kernel, Thomas Gleixner, Ingo Molnar, x86,
	Avi Kivity

>
> On 02/17/2010 05:53 PM, Linus Torvalds wrote:
>    
>>   - but adc _throughput_ is also typically much higher, which indicates
>>     that even if you do flag renaming, the 'adc' quite likely only
>>     schedules in a single ALU unit.
>>
>> For example, on a Pentium, adc/sbb can only go in the U pipe, and I think
>> the same is true of 'stc'. Now, nobody likely cares about Pentiums any
>> more, but the point is, 'adc' does often have constraints that a regular
>> 'add' does not, and there's an example where a 'stc+adc' pair would at the
>> very least have to be scheduled with an instruction in between.
>>      
> No doubt.  I doubt it much matters in this context, but either way I
> think the patch is probably a bad idea... much for the same as my incl
> hack was - since the code isn't actually inline, saving a handful bytes
> is not the right tradeoff.
>
> 	-hpa
>
>    

Incidentally, the cost of putting all the rwsem code inline, using the 
straightforward approach, for git-tip, using defconfig on x86_64 is 3565 
bytes / 20971778 bytes total, or 0.0170%, using gcc 4.4.3.

That's small enough to actually consider it.

Even smaller if you leave trylock as a function... actually no, that 
didn't work, size increased.  I'm guessing many call sites also end up 
calling the explicit form as a fallback.

If you inline only read_lock functions and write release, nope, that 
didn't work either.

If you inline only read_lock functions, that still isn't it.  Many other 
permutations are possible, but I've wasted enough time.

Although, with a more clever inline implementation, if some of the 
constraints to %rdx go away...

Zach


* Re: [PATCH] x86 rwsem optimization extreme
  2010-02-18  4:25         ` Zachary Amsden
@ 2010-02-18  8:12           ` Andi Kleen
  2010-02-18  8:24             ` Zachary Amsden
  0 siblings, 1 reply; 12+ messages in thread
From: Andi Kleen @ 2010-02-18  8:12 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: H. Peter Anvin, Linus Torvalds, linux-kernel, Thomas Gleixner,
	Ingo Molnar, x86, Avi Kivity

Zachary Amsden <zamsden@redhat.com> writes:
>
> Incidentally, the cost of putting all the rwsem code inline, using the
> straightforward approach, for git-tip, using defconfig on x86_64 is
> 3565 bytes / 20971778 bytes total, or 0.0170%, using gcc 4.4.3.

The nice advantage of putting lock code inline is that it gets 
accounted to the caller in all profilers.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] x86 rwsem optimization extreme
  2010-02-18  8:12           ` Andi Kleen
@ 2010-02-18  8:24             ` Zachary Amsden
  2010-02-18  9:29               ` Andi Kleen
  2010-02-18 10:55               ` Ingo Molnar
  0 siblings, 2 replies; 12+ messages in thread
From: Zachary Amsden @ 2010-02-18  8:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: H. Peter Anvin, Linus Torvalds, linux-kernel, Thomas Gleixner,
	Ingo Molnar, x86, Avi Kivity

>
> Zachary Amsden <zamsden@redhat.com> writes:
>> Incidentally, the cost of putting all the rwsem code inline, using the
>> straightforward approach, for git-tip, using defconfig on x86_64 is
>> 3565 bytes / 20971778 bytes total, or 0.0170%, using gcc 4.4.3.
>>      
> The nice advantage of putting lock code inline is that it gets
> accounted to the caller in all profilers.
>
> -Andi
>
>    

Unfortunately, only for the uncontended case.  The hot case still ends 
up in a call to the lock text section.

Zach


* Re: [PATCH] x86 rwsem optimization extreme
  2010-02-18  8:24             ` Zachary Amsden
@ 2010-02-18  9:29               ` Andi Kleen
  2010-02-18 10:55               ` Ingo Molnar
  1 sibling, 0 replies; 12+ messages in thread
From: Andi Kleen @ 2010-02-18  9:29 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Andi Kleen, H. Peter Anvin, Linus Torvalds, linux-kernel,
	Thomas Gleixner, Ingo Molnar, x86, Avi Kivity

On Wed, Feb 17, 2010 at 10:24:58PM -1000, Zachary Amsden wrote:
>>
>> Zachary Amsden <zamsden@redhat.com> writes:
>>> Incidentally, the cost of putting all the rwsem code inline, using the
>>> straightforward approach, for git-tip, using defconfig on x86_64 is
>>> 3565 bytes / 20971778 bytes total, or 0.0170%, using gcc 4.4.3.
>>>      
>> The nice advantage of putting lock code inline is that it gets
>> accounted to the caller in all profilers.
>>
>> -Andi
>>
>>    
>
> Unfortunately, only for the uncontended case.  The hot case still ends up 
> in a call to the lock text section.

I removed those some time ago because it breaks unwinding.
Did that get undone?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] x86 rwsem optimization extreme
  2010-02-18  8:24             ` Zachary Amsden
  2010-02-18  9:29               ` Andi Kleen
@ 2010-02-18 10:55               ` Ingo Molnar
  1 sibling, 0 replies; 12+ messages in thread
From: Ingo Molnar @ 2010-02-18 10:55 UTC (permalink / raw)
  To: Zachary Amsden
  Cc: Andi Kleen, H. Peter Anvin, Linus Torvalds, linux-kernel,
	Thomas Gleixner, Ingo Molnar, x86, Avi Kivity


* Zachary Amsden <zamsden@redhat.com> wrote:

> >
> >Zachary Amsden <zamsden@redhat.com> writes:
> >>Incidentally, the cost of putting all the rwsem code inline, using the
> >>straightforward approach, for git-tip, using defconfig on x86_64 is
> >>3565 bytes / 20971778 bytes total, or 0.0170%, using gcc 4.4.3.
> >The nice advantage of putting lock code inline is that it gets
> >accounted to the caller in all profilers.
> >
> >-Andi
> >
> 
> Unfortunately, only for the uncontended case.  The hot case still ends up 
> in a call to the lock text section.

Nor is it really true that it's 'a problem for profilers' - call graph
recording works just fine; in fact it can be better for call-graph
recording if the locking sites are not sprinkled around the kernel and
inlined.

	Ingo

