* [PATCH 0/8] i386: bitops: Cleanup, sanitize, optimize
@ 2007-07-23 16:05 Satyam Sharma
  2007-07-23 16:05 ` [PATCH 1/8] i386: bitops: Update/correct comments Satyam Sharma
                   ` (8 more replies)
  0 siblings, 9 replies; 92+ messages in thread
From: Satyam Sharma @ 2007-07-23 16:05 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: David Howells, Nick Piggin, Andi Kleen, Andrew Morton, Linus Torvalds

Hi,

There was a lot of bogus stuff that include/asm-i386/bitops.h was doing
that was not required for the correctness of those APIs. All that
superfluous stuff was also needlessly blocking compiler optimizations and
making gcc generate code that wasn't as beautiful as it could otherwise
have been. Hence the following series of cleanups (some trivial, some
surprising, in no particular order); a representative snippet from the
current file follows the list:

* Bogus extended asm constraints (such as specifying immediate-value
  constraint-modifiers to operands constrained to local registers)
* Obsolete / inapplicable-to-x86 / incorrect comments
* Marking "memory" as clobbered for no good reason
* Volatile-casting of memory addresses
  (wholly unnecessary, makes gcc generate bad code)
* Unwarranted use of __asm__ __volatile__ even when those semantics
  are not required
* Unnecessarily harsh definitions of smp_mb__{before, after}_clear_bit()
  (again, this was like *asking* gcc to generate bad code)
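
For reference, here is a representative "before" snippet, reproduced from the
current file (it also appears in the diffs below), that shows several of these
constructs together:

#define ADDR (*(volatile long *) addr)

static inline void set_bit(int nr, volatile unsigned long * addr)
{
	__asm__ __volatile__( LOCK_PREFIX
		"btsl %1,%0"
		:"+m" (ADDR)
		:"Ir" (nr));
}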


These patches fix no bugs, as there were none (or else we'd have known
long before, obviously). This is just an attempt to clean up / bring down
the bogosity level of that file by several microlenats :-)

Almost all the above are also applicable to include/asm-x86_64/bitops.h
so I will send a patch for that also, later. I have zero knowledge of
other arch's so cannot audit them, sorry.


My testbox boots/works fine with all these patches (uptime half an hour)
and the compressed bzImage is smaller by ~2 KB for my .config -- I don't
know about savings in modules. But considering these primitives get
inlined from several places in the tree, I'd expect good savings for
allyesconfig or allmodconfig kind of setups. We've also probably lost
a few cycles from kernel codepaths that use these functions, but I haven't
verified that experimentally.


Satyam


* [PATCH 1/8] i386: bitops: Update/correct comments
  2007-07-23 16:05 [PATCH 0/8] i386: bitops: Cleanup, sanitize, optimize Satyam Sharma
@ 2007-07-23 16:05 ` Satyam Sharma
  2007-07-23 16:05 ` [PATCH 2/8] i386: bitops: Rectify bogus "Ir" constraints Satyam Sharma
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 92+ messages in thread
From: Satyam Sharma @ 2007-07-23 16:05 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: David Howells, Nick Piggin, Andi Kleen, Andrew Morton, Linus Torvalds

From: Satyam Sharma <ssatyam@cse.iitk.ac.in>

[1/8] i386: bitops: Update/correct comments

Just trying to standardize the look of the comments for the various functions
of the bitops API: remove some trailing whitespace here and there, give
separate kernel-doc descriptions to the atomic functions and their
corresponding unlocked variants, remove or explicitly mention what is
inapplicable/applicable to i386, add kernel-doc comments to functions that
lacked them, and do other janitorial work. Only comments are touched in this
patch.

Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Cc: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>

---

 include/asm-i386/bitops.h |  123 +++++++++++++++++++++++++++++++-------------
 1 files changed, 86 insertions(+), 37 deletions(-)

diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index a20fe98..ba8e4bb 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -10,10 +10,20 @@
 
 /*
  * These have to be done with inline assembly: that way the bit-setting
- * is guaranteed to be atomic. All bit operations return 0 if the bit
- * was cleared before the operation and != 0 if it was not.
+ * is guaranteed to be atomic. All bit test operations return an int type:
+ * 0 if the bit was cleared before the operation and != 0 if it was not.
  *
- * bit 0 is the LSB of addr; bit 32 is the LSB of (addr+1).
+ * However, the operand itself is always of unsigned long type, so care
+ * must be taken to ensure that the int type return of bit test operations
+ * for the != 0 case does not truncate the '1' bits. This is not an
+ * issue on i386 where sizeof(int) == sizeof(unsigned long), but we avoid
+ * this pitfall anyway by returning with all 1's or LSB set in this case.
+ *
+ * bit 0 is the LSB of addr; bit 31 is the MSB.
+ *
+ * But note that all functions that take a bit-number argument allow it
+ * to be arbitrarily large; these operations are not restricted to acting
+ * on single-dword quantities.
  */
 
 #define ADDR (*(volatile long *) addr)
@@ -23,15 +33,10 @@
  * @nr: the bit to set
  * @addr: the address to start counting from
  *
- * This function is atomic and may not be reordered.  See __set_bit()
- * if you do not require the atomic guarantees.
+ * set_bit() is atomic and cannot be reordered. On the x86, this includes
+ * an implicit memory barrier.
  *
- * Note: there are no guarantees that this function will not be reordered
- * on non x86 architectures, so if you are writing portable code,
- * make sure not to rely on its reordering guarantees.
- *
- * Note that @nr may be almost arbitrarily large; this function is not
- * restricted to acting on a single-word quantity.
+ * See __set_bit() if you do not require the atomic guarantees.
  */
 static inline void set_bit(int nr, volatile unsigned long * addr)
 {
@@ -49,6 +54,9 @@ static inline void set_bit(int nr, volatile unsigned long * addr)
  * Unlike set_bit(), this function is non-atomic and may be reordered.
  * If it's called on the same region of memory simultaneously, the effect
  * may be that only one operation succeeds.
+ *
+ * This is a fast and unlocked operation, and only suitable for callers
+ * that already implement higher-level locking to protect access.
  */
 static inline void __set_bit(int nr, volatile unsigned long * addr)
 {
@@ -59,14 +67,14 @@ static inline void __set_bit(int nr, volatile unsigned long * addr)
 }
 
 /**
- * clear_bit - Clears a bit in memory
+ * clear_bit - Atomically clear a bit in memory
  * @nr: Bit to clear
  * @addr: Address to start counting from
  *
- * clear_bit() is atomic and may not be reordered.  However, it does
- * not contain a memory barrier, so if it is used for locking purposes,
- * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
- * in order to ensure changes are visible on other processors.
+ * clear_bit() is atomic and cannot be reordered. On the x86, this includes
+ * an implicit memory barrier.
+ *
+ * See __clear_bit() if you do not require the atomic guarantees.
  */
 static inline void clear_bit(int nr, volatile unsigned long * addr)
 {
@@ -76,6 +84,18 @@ static inline void clear_bit(int nr, volatile unsigned long * addr)
 		:"Ir" (nr));
 }
 
+/**
+ * __clear_bit - Clear a bit in memory
+ * @nr: Bit to clear
+ * @addr: Address to start counting from
+ *
+ * Unlike clear_bit(), this function is non-atomic and may be reordered.
+ * If it's called on the same region of memory simultaneously, the effect
+ * may be that only one operation succeeds.
+ *
+ * This is a fast and unlocked operation, and only suitable for callers
+ * that already implement higher-level locking to protect access.
+ */
 static inline void __clear_bit(int nr, volatile unsigned long * addr)
 {
 	__asm__ __volatile__(
@@ -83,6 +103,11 @@ static inline void __clear_bit(int nr, volatile unsigned long * addr)
 		:"+m" (ADDR)
 		:"Ir" (nr));
 }
+
+/*
+ * Bit operations are already serializing on x86.
+ * These must still be defined here for API completeness.
+ */
 #define smp_mb__before_clear_bit()	barrier()
 #define smp_mb__after_clear_bit()	barrier()
 
@@ -94,6 +119,9 @@ static inline void __clear_bit(int nr, volatile unsigned long * addr)
  * Unlike change_bit(), this function is non-atomic and may be reordered.
  * If it's called on the same region of memory simultaneously, the effect
  * may be that only one operation succeeds.
+ *
+ * This is a fast and unlocked operation, and only suitable for callers
+ * that already implement higher-level locking to protect access.
  */
 static inline void __change_bit(int nr, volatile unsigned long * addr)
 {
@@ -104,14 +132,14 @@ static inline void __change_bit(int nr, volatile unsigned long * addr)
 }
 
 /**
- * change_bit - Toggle a bit in memory
+ * change_bit - Atomically toggle a bit in memory
  * @nr: Bit to change
  * @addr: Address to start counting from
  *
- * change_bit() is atomic and may not be reordered. It may be
- * reordered on other architectures than x86.
- * Note that @nr may be almost arbitrarily large; this function is not
- * restricted to acting on a single-word quantity.
+ * change_bit() is atomic and cannot be reordered. On the x86, this includes
+ * an implicit memory barrier.
+ *
+ * See __change_bit() if you do not require the atomic guarantees.
  */
 static inline void change_bit(int nr, volatile unsigned long * addr)
 {
@@ -122,13 +150,14 @@ static inline void change_bit(int nr, volatile unsigned long * addr)
 }
 
 /**
- * test_and_set_bit - Set a bit and return its old value
+ * test_and_set_bit - Atomically set a bit and return its old value
  * @nr: Bit to set
  * @addr: Address to count from
  *
- * This operation is atomic and cannot be reordered.  
- * It may be reordered on other architectures than x86.
- * It also implies a memory barrier.
+ * test_and_set_bit() is atomic and cannot be reordered. On the x86, this
+ * includes an implicit memory barrier.
+ *
+ * See __test_and_set_bit() if you do not require the atomic guarantees.
  */
 static inline int test_and_set_bit(int nr, volatile unsigned long * addr)
 {
@@ -146,9 +175,12 @@ static inline int test_and_set_bit(int nr, volatile unsigned long * addr)
  * @nr: Bit to set
  * @addr: Address to count from
  *
- * This operation is non-atomic and can be reordered.  
+ * Unlike test_and_set_bit(), this function is non-atomic and may be reordered.
  * If two examples of this operation race, one can appear to succeed
- * but actually fail.  You must protect multiple accesses with a lock.
+ * but actually fail.
+ *
+ * This is a fast and unlocked operation, and only suitable for callers
+ * that already implement higher-level locking to protect access.
  */
 static inline int __test_and_set_bit(int nr, volatile unsigned long * addr)
 {
@@ -162,13 +194,14 @@ static inline int __test_and_set_bit(int nr, volatile unsigned long * addr)
 }
 
 /**
- * test_and_clear_bit - Clear a bit and return its old value
+ * test_and_clear_bit - Atomically clear a bit and return its old value
  * @nr: Bit to clear
  * @addr: Address to count from
  *
- * This operation is atomic and cannot be reordered.
- * It can be reorderdered on other architectures other than x86.
- * It also implies a memory barrier.
+ * test_and_clear_bit() is atomic and cannot be reordered. On the x86, this
+ * includes an implicit memory barrier.
+ *
+ * See __test_and_clear_bit() if you do not require the atomic guarantees.
  */
 static inline int test_and_clear_bit(int nr, volatile unsigned long * addr)
 {
@@ -186,9 +219,12 @@ static inline int test_and_clear_bit(int nr, volatile unsigned long * addr)
  * @nr: Bit to clear
  * @addr: Address to count from
  *
- * This operation is non-atomic and can be reordered.  
+ * Unlike test_and_clear_bit(), this function is non-atomic and may be reordered.
  * If two examples of this operation race, one can appear to succeed
- * but actually fail.  You must protect multiple accesses with a lock.
+ * but actually fail.
+ *
+ * This is a fast and unlocked operation, and only suitable for callers
+ * that already implement higher-level locking to protect access.
  */
 static inline int __test_and_clear_bit(int nr, volatile unsigned long *addr)
 {
@@ -201,7 +237,18 @@ static inline int __test_and_clear_bit(int nr, volatile unsigned long *addr)
 	return oldbit;
 }
 
-/* WARNING: non atomic and it can be reordered! */
+/**
+ * __test_and_change_bit - Change a bit and return its old value
+ * @nr: Bit to change
+ * @addr: Address to count from
+ *
+ * Unlike test_and_change_bit(), this function is non-atomic and may be reordered.
+ * If two examples of this operation race, one can appear to succeed
+ * but actually fail.
+ *
+ * This is a fast and unlocked operation, and only suitable for callers
+ * that already implement higher-level locking to protect access.
+ */
 static inline int __test_and_change_bit(int nr, volatile unsigned long *addr)
 {
 	int oldbit;
@@ -214,12 +261,14 @@ static inline int __test_and_change_bit(int nr, volatile unsigned long *addr)
 }
 
 /**
- * test_and_change_bit - Change a bit and return its old value
+ * test_and_change_bit - Atomically change a bit and return its old value
  * @nr: Bit to change
  * @addr: Address to count from
  *
- * This operation is atomic and cannot be reordered.  
- * It also implies a memory barrier.
+ * test_and_change_bit() is atomic and cannot be reordered. On the x86, this
+ * includes an implicit memory barrier.
+ *
+ * See __test_and_change_bit() if you do not require the atomic guarantees.
  */
 static inline int test_and_change_bit(int nr, volatile unsigned long* addr)
 {


* [PATCH 2/8] i386: bitops: Rectify bogus "Ir" constraints
  2007-07-23 16:05 [PATCH 0/8] i386: bitops: Cleanup, sanitize, optimize Satyam Sharma
  2007-07-23 16:05 ` [PATCH 1/8] i386: bitops: Update/correct comments Satyam Sharma
@ 2007-07-23 16:05 ` Satyam Sharma
  2007-07-23 16:10   ` Andi Kleen
  2007-07-23 17:57   ` Linus Torvalds
  2007-07-23 16:05 ` [PATCH 3/8] i386: bitops: Rectify bogus "+m" constraints Satyam Sharma
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 92+ messages in thread
From: Satyam Sharma @ 2007-07-23 16:05 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: David Howells, Nick Piggin, Andi Kleen, Andrew Morton, Linus Torvalds

From: Satyam Sharma <ssatyam@cse.iitk.ac.in>

[2/8] i386: bitops: Rectify bogus "Ir" constraints

The "I" constraint (on the i386 platform) is used to restrict constants to
the 0..31 range, for use with instructions that must deal with bit numbers.

However:
* The "I" constraint modifier is applicable only to immediate-value operands,
  and combining it with "r" is bogus.
* gcc will just ignore "I" if we specify "Ir" anyway and treat it the same as "r".
* Bitops in Linux allow @nr to be arbitrarily large anyway, so we don't want
  to restrict ourselves to @nr < 32 in the first place.

So just kill the bogus "I" constraint modifier.

Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Cc: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>

---

 include/asm-i386/bitops.h |   26 +++++++++++++-------------
 1 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index ba8e4bb..17a302d 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -43,7 +43,7 @@ static inline void set_bit(int nr, volatile unsigned long * addr)
 	__asm__ __volatile__( LOCK_PREFIX
 		"btsl %1,%0"
 		:"+m" (ADDR)
-		:"Ir" (nr));
+		:"r" (nr));
 }
 
 /**
@@ -63,7 +63,7 @@ static inline void __set_bit(int nr, volatile unsigned long * addr)
 	__asm__(
 		"btsl %1,%0"
 		:"+m" (ADDR)
-		:"Ir" (nr));
+		:"r" (nr));
 }
 
 /**
@@ -81,7 +81,7 @@ static inline void clear_bit(int nr, volatile unsigned long * addr)
 	__asm__ __volatile__( LOCK_PREFIX
 		"btrl %1,%0"
 		:"+m" (ADDR)
-		:"Ir" (nr));
+		:"r" (nr));
 }
 
 /**
@@ -101,7 +101,7 @@ static inline void __clear_bit(int nr, volatile unsigned long * addr)
 	__asm__ __volatile__(
 		"btrl %1,%0"
 		:"+m" (ADDR)
-		:"Ir" (nr));
+		:"r" (nr));
 }
 
 /*
@@ -128,7 +128,7 @@ static inline void __change_bit(int nr, volatile unsigned long * addr)
 	__asm__ __volatile__(
 		"btcl %1,%0"
 		:"+m" (ADDR)
-		:"Ir" (nr));
+		:"r" (nr));
 }
 
 /**
@@ -146,7 +146,7 @@ static inline void change_bit(int nr, volatile unsigned long * addr)
 	__asm__ __volatile__( LOCK_PREFIX
 		"btcl %1,%0"
 		:"+m" (ADDR)
-		:"Ir" (nr));
+		:"r" (nr));
 }
 
 /**
@@ -166,7 +166,7 @@ static inline int test_and_set_bit(int nr, volatile unsigned long * addr)
 	__asm__ __volatile__( LOCK_PREFIX
 		"btsl %2,%1\n\tsbbl %0,%0"
 		:"=r" (oldbit),"+m" (ADDR)
-		:"Ir" (nr) : "memory");
+		:"r" (nr) : "memory");
 	return oldbit;
 }
 
@@ -189,7 +189,7 @@ static inline int __test_and_set_bit(int nr, volatile unsigned long * addr)
 	__asm__(
 		"btsl %2,%1\n\tsbbl %0,%0"
 		:"=r" (oldbit),"+m" (ADDR)
-		:"Ir" (nr));
+		:"r" (nr));
 	return oldbit;
 }
 
@@ -210,7 +210,7 @@ static inline int test_and_clear_bit(int nr, volatile unsigned long * addr)
 	__asm__ __volatile__( LOCK_PREFIX
 		"btrl %2,%1\n\tsbbl %0,%0"
 		:"=r" (oldbit),"+m" (ADDR)
-		:"Ir" (nr) : "memory");
+		:"r" (nr) : "memory");
 	return oldbit;
 }
 
@@ -233,7 +233,7 @@ static inline int __test_and_clear_bit(int nr, volatile unsigned long *addr)
 	__asm__(
 		"btrl %2,%1\n\tsbbl %0,%0"
 		:"=r" (oldbit),"+m" (ADDR)
-		:"Ir" (nr));
+		:"r" (nr));
 	return oldbit;
 }
 
@@ -256,7 +256,7 @@ static inline int __test_and_change_bit(int nr, volatile unsigned long *addr)
 	__asm__ __volatile__(
 		"btcl %2,%1\n\tsbbl %0,%0"
 		:"=r" (oldbit),"+m" (ADDR)
-		:"Ir" (nr) : "memory");
+		:"r" (nr) : "memory");
 	return oldbit;
 }
 
@@ -277,7 +277,7 @@ static inline int test_and_change_bit(int nr, volatile unsigned long* addr)
 	__asm__ __volatile__( LOCK_PREFIX
 		"btcl %2,%1\n\tsbbl %0,%0"
 		:"=r" (oldbit),"+m" (ADDR)
-		:"Ir" (nr) : "memory");
+		:"r" (nr) : "memory");
 	return oldbit;
 }
 
@@ -302,7 +302,7 @@ static inline int variable_test_bit(int nr, const volatile unsigned long * addr)
 	__asm__ __volatile__(
 		"btl %2,%1\n\tsbbl %0,%0"
 		:"=r" (oldbit)
-		:"m" (ADDR),"Ir" (nr));
+		:"m" (ADDR),"r" (nr));
 	return oldbit;
 }
 


* [PATCH 3/8] i386: bitops: Rectify bogus "+m" constraints
  2007-07-23 16:05 [PATCH 0/8] i386: bitops: Cleanup, sanitize, optimize Satyam Sharma
  2007-07-23 16:05 ` [PATCH 1/8] i386: bitops: Update/correct comments Satyam Sharma
  2007-07-23 16:05 ` [PATCH 2/8] i386: bitops: Rectify bogus "Ir" constraints Satyam Sharma
@ 2007-07-23 16:05 ` Satyam Sharma
  2007-07-23 16:37   ` Andi Kleen
                     ` (2 more replies)
  2007-07-23 16:05 ` [PATCH 4/8] i386: bitops: Kill volatile-casting of memory addresses Satyam Sharma
                   ` (5 subsequent siblings)
  8 siblings, 3 replies; 92+ messages in thread
From: Satyam Sharma @ 2007-07-23 16:05 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: David Howells, Nick Piggin, Andi Kleen, Andrew Morton, Linus Torvalds

From: Satyam Sharma <ssatyam@cse.iitk.ac.in>

[3/8] i386: bitops: Rectify bogus "+m" constraints

From the gcc manual:

  Extended asm supports input-output or read-write operands. Use the
  constraint character `+' to indicate such an operand and list it with
  the output operands. You should only use read-write operands when the
  constraints for the operand (or the operand in which only some of the
  bits are to be changed) allow a register.

So, using the "+" constraint modifier for memory, like "+m" is bogus.
We must simply specify "=m" which handles the case correctly.

Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Cc: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>

---

 include/asm-i386/bitops.h |   24 ++++++++++++------------
 1 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index 17a302d..d623fd9 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -42,7 +42,7 @@ static inline void set_bit(int nr, volatile unsigned long * addr)
 {
 	__asm__ __volatile__( LOCK_PREFIX
 		"btsl %1,%0"
-		:"+m" (ADDR)
+		:"=m" (ADDR)
 		:"r" (nr));
 }
 
@@ -62,7 +62,7 @@ static inline void __set_bit(int nr, volatile unsigned long * addr)
 {
 	__asm__(
 		"btsl %1,%0"
-		:"+m" (ADDR)
+		:"=m" (ADDR)
 		:"r" (nr));
 }
 
@@ -80,7 +80,7 @@ static inline void clear_bit(int nr, volatile unsigned long * addr)
 {
 	__asm__ __volatile__( LOCK_PREFIX
 		"btrl %1,%0"
-		:"+m" (ADDR)
+		:"=m" (ADDR)
 		:"r" (nr));
 }
 
@@ -100,7 +100,7 @@ static inline void __clear_bit(int nr, volatile unsigned long * addr)
 {
 	__asm__ __volatile__(
 		"btrl %1,%0"
-		:"+m" (ADDR)
+		:"=m" (ADDR)
 		:"r" (nr));
 }
 
@@ -127,7 +127,7 @@ static inline void __change_bit(int nr, volatile unsigned long * addr)
 {
 	__asm__ __volatile__(
 		"btcl %1,%0"
-		:"+m" (ADDR)
+		:"=m" (ADDR)
 		:"r" (nr));
 }
 
@@ -145,7 +145,7 @@ static inline void change_bit(int nr, volatile unsigned long * addr)
 {
 	__asm__ __volatile__( LOCK_PREFIX
 		"btcl %1,%0"
-		:"+m" (ADDR)
+		:"=m" (ADDR)
 		:"r" (nr));
 }
 
@@ -165,7 +165,7 @@ static inline int test_and_set_bit(int nr, volatile unsigned long * addr)
 
 	__asm__ __volatile__( LOCK_PREFIX
 		"btsl %2,%1\n\tsbbl %0,%0"
-		:"=r" (oldbit),"+m" (ADDR)
+		:"=r" (oldbit),"=m" (ADDR)
 		:"r" (nr) : "memory");
 	return oldbit;
 }
@@ -188,7 +188,7 @@ static inline int __test_and_set_bit(int nr, volatile unsigned long * addr)
 
 	__asm__(
 		"btsl %2,%1\n\tsbbl %0,%0"
-		:"=r" (oldbit),"+m" (ADDR)
+		:"=r" (oldbit),"=m" (ADDR)
 		:"r" (nr));
 	return oldbit;
 }
@@ -209,7 +209,7 @@ static inline int test_and_clear_bit(int nr, volatile unsigned long * addr)
 
 	__asm__ __volatile__( LOCK_PREFIX
 		"btrl %2,%1\n\tsbbl %0,%0"
-		:"=r" (oldbit),"+m" (ADDR)
+		:"=r" (oldbit),"=m" (ADDR)
 		:"r" (nr) : "memory");
 	return oldbit;
 }
@@ -232,7 +232,7 @@ static inline int __test_and_clear_bit(int nr, volatile unsigned long *addr)
 
 	__asm__(
 		"btrl %2,%1\n\tsbbl %0,%0"
-		:"=r" (oldbit),"+m" (ADDR)
+		:"=r" (oldbit),"=m" (ADDR)
 		:"r" (nr));
 	return oldbit;
 }
@@ -255,7 +255,7 @@ static inline int __test_and_change_bit(int nr, volatile unsigned long *addr)
 
 	__asm__ __volatile__(
 		"btcl %2,%1\n\tsbbl %0,%0"
-		:"=r" (oldbit),"+m" (ADDR)
+		:"=r" (oldbit),"=m" (ADDR)
 		:"r" (nr) : "memory");
 	return oldbit;
 }
@@ -276,7 +276,7 @@ static inline int test_and_change_bit(int nr, volatile unsigned long* addr)
 
 	__asm__ __volatile__( LOCK_PREFIX
 		"btcl %2,%1\n\tsbbl %0,%0"
-		:"=r" (oldbit),"+m" (ADDR)
+		:"=r" (oldbit),"=m" (ADDR)
 		:"r" (nr) : "memory");
 	return oldbit;
 }


* [PATCH 4/8] i386: bitops: Kill volatile-casting of memory addresses
  2007-07-23 16:05 [PATCH 0/8] i386: bitops: Cleanup, sanitize, optimize Satyam Sharma
                   ` (2 preceding siblings ...)
  2007-07-23 16:05 ` [PATCH 3/8] i386: bitops: Rectify bogus "+m" constraints Satyam Sharma
@ 2007-07-23 16:05 ` Satyam Sharma
  2007-07-23 17:52   ` Linus Torvalds
  2007-07-23 16:05 ` [PATCH 5/8] i386: bitops: Contain warnings fallout from the death of volatiles Satyam Sharma
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 92+ messages in thread
From: Satyam Sharma @ 2007-07-23 16:05 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: David Howells, Nick Piggin, Andi Kleen, Andrew Morton, Linus Torvalds

From: Satyam Sharma <ssatyam@cse.iitk.ac.in>

[4/8] i386: bitops: Kill volatile-casting of memory addresses

All the occurrences of "volatile" that are used to qualify access to the
passed bit-string pointer/address must be removed, because "volatile" is
crazy, doesn't really work anyway, has nothing to do with locking and
atomicity, and is in general wholly unnecessary for these operations:

* In the case of the unlocked variants, we're not providing any guarantees
  whatsoever, so we want the caller to do any necessary locking surrounding
  the use of that function anyway.

* In case of the primitives where we *do* make locking/atomicity/reordering
  (three very different, but related, concepts) guarantees, we're doing that
  properly by using the lock instruction-prefix and "__asm__ __volatile__"
  anyway (and with "addr" always being constrained with "m" in the assembly,
  gcc doesn't bring it into a local register either).

So let's just kill the pointless use of volatile.
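
As a hedged side note (hypothetical helpers, not part of this patch), this is
the kind of code-generation difference a volatile qualification makes for
plain C accesses, which is part of why it is being dropped here:

/*
 * With the volatile qualifier, gcc must perform a separate load of *word
 * for each access; without it, a single load may be reused for both tests.
 */
static inline int two_bits_volatile(const volatile unsigned long *word)
{
	return (*word & 1) && (*word & 2);
}

static inline int two_bits_plain(const unsigned long *word)
{
	return (*word & 1) && (*word & 2);
}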

Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Cc: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>

---

 include/asm-i386/bitops.h |   60 +++++++++++++++++++++------------------------
 1 files changed, 28 insertions(+), 32 deletions(-)

diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index d623fd9..0f5634b 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -26,8 +26,6 @@
  * on single-dword quantities.
  */
 
-#define ADDR (*(volatile long *) addr)
-
 /**
  * set_bit - Atomically set a bit in memory
  * @nr: the bit to set
@@ -38,11 +36,11 @@
  *
  * See __set_bit() if you do not require the atomic guarantees.
  */
-static inline void set_bit(int nr, volatile unsigned long * addr)
+static inline void set_bit(int nr, unsigned long *addr)
 {
 	__asm__ __volatile__( LOCK_PREFIX
 		"btsl %1,%0"
-		:"=m" (ADDR)
+		:"=m" (*addr)
 		:"r" (nr));
 }
 
@@ -58,11 +56,11 @@ static inline void set_bit(int nr, volatile unsigned long * addr)
  * This is a fast and unlocked operation, and only suitable for callers
  * that already implement higher-level locking to protect access.
  */
-static inline void __set_bit(int nr, volatile unsigned long * addr)
+static inline void __set_bit(int nr, unsigned long *addr)
 {
 	__asm__(
 		"btsl %1,%0"
-		:"=m" (ADDR)
+		:"=m" (*addr)
 		:"r" (nr));
 }
 
@@ -76,11 +74,11 @@ static inline void __set_bit(int nr, volatile unsigned long * addr)
  *
  * See __clear_bit() if you do not require the atomic guarantees.
  */
-static inline void clear_bit(int nr, volatile unsigned long * addr)
+static inline void clear_bit(int nr, unsigned long *addr)
 {
 	__asm__ __volatile__( LOCK_PREFIX
 		"btrl %1,%0"
-		:"=m" (ADDR)
+		:"=m" (*addr)
 		:"r" (nr));
 }
 
@@ -96,11 +94,11 @@ static inline void clear_bit(int nr, volatile unsigned long * addr)
  * This is a fast and unlocked operation, and only suitable for callers
  * that already implement higher-level locking to protect access.
  */
-static inline void __clear_bit(int nr, volatile unsigned long * addr)
+static inline void __clear_bit(int nr, unsigned long *addr)
 {
 	__asm__ __volatile__(
 		"btrl %1,%0"
-		:"=m" (ADDR)
+		:"=m" (*addr)
 		:"r" (nr));
 }
 
@@ -123,11 +121,11 @@ static inline void __clear_bit(int nr, volatile unsigned long * addr)
  * This is a fast and unlocked operation, and only suitable for callers
  * that already implement higher-level locking to protect access.
  */
-static inline void __change_bit(int nr, volatile unsigned long * addr)
+static inline void __change_bit(int nr, unsigned long *addr)
 {
 	__asm__ __volatile__(
 		"btcl %1,%0"
-		:"=m" (ADDR)
+		:"=m" (*addr)
 		:"r" (nr));
 }
 
@@ -141,11 +139,11 @@ static inline void __change_bit(int nr, volatile unsigned long * addr)
  *
  * See __change_bit() if you do not require the atomic guarantees.
  */
-static inline void change_bit(int nr, volatile unsigned long * addr)
+static inline void change_bit(int nr, unsigned long *addr)
 {
 	__asm__ __volatile__( LOCK_PREFIX
 		"btcl %1,%0"
-		:"=m" (ADDR)
+		:"=m" (*addr)
 		:"r" (nr));
 }
 
@@ -159,13 +157,13 @@ static inline void change_bit(int nr, volatile unsigned long * addr)
  *
  * See __test_and_set_bit() if you do not require the atomic guarantees.
  */
-static inline int test_and_set_bit(int nr, volatile unsigned long * addr)
+static inline int test_and_set_bit(int nr, unsigned long *addr)
 {
 	int oldbit;
 
 	__asm__ __volatile__( LOCK_PREFIX
 		"btsl %2,%1\n\tsbbl %0,%0"
-		:"=r" (oldbit),"=m" (ADDR)
+		:"=r" (oldbit),"=m" (*addr)
 		:"r" (nr) : "memory");
 	return oldbit;
 }
@@ -182,13 +180,13 @@ static inline int test_and_set_bit(int nr, volatile unsigned long * addr)
  * This is a fast and unlocked operation, and only suitable for callers
  * that already implement higher-level locking to protect access.
  */
-static inline int __test_and_set_bit(int nr, volatile unsigned long * addr)
+static inline int __test_and_set_bit(int nr, unsigned long *addr)
 {
 	int oldbit;
 
 	__asm__(
 		"btsl %2,%1\n\tsbbl %0,%0"
-		:"=r" (oldbit),"=m" (ADDR)
+		:"=r" (oldbit),"=m" (*addr)
 		:"r" (nr));
 	return oldbit;
 }
@@ -203,13 +201,13 @@ static inline int __test_and_set_bit(int nr, volatile unsigned long * addr)
  *
  * See __test_and_clear_bit() if you do not require the atomic guarantees.
  */
-static inline int test_and_clear_bit(int nr, volatile unsigned long * addr)
+static inline int test_and_clear_bit(int nr, unsigned long *addr)
 {
 	int oldbit;
 
 	__asm__ __volatile__( LOCK_PREFIX
 		"btrl %2,%1\n\tsbbl %0,%0"
-		:"=r" (oldbit),"=m" (ADDR)
+		:"=r" (oldbit),"=m" (*addr)
 		:"r" (nr) : "memory");
 	return oldbit;
 }
@@ -226,13 +224,13 @@ static inline int test_and_clear_bit(int nr, volatile unsigned long * addr)
  * This is a fast and unlocked operation, and only suitable for callers
  * that already implement higher-level locking to protect access.
  */
-static inline int __test_and_clear_bit(int nr, volatile unsigned long *addr)
+static inline int __test_and_clear_bit(int nr, unsigned long *addr)
 {
 	int oldbit;
 
 	__asm__(
 		"btrl %2,%1\n\tsbbl %0,%0"
-		:"=r" (oldbit),"=m" (ADDR)
+		:"=r" (oldbit),"=m" (*addr)
 		:"r" (nr));
 	return oldbit;
 }
@@ -249,13 +247,13 @@ static inline int __test_and_clear_bit(int nr, volatile unsigned long *addr)
  * This is a fast and unlocked operation, and only suitable for callers
  * that already implement higher-level locking to protect access.
  */
-static inline int __test_and_change_bit(int nr, volatile unsigned long *addr)
+static inline int __test_and_change_bit(int nr, unsigned long *addr)
 {
 	int oldbit;
 
 	__asm__ __volatile__(
 		"btcl %2,%1\n\tsbbl %0,%0"
-		:"=r" (oldbit),"=m" (ADDR)
+		:"=r" (oldbit),"=m" (*addr)
 		:"r" (nr) : "memory");
 	return oldbit;
 }
@@ -270,13 +268,13 @@ static inline int __test_and_change_bit(int nr, volatile unsigned long *addr)
  *
  * See __test_and_change_bit() if you do not require the atomic guarantees.
  */
-static inline int test_and_change_bit(int nr, volatile unsigned long* addr)
+static inline int test_and_change_bit(int nr, unsigned long *addr)
 {
 	int oldbit;
 
 	__asm__ __volatile__( LOCK_PREFIX
 		"btcl %2,%1\n\tsbbl %0,%0"
-		:"=r" (oldbit),"=m" (ADDR)
+		:"=r" (oldbit),"=m" (*addr)
 		:"r" (nr) : "memory");
 	return oldbit;
 }
@@ -287,22 +285,22 @@ static inline int test_and_change_bit(int nr, volatile unsigned long* addr)
  * @nr: bit number to test
  * @addr: Address to start counting from
  */
-static int test_bit(int nr, const volatile void * addr);
+static int test_bit(int nr, const unsigned long *addr);
 #endif
 
-static __always_inline int constant_test_bit(int nr, const volatile unsigned long *addr)
+static __always_inline int constant_test_bit(int nr, const unsigned long *addr)
 {
 	return ((1UL << (nr & 31)) & (addr[nr >> 5])) != 0;
 }
 
-static inline int variable_test_bit(int nr, const volatile unsigned long * addr)
+static inline int variable_test_bit(int nr, const unsigned long *addr)
 {
 	int oldbit;
 
 	__asm__ __volatile__(
 		"btl %2,%1\n\tsbbl %0,%0"
 		:"=r" (oldbit)
-		:"m" (ADDR),"r" (nr));
+		:"m" (*addr),"r" (nr));
 	return oldbit;
 }
 
@@ -311,8 +309,6 @@ static inline int variable_test_bit(int nr, const volatile unsigned long * addr)
  constant_test_bit((nr),(addr)) : \
  variable_test_bit((nr),(addr)))
 
-#undef ADDR
-
 /**
  * find_first_zero_bit - find the first zero bit in a memory region
  * @addr: The address to start the search at


* [PATCH 5/8] i386: bitops: Contain warnings fallout from the death of volatiles
  2007-07-23 16:05 [PATCH 0/8] i386: bitops: Cleanup, sanitize, optimize Satyam Sharma
                   ` (3 preceding siblings ...)
  2007-07-23 16:05 ` [PATCH 4/8] i386: bitops: Kill volatile-casting of memory addresses Satyam Sharma
@ 2007-07-23 16:05 ` Satyam Sharma
  2007-07-23 16:05 ` [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily Satyam Sharma
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 92+ messages in thread
From: Satyam Sharma @ 2007-07-23 16:05 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: David Howells, Nick Piggin, Andi Kleen, Andrew Morton, Linus Torvalds

From: Satyam Sharma <ssatyam@cse.iitk.ac.in>

[5/8] i386: bitops: Contain warnings fallout from the death of volatiles

The wrappers below, included from all over the tree, re-used "volatile"
just because the bitops used it. With those volatile casts killed, almost
every file ends up crying about:

warning: passing argument 2 of 'set_bit' discards qualifiers from
pointer target type

Silence these bogus warnings by killing volatile casts a second time.

Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Cc: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>

---

[ Cscope tells us there are other such wrappers that didn't show up
  in my test .config -- I'll update them shortly.

  Also, I should probably merge this patch with the previous one.
  Otherwise git-bisecters who hit the window between these two
  patches will be flooded with bogus warnings and would probably
  want to kill me :-) ]

 include/linux/cpumask.h  |    4 ++--
 include/linux/nodemask.h |    4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 23f5514..49f6ed4 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -89,13 +89,13 @@ typedef struct { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
 extern cpumask_t _unused_cpumask_arg_;
 
 #define cpu_set(cpu, dst) __cpu_set((cpu), &(dst))
-static inline void __cpu_set(int cpu, volatile cpumask_t *dstp)
+static inline void __cpu_set(int cpu, cpumask_t *dstp)
 {
 	set_bit(cpu, dstp->bits);
 }
 
 #define cpu_clear(cpu, dst) __cpu_clear((cpu), &(dst))
-static inline void __cpu_clear(int cpu, volatile cpumask_t *dstp)
+static inline void __cpu_clear(int cpu, cpumask_t *dstp)
 {
 	clear_bit(cpu, dstp->bits);
 }
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 52c54a5..81ba056 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -89,13 +89,13 @@ typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;
 extern nodemask_t _unused_nodemask_arg_;
 
 #define node_set(node, dst) __node_set((node), &(dst))
-static inline void __node_set(int node, volatile nodemask_t *dstp)
+static inline void __node_set(int node, nodemask_t *dstp)
 {
 	set_bit(node, dstp->bits);
 }
 
 #define node_clear(node, dst) __node_clear((node), &(dst))
-static inline void __node_clear(int node, volatile nodemask_t *dstp)
+static inline void __node_clear(int node, nodemask_t *dstp)
 {
 	clear_bit(node, dstp->bits);
 }


* [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-23 16:05 [PATCH 0/8] i386: bitops: Cleanup, sanitize, optimize Satyam Sharma
                   ` (4 preceding siblings ...)
  2007-07-23 16:05 ` [PATCH 5/8] i386: bitops: Contain warnings fallout from the death of volatiles Satyam Sharma
@ 2007-07-23 16:05 ` Satyam Sharma
  2007-07-23 16:13   ` Andi Kleen
                     ` (2 more replies)
  2007-07-23 16:06 ` [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__ Satyam Sharma
                   ` (2 subsequent siblings)
  8 siblings, 3 replies; 92+ messages in thread
From: Satyam Sharma @ 2007-07-23 16:05 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: David Howells, Nick Piggin, Andi Kleen, Andrew Morton, Linus Torvalds

From: Satyam Sharma <ssatyam@cse.iitk.ac.in>

[6/8] i386: bitops: Don't mark memory as clobbered unnecessarily

The goal is to let gcc generate good, beautiful, optimized code.

But test_and_set_bit, test_and_clear_bit, __test_and_change_bit,
and test_and_change_bit unnecessarily mark all of memory as clobbered,
thereby preventing gcc from doing perfectly valid optimizations.

The case of __test_and_change_bit() is particularly surprising, given
that it's a variant where we don't make any guarantees at all.

Even for the other three cases, the (only) instruction that accesses
shared memory is btsl/btrl/btcl and requires locking and atomicity.
But we handle that properly already by the use of the lock prefix.
Also, "=m" is already specified in the output operands of the asm
for the passed bit-string pointer, so the gcc optimizer knows what
to do and what not to do anyway.

So there's no point clobbering all of memory in these functions.
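
For context, here is a hedged sketch (hypothetical caller, not from the
patch) of what the "memory" clobber costs the code around a bitop:

/*
 * With the "memory" clobber on test_and_set_bit(), gcc must assume the asm
 * may read or write any memory, so it has to reload *cached for the second
 * access.  Without the clobber, the value already loaded may be reused.
 */
int sum_around_bitop(unsigned long *flags, int *cached)
{
	int before = *cached;

	test_and_set_bit(0, flags);
	return before + *cached;
}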

Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Cc: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>

---

 include/asm-i386/bitops.h |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index 0f5634b..f37b8a2 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -164,7 +164,7 @@ static inline int test_and_set_bit(int nr, unsigned long *addr)
 	__asm__ __volatile__( LOCK_PREFIX
 		"btsl %2,%1\n\tsbbl %0,%0"
 		:"=r" (oldbit),"=m" (*addr)
-		:"r" (nr) : "memory");
+		:"r" (nr));
 	return oldbit;
 }
 
@@ -208,7 +208,7 @@ static inline int test_and_clear_bit(int nr, unsigned long *addr)
 	__asm__ __volatile__( LOCK_PREFIX
 		"btrl %2,%1\n\tsbbl %0,%0"
 		:"=r" (oldbit),"=m" (*addr)
-		:"r" (nr) : "memory");
+		:"r" (nr));
 	return oldbit;
 }
 
@@ -254,7 +254,7 @@ static inline int __test_and_change_bit(int nr, unsigned long *addr)
 	__asm__ __volatile__(
 		"btcl %2,%1\n\tsbbl %0,%0"
 		:"=r" (oldbit),"=m" (*addr)
-		:"r" (nr) : "memory");
+		:"r" (nr));
 	return oldbit;
 }
 
@@ -275,7 +275,7 @@ static inline int test_and_change_bit(int nr, unsigned long *addr)
 	__asm__ __volatile__( LOCK_PREFIX
 		"btcl %2,%1\n\tsbbl %0,%0"
 		:"=r" (oldbit),"=m" (*addr)
-		:"r" (nr) : "memory");
+		:"r" (nr));
 	return oldbit;
 }
 


* [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__
  2007-07-23 16:05 [PATCH 0/8] i386: bitops: Cleanup, sanitize, optimize Satyam Sharma
                   ` (5 preceding siblings ...)
  2007-07-23 16:05 ` [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily Satyam Sharma
@ 2007-07-23 16:06 ` Satyam Sharma
  2007-07-23 16:18   ` Andi Kleen
  2007-07-23 16:23   ` Jeremy Fitzhardinge
  2007-07-23 16:06 ` [PATCH 8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions Satyam Sharma
  2007-07-30 17:57 ` [PATCH 0/8] i386: bitops: Cleanup, sanitize, optimize Denis Vlasenko
  8 siblings, 2 replies; 92+ messages in thread
From: Satyam Sharma @ 2007-07-23 16:06 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: David Howells, Nick Piggin, Andi Kleen, Andrew Morton, Linus Torvalds

From: Satyam Sharma <ssatyam@cse.iitk.ac.in>

[7/8] i386: bitops: Kill needless usage of __asm__ __volatile__

Another oddity I noticed in this file. The semantics of __volatile__
when used to qualify inline __asm__ are that the compiler will not
(1) elide, or, (2) reorder, or, (3) intersperse, our inline asm with
the rest of the generated code.

However, we do not want these guarantees in the unlocked variants of the
bitops functions. Also, note that test_bit() is *always* an unlocked
operation on i386 -- the i386 instruction set disallows the use of the LOCK
prefix with the "btl" instruction anyway, and the CPU will throw an invalid
opcode exception otherwise. Hence, every caller of variable_test_bit() must
have all the required locking implemented at a higher-level anyway and this
operation would necessarily be guarantee-less.

So let's remove the __volatile__ qualification of the inline asm in the
variable_test_bit() function also -- and let's give the compiler a chance
to optimize / combine multiple bitops, if possible.
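
As a hedged example of the kind of combining this is meant to allow
(hypothetical caller, assumed to hold whatever higher-level lock it needs):

/*
 * With the __volatile__ qualifier gone (and no "memory" clobber), gcc may
 * treat the bit test as loop-invariant and hoist it out of the loop,
 * instead of re-issuing the btl on every iteration.
 */
int count_if_bit_set(int nr, const unsigned long *map, const int *v, int n)
{
	int i, sum = 0;

	for (i = 0; i < n; i++)
		if (test_bit(nr, map))
			sum += v[i];
	return sum;
}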

Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Cc: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>

---

 include/asm-i386/bitops.h |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index f37b8a2..4f1fda5 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -96,7 +96,7 @@ static inline void clear_bit(int nr, unsigned long *addr)
  */
 static inline void __clear_bit(int nr, unsigned long *addr)
 {
-	__asm__ __volatile__(
+	__asm__(
 		"btrl %1,%0"
 		:"=m" (*addr)
 		:"r" (nr));
@@ -123,7 +123,7 @@ static inline void __clear_bit(int nr, unsigned long *addr)
  */
 static inline void __change_bit(int nr, unsigned long *addr)
 {
-	__asm__ __volatile__(
+	__asm__(
 		"btcl %1,%0"
 		:"=m" (*addr)
 		:"r" (nr));
@@ -251,7 +251,7 @@ static inline int __test_and_change_bit(int nr, unsigned long *addr)
 {
 	int oldbit;
 
-	__asm__ __volatile__(
+	__asm__(
 		"btcl %2,%1\n\tsbbl %0,%0"
 		:"=r" (oldbit),"=m" (*addr)
 		:"r" (nr));
@@ -297,7 +297,7 @@ static inline int variable_test_bit(int nr, const unsigned long *addr)
 {
 	int oldbit;
 
-	__asm__ __volatile__(
+	__asm__(
 		"btl %2,%1\n\tsbbl %0,%0"
 		:"=r" (oldbit)
 		:"m" (*addr),"r" (nr));


* [PATCH 8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions
  2007-07-23 16:05 [PATCH 0/8] i386: bitops: Cleanup, sanitize, optimize Satyam Sharma
                   ` (6 preceding siblings ...)
  2007-07-23 16:06 ` [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__ Satyam Sharma
@ 2007-07-23 16:06 ` Satyam Sharma
  2007-07-24  3:53   ` Nick Piggin
  2007-07-30 17:57 ` [PATCH 0/8] i386: bitops: Cleanup, sanitize, optimize Denis Vlasenko
  8 siblings, 1 reply; 92+ messages in thread
From: Satyam Sharma @ 2007-07-23 16:06 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: David Howells, Nick Piggin, Andi Kleen, Andrew Morton, Linus Torvalds

From: Satyam Sharma <ssatyam@cse.iitk.ac.in>

[8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions

From Documentation/atomic_ops.txt, those archs that require explicit
memory barriers around clear_bit() must also implement these two interfaces.
However, for i386, clear_bit() is a strict, locked, atomic and
un-reorderable operation and includes an implicit memory barrier already.

But these two interfaces have been wrongly defined as "barrier()", which is
a pointless _compiler optimization_ barrier and only serves to make gcc
not do legitimate optimizations that it could otherwise have done.
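
For reference, barrier() is the usual pure compiler barrier (quoting its
conventional definition from include/linux/compiler-gcc.h, not this patch):

#define barrier() __asm__ __volatile__("": : :"memory")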

So let's make these proper no-ops, because that's exactly what we require
these to be on the i386 platform.

Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Cc: David Howells <dhowells@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>

---

[ A similar optimization needs to be done in the atomic.h also.
  Will submit that patch shortly. ]

 include/asm-i386/bitops.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
index 4f1fda5..42999eb 100644
--- a/include/asm-i386/bitops.h
+++ b/include/asm-i386/bitops.h
@@ -106,8 +106,8 @@ static inline void __clear_bit(int nr, unsigned long *addr)
  * Bit operations are already serializing on x86.
  * These must still be defined here for API completeness.
  */
-#define smp_mb__before_clear_bit()	barrier()
-#define smp_mb__after_clear_bit()	barrier()
+#define smp_mb__before_clear_bit()	do {} while (0)
+#define smp_mb__after_clear_bit()	do {} while (0)
 
 /**
  * __change_bit - Toggle a bit in memory


* Re: [PATCH 2/8] i386: bitops: Rectify bogus "Ir" constraints
  2007-07-23 16:05 ` [PATCH 2/8] i386: bitops: Rectify bogus "Ir" constraints Satyam Sharma
@ 2007-07-23 16:10   ` Andi Kleen
  2007-07-23 16:21     ` Satyam Sharma
  2007-07-23 17:57   ` Linus Torvalds
  1 sibling, 1 reply; 92+ messages in thread
From: Andi Kleen @ 2007-07-23 16:10 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andrew Morton, Linus Torvalds

On Monday 23 July 2007 18:05:38 Satyam Sharma wrote:
> From: Satyam Sharma <ssatyam@cse.iitk.ac.in>
> 
> [2/8] i386: bitops: Rectify bogus "Ir" constraints
> 
> The "I" constraint (on the i386 platform) is used to restrict constants to
> the 0..31 range, for use with instructions that must deal with bit numbers.

It means I or r, not I modified by r. This means either an immediate constant
0..31 or a register, which is correct.

% cat t18.c 

f()
{
        asm("xxx %0" :: "rI" (10));
        asm("yyy %0" :: "rI" (100));
}
% gcc -O2 -S t18.c
% cat t18.s
...
f:
.LFB2:
#APP
        xxx $10
#NO_APP
        movl    $100, %eax
#APP
        yyy %eax
#NO_APP
        ret
.LFE2:
...

-Andi



* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-23 16:05 ` [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily Satyam Sharma
@ 2007-07-23 16:13   ` Andi Kleen
  2007-07-23 16:26     ` Satyam Sharma
  2007-07-23 17:55   ` Linus Torvalds
  2007-07-24  3:57   ` Nick Piggin
  2 siblings, 1 reply; 92+ messages in thread
From: Andi Kleen @ 2007-07-23 16:13 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andrew Morton, Linus Torvalds

On Monday 23 July 2007 18:05:58 Satyam Sharma wrote:
> From: Satyam Sharma <ssatyam@cse.iitk.ac.in>
> 
> [6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
> 
> The goal is to let gcc generate good, beautiful, optimized code.


The first goal is correct code.

The reason for the memory barrier is to prevent other memory references
from being moved over the atomic reference.

e.g. when a bit is used to communicate with another CPU this might be dangerous.
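
A hedged sketch of that pattern (hypothetical names, not from any real code):

/*
 * CPU A publishes data and then sets a flag bit; CPU B consumes the flag
 * and then reads the data.  The "memory" clobber is what stops the
 * *compiler* from moving the store to shared_data below the bitop on
 * CPU A, or the load of shared_data above it on CPU B.
 */
int shared_data;
unsigned long flags;

void publish(void)			/* CPU A */
{
	shared_data = 42;
	test_and_set_bit(0, &flags);
}

int consume(void)			/* CPU B */
{
	if (test_and_clear_bit(0, &flags))
		return shared_data;
	return -1;
}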

I don't think it's a good idea to remove them. It'll likely introduce subtle
bugs. Atomic bitops tend to be so slow anyways that it doesn't make much
difference.

-Andi



* Re: [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__
  2007-07-23 16:06 ` [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__ Satyam Sharma
@ 2007-07-23 16:18   ` Andi Kleen
  2007-07-23 16:22     ` [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__ II Andi Kleen
  2007-07-23 16:32     ` [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__ Satyam Sharma
  2007-07-23 16:23   ` Jeremy Fitzhardinge
  1 sibling, 2 replies; 92+ messages in thread
From: Andi Kleen @ 2007-07-23 16:18 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andrew Morton, Linus Torvalds

On Monday 23 July 2007 18:06:03 Satyam Sharma wrote:
> From: Satyam Sharma <ssatyam@cse.iitk.ac.in>
> 
> [7/8] i386: bitops: Kill needless usage of __asm__ __volatile__
> 
> Another oddity I noticed in this file. The semantics of __volatile__
> when used to qualify inline __asm__ are that the compiler will not
> (1) elide, or, (2) reorder, or, (3) intersperse, our inline asm with
> the rest of the generated code.
> 
> However, we do not want these guarantees in the unlocked variants of the
> bitops functions. 

I thought so too and did a similar transformation while moving
some string functions out of line. After that recent misadventure
I would be very very careful with this.

Overall I'm sorry to say, but the risk:gain ratio of this
patch is imho totally out of whack.

It took a long time to get bitops.h correct and as far as we know
it is compiled correctly currently. You risk all that for very dubious
improvements.

-Andi





* Re: [PATCH 2/8] i386: bitops: Rectify bogus "Ir" constraints
  2007-07-23 16:10   ` Andi Kleen
@ 2007-07-23 16:21     ` Satyam Sharma
  2007-07-23 16:30       ` Andi Kleen
  0 siblings, 1 reply; 92+ messages in thread
From: Satyam Sharma @ 2007-07-23 16:21 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andrew Morton, Linus Torvalds

Hi Andi,


On Mon, 23 Jul 2007, Andi Kleen wrote:

> On Monday 23 July 2007 18:05:38 Satyam Sharma wrote:
> > From: Satyam Sharma <ssatyam@cse.iitk.ac.in>
> > 
> > [2/8] i386: bitops: Rectify bogus "Ir" constraints
> > 
> > The "I" constraint (on the i386 platform) is used to restrict constants to
> > the 0..31 range, for use with instructions that must deal with bit numbers.
> 
> It means I or r, not I modified by r. This means either a immediate constant
> 0..31 or a register, which is correct.
> 
> % cat t18.c 
> 
> f()
> {
>         asm("xxx %0" :: "rI" (10));
>         asm("yyy %0" :: "rI" (100));
> }
> % gcc -O2 -S t18.c
> % cat t18.s
> ...
> f:
> .LFB2:
> #APP
>         xxx $10
> #NO_APP
>         movl    $100, %eax
> #APP
>         yyy %eax
> #NO_APP
>         ret
> .LFE2:
> ...


Whoa, thanks for explaining that to me -- I didn't know, obviously. I had
just written a test program that used "Ir" with an automatic variable
defined in the inline function (as is the case with these bitops) and
observed that even when I gave > 32 values, it would still work -- hence
my conclusion.

However, the patch still stands, does it not? [ I will modify the
changelog, obviously. ] The thing is that we don't want to limit
@nr to <= 31 in the first place, or am I wrong again? :-)

Thanks,
Satyam


* Re: [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__ II
  2007-07-23 16:18   ` Andi Kleen
@ 2007-07-23 16:22     ` Andi Kleen
  2007-07-23 16:32     ` [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__ Satyam Sharma
  1 sibling, 0 replies; 92+ messages in thread
From: Andi Kleen @ 2007-07-23 16:22 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andrew Morton, Linus Torvalds


BTW if you want to optimize inline asm code a bit -- find_first_bit /
find_first_zero_bit / for_each_cpu could really benefit from an optimized
version for cpumask_t-sized bitmaps. That would save a lot of cycles in
some of the hotter paths of the kernel, like the scheduler.

-Andi


* Re: [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__
  2007-07-23 16:06 ` [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__ Satyam Sharma
  2007-07-23 16:18   ` Andi Kleen
@ 2007-07-23 16:23   ` Jeremy Fitzhardinge
  2007-07-23 16:43     ` Satyam Sharma
  1 sibling, 1 reply; 92+ messages in thread
From: Jeremy Fitzhardinge @ 2007-07-23 16:23 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andi Kleen, Andrew Morton, Linus Torvalds

Satyam Sharma wrote:
> From: Satyam Sharma <ssatyam@cse.iitk.ac.in>
>
> [7/8] i386: bitops: Kill needless usage of __asm__ __volatile__
>
> Another oddity I noticed in this file. The semantics of __volatile__
> when used to qualify inline __asm__ are that the compiler will not
> (1) elide, or, (2) reorder, or, (3) intersperse, our inline asm with
> the rest of the generated code.
>   

"asm volatile" does not mean that at all.  It only guarantees (1), and
only then if the asm is ever reachable.

    J


* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-23 16:13   ` Andi Kleen
@ 2007-07-23 16:26     ` Satyam Sharma
  2007-07-23 16:33       ` Andi Kleen
  0 siblings, 1 reply; 92+ messages in thread
From: Satyam Sharma @ 2007-07-23 16:26 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andrew Morton, Linus Torvalds

Hi Andi,


On Mon, 23 Jul 2007, Andi Kleen wrote:

> On Monday 23 July 2007 18:05:58 Satyam Sharma wrote:
> > From: Satyam Sharma <ssatyam@cse.iitk.ac.in>
> > 
> > [6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
> > 
> > The goal is to let gcc generate good, beautiful, optimized code.
> 
> 
> The first goal is correct code.
> 
> The reason for the memory barrier is to prevent other memory references
> from being moved over the atomic reference.
> 
> e.g. when a bit is used to communicate with another CPU this might be dangerous.


Yes, but _that_ address (of the bit-string) is protected already -- by the
implicit memory barrier due to the LOCK prefix. We shouldn't really be
caring about any other memory addresses, so it doesn't affect the
correctness of the bitops API at all.

[ and at least the __test_and_change_bit() case is definitely okay. ]

Satyam


* Re: [PATCH 2/8] i386: bitops: Rectify bogus "Ir" constraints
  2007-07-23 16:21     ` Satyam Sharma
@ 2007-07-23 16:30       ` Andi Kleen
  2007-07-23 16:36         ` Jan Hubicka
  2007-07-23 18:05         ` H. Peter Anvin
  0 siblings, 2 replies; 92+ messages in thread
From: Andi Kleen @ 2007-07-23 16:30 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andrew Morton, Linus Torvalds, jh


> Whoa, thanks for explaining that to me -- I didn't know, obviously. I had
> just written a test program that used "Ir" with an automatic variable
> defined in the inline function (as is the case with these bitops) and
> observed that even when I gave > 32 values, it would still work -- hence
> my conclusion.
> 
> However, the patch still stands, does it not? [ I will modify the
> changelog, obviously. ] The thing is that we don't want to limit
> @nr to <= 31 in the first place, or am I wrong again? :-)

These bit operations only allow 8-bit immediates, so 0..255 would
be correct. 'N' might work, judging from the 4.1 docs, but I don't know if
it works in all old supported gccs (3.2+).

However "I" is definitely not wrong, and most bit numbers are small anyway.

-Andi


* Re: [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__
  2007-07-23 16:18   ` Andi Kleen
  2007-07-23 16:22     ` [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__ II Andi Kleen
@ 2007-07-23 16:32     ` Satyam Sharma
  1 sibling, 0 replies; 92+ messages in thread
From: Satyam Sharma @ 2007-07-23 16:32 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andrew Morton, Linus Torvalds

Hi,

On Mon, 23 Jul 2007, Andi Kleen wrote:

> On Monday 23 July 2007 18:06:03 Satyam Sharma wrote:
> > From: Satyam Sharma <ssatyam@cse.iitk.ac.in>
> > 
> > [7/8] i386: bitops: Kill needless usage of __asm__ __volatile__
> > 
> > Another oddity I noticed in this file. The semantics of __volatile__
> > when used to qualify inline __asm__ are that the compiler will not
> > (1) elide, or, (2) reorder, or, (3) intersperse, our inline asm with
> > the rest of the generated code.
> > 
> > However, we do not want these guarantees in the unlocked variants of the
> > bitops functions. 
> 
> I thought so too and did a similar transformation while moving
> some string functions out of line. After that recent misadventure
> I would be very very careful with this.

Ah, ok, so this must be the case we'd stumbled upon recently on the
other thread. I hadn't got your fix for this, so didn't know.

> Overall I'm sorry to say, but the risk:gain ratio of this
> patch is imho totally out of whack.

The patch does look correct to me, we're only killing the use of
__volatile__ from places that don't require it (the guarantee-less
variants). Without losing it, I really don't see how the compiler
can ever combine multiple bitops, which does sound beneficial when
the caller has already implemented higher-level locking around
the usage of these operations.

Satyam


* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-23 16:26     ` Satyam Sharma
@ 2007-07-23 16:33       ` Andi Kleen
  2007-07-23 17:12         ` Satyam Sharma
  0 siblings, 1 reply; 92+ messages in thread
From: Andi Kleen @ 2007-07-23 16:33 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andrew Morton, Linus Torvalds


> Yes, but _that_ address (of the bit-string) is protected already -- by the
> implicit memory barrier due to the LOCK prefix.

Compiler barrier != CPU barrier.

The memory clobber is a compiler barrier that prevents gcc's global optimizer
from moving memory references. The CPU memory ordering guarantees are completely
independent of this.
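
For illustration, roughly what the two kinds of barrier look like on i386
(a sketch, simplified from the real kernel definitions):

	/* Compiler barrier: emits no instruction at all; the "memory" clobber
	 * only tells gcc's optimizer not to cache or move memory accesses
	 * across this point. */
	#define barrier()	__asm__ __volatile__("" : : : "memory")

	/* CPU barrier: an actual instruction the processor must order around;
	 * a locked RMW like this (or mfence) serializes the CPU's memory
	 * accesses regardless of what the compiler did. */
	#define mb()		__asm__ __volatile__("lock; addl $0,0(%%esp)" : : : "memory")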


> We shouldn't really be 
> caring about any other memory addresses, so it doesn't affect the
> correctness of the bitops API at all.

The problem is the relationship to other operations.

This is not theoretic and we have had bugs because of this.

-Andi

 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2/8] i386: bitops: Rectify bogus "Ir" constraints
  2007-07-23 16:30       ` Andi Kleen
@ 2007-07-23 16:36         ` Jan Hubicka
  2007-07-23 18:05         ` H. Peter Anvin
  1 sibling, 0 replies; 92+ messages in thread
From: Jan Hubicka @ 2007-07-23 16:36 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Satyam Sharma, Linux Kernel Mailing List, David Howells,
	Nick Piggin, Andrew Morton, Linus Torvalds, jh

> 
> > Whoa, thanks for explaining that to me -- I didn't know, obviously. I had
> > just written a test program that used "Ir" with an automatic variable
> > defined in the inline function (as is the case with these bitops) and
> > observed that even when I gave > 32 values, it would still work -- hence
> > my conclusion.
> > 
> > However, the patch still stands, does it not? [ I will modify the
> > changelog, obviously. ] The thing is that we don't want to limit
> > @nr to <= 31 in the first place, or am I wrong again? :-)
> 
> These bit operations only allow 8 bit immediates, so 0..255 would
> be correct. N might work from the 4.1 docs, but I don't know if it works 
> in all old supported gccs (3.2+)

'N' constraint was there pretty much forever, originally intended for
in/out operands.  It is in the 2.95 docs:
http://gcc.gnu.org/onlinedocs/gcc-2.95.3/gcc_16.html#SEC175

Honza

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 3/8] i386: bitops: Rectify bogus "+m" constraints
  2007-07-23 16:05 ` [PATCH 3/8] i386: bitops: Rectify bogus "+m" constraints Satyam Sharma
@ 2007-07-23 16:37   ` Andi Kleen
  2007-07-23 17:15     ` Satyam Sharma
  2007-07-23 17:46   ` Linus Torvalds
  2007-07-24  9:22   ` David Howells
  2 siblings, 1 reply; 92+ messages in thread
From: Andi Kleen @ 2007-07-23 16:37 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andrew Morton, Linus Torvalds, jh

On Monday 23 July 2007 18:05:43 Satyam Sharma wrote:
> From: Satyam Sharma <ssatyam@cse.iitk.ac.in>
> 
> [3/8] i386: bitops: Rectify bogus "+m" constraints
> 
> From the gcc manual:
> 
>   Extended asm supports input-output or read-write operands. Use the
>   constraint character `+' to indicate such an operand and list it with
>   the output operands. You should only use read-write operands when the
>   constraints for the operand (or the operand in which only some of the
>   bits are to be changed) allow a register.
> 
> So, using the "+" constraint modifier for memory, like "+m" is bogus.
> We must simply specify "=m" which handles the case correctly.

I checked with Honza (cc'ed) and he stated that the "+" really is needed,
at least in newer gcc.

-Andi

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__
  2007-07-23 16:23   ` Jeremy Fitzhardinge
@ 2007-07-23 16:43     ` Satyam Sharma
  2007-07-23 17:39       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 92+ messages in thread
From: Satyam Sharma @ 2007-07-23 16:43 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andi Kleen, Andrew Morton, Linus Torvalds

Hi Jeremy,


On Mon, 23 Jul 2007, Jeremy Fitzhardinge wrote:

> Satyam Sharma wrote:
> > From: Satyam Sharma <ssatyam@cse.iitk.ac.in>
> >
> > [7/8] i386: bitops: Kill needless usage of __asm__ __volatile__
> >
> > Another oddity I noticed in this file. The semantics of __volatile__
> > when used to qualify inline __asm__ are that the compiler will not
> > > (1) elide, or, (2) reorder, or, (3) intersperse, our inline asm with
> > the rest of the generated code.
> >   
> 
> "asm volatile" does not mean that at all.  It only guarantees (1),


Actually, you're probably right about (2), but (3)?

From the gcc manual:

<quote>

Similarly, you can't expect a sequence of volatile asm instructions to
remain perfectly consecutive. If you want consecutive output, use a
single asm. Also GCC will perform some optimizations across a volatile
asm instruction, GCC does not "forget everything" when it encounters a
volatile asm instruction the way some other compilers do.

</quote>

I'm reading "Similarly, you can't expect a sequence of volatile asm
instructions to remain perfectly consecutive" to mean they're talking
about something like:

asm volatile(...);
asm volatile(...);
asm volatile(...);

But "If you want consecutive output, use a single asm" probably means:

asm volatile(... multiple instructions here ...);

would actually ensure the code written in there would not be
interspersed ... at least that's how I read it.

[ BTW "Also GCC will perform some optimizations across a volatile
asm instruction ..." is exactly the kind of optimizations we must
allow the compiler to do, but that's not related to point at hand. ]


> and
> only then if the asm is ever reachable.

Yup, of course.

Satyam

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-23 16:33       ` Andi Kleen
@ 2007-07-23 17:12         ` Satyam Sharma
  2007-07-23 17:49           ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 92+ messages in thread
From: Satyam Sharma @ 2007-07-23 17:12 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andrew Morton, Linus Torvalds

Hi,

On Mon, 23 Jul 2007, Andi Kleen wrote:

> 
> > Yes, but _that_ address (of the bit-string) is protected already -- by the
> > implicit memory barrier due to the LOCK prefix.
> 
> Compiler barrier != CPU barrier.

Exactly, but the actual _synchronization_ in all users of the bitops API
should (should, at least, otherwise the bugs lie in the callers) depend
upon the _bit-string_ whose address is passed to us. That could be some
flags/lock/etc in some caller, whatever, but all the synchronization in
those users would be based upon _that_ -- and we're handling it alright
with the full CPU barrier for _that_ address. Also note that in all the
atomic/non-atomic variants, we always constrain the passed pointer to
memory and for the atomic/locked variants at least, I don't really see
how compiler optimizations would get us into trouble (again, for the
_passed bit-string memory address_).

> The memory clobber is a compiler barrier that prevents its global optimizer
> from moving memory references.
> [...]
> > We shouldn't really be 
> > caring about any other memory addresses, so it doesn't affect the
> > correctness of the bitops API at all.
> 
> The problem is the relationship to other operations.


... I don't want to get argumentative, but IMHO, we don't really need
to care about other (other than the passed memory address) references.
The problem with marking all of memory clobbered is that we're
disallowing legitimate optimizations for other memory references.
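
A made-up example of the kind of optimization being lost (the function and
the 'counter' variable are hypothetical):

	int counter;

	void f(unsigned long *flags)
	{
		counter++;
		test_and_set_bit(0, flags);	/* "memory" clobber: gcc must
						 * spill 'counter' before the
						 * asm and reload it after */
		counter++;			/* ...so the two increments can't
						 * be merged into counter += 2 */
	}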

> The CPU memory ordering guarantees are completely 
> independent from this.

Of course.

> This is not theoretic and we have had bugs because of this.

Those could've been due to buggy users, not necessarily the bitops.

Satyam

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 3/8] i386: bitops: Rectify bogus "+m" constraints
  2007-07-23 16:37   ` Andi Kleen
@ 2007-07-23 17:15     ` Satyam Sharma
  0 siblings, 0 replies; 92+ messages in thread
From: Satyam Sharma @ 2007-07-23 17:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andrew Morton, Linus Torvalds, jh

On Mon, 23 Jul 2007, Andi Kleen wrote:

> On Monday 23 July 2007 18:05:43 Satyam Sharma wrote:
> > From: Satyam Sharma <ssatyam@cse.iitk.ac.in>
> > 
> > [3/8] i386: bitops: Rectify bogus "+m" constraints
> > 
> > From the gcc manual:
> > 
> >   Extended asm supports input-output or read-write operands. Use the
> >   constraint character `+' to indicate such an operand and list it with
> >   the output operands. You should only use read-write operands when the
> >   constraints for the operand (or the operand in which only some of the
> >   bits are to be changed) allow a register.
> > 
> > So, using the "+" constraint modifier for memory, like "+m" is bogus.
> > We must simply specify "=m" which handles the case correctly.
> 
> I checked with Honza (cc'ed) and he stated that the + are really needed
> at least in newer gcc.


That extract is from the latest (4.2.1) manual, but they could've
forgotten to update the documentation, of course.

But even then, I'm not sure they could ever make it "really needed" --
that would be a step that breaks all existing code otherwise! :-)

Satyam

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__
  2007-07-23 16:43     ` Satyam Sharma
@ 2007-07-23 17:39       ` Jeremy Fitzhardinge
  2007-07-23 18:07         ` Satyam Sharma
  0 siblings, 1 reply; 92+ messages in thread
From: Jeremy Fitzhardinge @ 2007-07-23 17:39 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andi Kleen, Andrew Morton, Linus Torvalds

Satyam Sharma wrote:
> Hi Jeremy,
>
>
> On Mon, 23 Jul 2007, Jeremy Fitzhardinge wrote:
>
>   
>> Satyam Sharma wrote:
>>     
>>> From: Satyam Sharma <ssatyam@cse.iitk.ac.in>
>>>
>>> [7/8] i386: bitops: Kill needless usage of __asm__ __volatile__
>>>
>>> Another oddity I noticed in this file. The semantics of __volatile__
>>> when used to qualify inline __asm__ are that the compiler will not
>>> (1) elide, or, (2) reorder, or, (3) intersperse, our inline asm with
>>> the rest of the generated code.
>>>   
>>>       
>> "asm volatile" does not mean that at all.  It only guarantees (1),
>>     
>
>
> Actually, you're probably right about (2), but (3)?
>
> From the gcc manual:
>
> <quote>
>
> Similarly, you can't expect a sequence of volatile asm instructions to
> remain perfectly consecutive. If you want consecutive output, use a
> single asm. Also GCC will perform some optimizations across a volatile
> asm instruction, GCC does not "forget everything" when it encounters a
> volatile asm instruction the way some other compilers do.
>
> </quote>
>
> I'm reading "Similarly, you can't expect a sequence of volatile asm
> instructions to remain perfectly consecutive" to mean they're talking
> about something like:
>
> asm volatile(...);
> asm volatile(...);
> asm volatile(...);
>
> But "If you want consecutive output, use a single asm" probably means:
>
> asm volatile(... multiple instructions here ...);
>
> would actually ensure the code written in there would not be
> interspersed ... at least that's how I read it.
>   

I'm not quite sure what your point is.  The paragraph you quoted is
pretty explicit in saying that volatile doesn't prevent an "asm
volatile" from being interspersed with other code, and the example just
before that is explicit in talking about how to use dependencies to
control the ordering of asm volatiles with respect to surrounding code. 
In fact nothing in that section precludes asm volatiles from being
reordered with respect to each other either; you just have to make sure
your dependency constraints are all correct.

    J

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 3/8] i386: bitops: Rectify bogus "+m" constraints
  2007-07-23 16:05 ` [PATCH 3/8] i386: bitops: Rectify bogus "+m" constraints Satyam Sharma
  2007-07-23 16:37   ` Andi Kleen
@ 2007-07-23 17:46   ` Linus Torvalds
  2007-07-24  9:22   ` David Howells
  2 siblings, 0 replies; 92+ messages in thread
From: Linus Torvalds @ 2007-07-23 17:46 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andi Kleen, Andrew Morton



On Mon, 23 Jul 2007, Satyam Sharma wrote:
> 
> [3/8] i386: bitops: Rectify bogus "+m" constraints
> 
> From the gcc manual:
> 
>   Extended asm supports input-output or read-write operands. Use the
>   constraint character `+' to indicate such an operand and list it with
>   the output operands. You should only use read-write operands when the
>   constraints for the operand (or the operand in which only some of the
>   bits are to be changed) allow a register.
> 
> So, using the "+" constraint modifier for memory, like "+m" is bogus.
> We must simply specify "=m" which handles the case correctly.

No. This is wrong.

"=m" means that the new value of the memory location is *set*.

Which means that gcc will potentially optimize away any previous stores to 
that memory location.

And yes, it happens, and yes, we've seen it.

The gcc docs are secondary. They're not updated, they aren't correct, they 
don't reflect reality, and they don't matter. The only correct thing to 
use is "+m" for things like this, or alternatively to just use the 
"memory" specifier at the end and make gcc generate really crappy code.

			Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-23 17:12         ` Satyam Sharma
@ 2007-07-23 17:49           ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 92+ messages in thread
From: Jeremy Fitzhardinge @ 2007-07-23 17:49 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Andi Kleen, Linux Kernel Mailing List, David Howells,
	Nick Piggin, Andrew Morton, Linus Torvalds

Satyam Sharma wrote:
> Exactly, but the actual _synchronization_ in all users of the bitops API
> should (should, at least, otherwise the bugs lie in the callers) depend
> upon the _bit-string_ whose address is passed to us. That could be some
> flags/lock/etc in some caller, whatever, but all the synchronization in
> those users would be based upon _that_ -- and we're handling it alright
> with the full CPU barrier for _that_ address. Also note that in all the
> atomic/non-atomic variants, we always constrain the passed pointer to
> memory and for the atomic/locked variants at least, I don't really see
> how compiler optimizations would get us into trouble (again, for the
> _passed bit-string memory address_).
>   

The compiler doesn't know how large the object at that address is, so
there are limits to how much alias analysis it can do.  If it assumes
you're passing a pointer to a single word then it may fail to order the
bit operation correctly with respect to another operation on another
part of the same bitfield (a memset to clear it, for example).
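
A made-up example of that hazard (the struct and function are hypothetical):

	struct obj {
		unsigned long flags[4];		/* a 128-bit bitmap */
	};

	void init(struct obj *o)
	{
		memset(o->flags, 0, sizeof(o->flags));
		/* If the asm inside __set_bit() constrains only the first word
		 * ("+m" (*o->flags)) and has no "memory" clobber, gcc is free
		 * to treat it as independent of flags[2] and reorder it with
		 * the memset's stores. */
		__set_bit(70, o->flags);	/* bit 70 lives in flags[2] */
	}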

    J

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 4/8] i386: bitops: Kill volatile-casting of memory addresses
  2007-07-23 16:05 ` [PATCH 4/8] i386: bitops: Kill volatile-casting of memory addresses Satyam Sharma
@ 2007-07-23 17:52   ` Linus Torvalds
  2007-07-24  4:19     ` Nick Piggin
  2007-07-24  9:49     ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 92+ messages in thread
From: Linus Torvalds @ 2007-07-23 17:52 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andi Kleen, Andrew Morton



On Mon, 23 Jul 2007, Satyam Sharma wrote:

> 
> [4/8] i386: bitops: Kill volatile-casting of memory addresses

This is wrong.

The "const volatile" is so that you can pass an arbitrary pointer. The 
only kind of abritraty pointer is "const volatile".

In other words, the "volatile" has nothing at all to do with whether the 
memory is volatile or not (the same way "const" has nothing to do with it: 
it's purely a C type *safety* issue, exactly the same way "const" is a 
type safety issue.

A "const" on a pointer doesn't mean that the thing it points to cannot 
change. When you pass a source pointer to "strlen()", it doesn't have to 
be constant. But "strlen()" takes a "const" pointer, because it work son 
constant pointers *too*.

Same deal here.

Admittedly this may be mostly historic, but regardless - the "volatiles" 
are right.

Using volatile on *data* is generally considered incorrect and bad taste, 
but using it in situations like this potentially makes sense.
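
A minimal sketch of that type-safety argument (illustrative only; the names
are made up):

	/* Qualifiers in the prototype only widen what callers may pass:
	 * plain, const and volatile pointers all convert to this parameter
	 * type implicitly, with no warning. */
	static inline int example_test_bit(int nr, const volatile unsigned long *addr)
	{
		return (addr[nr / 32] >> (nr & 31)) & 1;
	}

	unsigned long plain_flags;
	volatile unsigned long hw_flags;

	int g(void)
	{
		return example_test_bit(0, &plain_flags) |	/* fine */
		       example_test_bit(1, &hw_flags);		/* fine, too */
	}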

Of course, if we remove all "volatiles" in data in the kernel (with the 
possible exception of "jiffies"), we can then remove them from function 
declarations too, but it should be done in that order.

			Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-23 16:05 ` [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily Satyam Sharma
  2007-07-23 16:13   ` Andi Kleen
@ 2007-07-23 17:55   ` Linus Torvalds
  2007-07-24  9:52     ` Benjamin Herrenschmidt
  2007-07-24  3:57   ` Nick Piggin
  2 siblings, 1 reply; 92+ messages in thread
From: Linus Torvalds @ 2007-07-23 17:55 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andi Kleen, Andrew Morton



On Mon, 23 Jul 2007, Satyam Sharma wrote:
> 
> The goal is to let gcc generate good, beautiful, optimized code.

No. The point is to let gcc generate *correct* code.

It's either "=m" together with "memory", or it's "+m".

You just introduced a bug.

		Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2/8] i386: bitops: Rectify bogus "Ir" constraints
  2007-07-23 16:05 ` [PATCH 2/8] i386: bitops: Rectify bogus "Ir" constraints Satyam Sharma
  2007-07-23 16:10   ` Andi Kleen
@ 2007-07-23 17:57   ` Linus Torvalds
  2007-07-23 18:14     ` Satyam Sharma
  2007-07-23 18:39     ` H. Peter Anvin
  1 sibling, 2 replies; 92+ messages in thread
From: Linus Torvalds @ 2007-07-23 17:57 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andi Kleen, Andrew Morton



On Mon, 23 Jul 2007, Satyam Sharma wrote:
>
> * The "I" constraint modifier is applicable only to immediate-value operands,
>   and combining it with "r" is bogus.

This is wrong too.

The whole point of a "Ir" modifier is to say that the instruction takes 
*either* an "I" or an "r".

Andrew - the ones I've looked at were all wrong. Please don't take this 
series.

		Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2/8] i386: bitops: Rectify bogus "Ir" constraints
  2007-07-23 16:30       ` Andi Kleen
  2007-07-23 16:36         ` Jan Hubicka
@ 2007-07-23 18:05         ` H. Peter Anvin
  2007-07-23 18:28           ` H. Peter Anvin
  1 sibling, 1 reply; 92+ messages in thread
From: H. Peter Anvin @ 2007-07-23 18:05 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Satyam Sharma, Linux Kernel Mailing List, David Howells,
	Nick Piggin, Andrew Morton, Linus Torvalds, jh

Andi Kleen wrote:
>> Whoa, thanks for explaining that to me -- I didn't know, obviously. I had
>> just written a test program that used "Ir" with an automatic variable
>> defined in the inline function (as is the case with these bitops) and
>> observed that even when I gave > 32 values, it would still work -- hence
>> my conclusion.
>>
>> However, the patch still stands, does it not? [ I will modify the
>> changelog, obviously. ] The thing is that we don't want to limit
>> @nr to <= 31 in the first place, or am I wrong again? :-)
> 
> These bit operations only allow 8 bit immediates, so 0..255 would
> be correct. N might work from the 4.1 docs, but I don't know if it works 
> in all old supported gccs (3.2+)
> 
> However I is definitely not wrong and most bit numbers are small anyways.
> 

"I" is correct.  The Intel documentation on this is highly confusing
(and has bugs in it), but it does unambiguously state:

"Some assemblers support immediate bit offsets larger than 31 by using
the immediate bit offset field in combination with the displacement
field in the memory operand ... The processor will ignore the high-order
bits if they are not zero."  AMD processors might be different for all I
know.

So unless gas is capable of doing this transformation (and it's not as
of binutils-2.17.50.0.6) "I" is what's needed here.

	-hpa


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__
  2007-07-23 17:39       ` Jeremy Fitzhardinge
@ 2007-07-23 18:07         ` Satyam Sharma
  2007-07-23 18:28           ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 92+ messages in thread
From: Satyam Sharma @ 2007-07-23 18:07 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andi Kleen, Andrew Morton, Linus Torvalds

On Mon, 23 Jul 2007, Jeremy Fitzhardinge wrote:

> I'm not quite sure what your point is.

Could be a case of terminology confusion ...

> The paragraph you quoted is
> pretty explicit in saying that volatile doesn't prevent an "asm
> volatile" from being interspersed with other code, and the example just
> before that is explicit in talking about how to use dependencies to
> control the ordering of asm volatiles with respect to surrounding code.

Yes, that was the (2).

> In fact nothing in that section precludes asm volatiles from being
> reordered with respect to each other either; you just have to make sure
> your dependency constraints are all correct.

The (3) as I had originally written / meant was that multiple
instructions in a volatile asm would not get _individually_
interspersed with the rest of the code i.e. be emitted out
_consecutively_. I don't think we need any such guarantees for
the non-atomic variants of those operations, so it's good to
let the compiler have a free hand with what it wants to do,
and optimize/combine multiple bitops as necessary / possible,
which was the original intention.

Satyam

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2/8] i386: bitops: Rectify bogus "Ir" constraints
  2007-07-23 17:57   ` Linus Torvalds
@ 2007-07-23 18:14     ` Satyam Sharma
  2007-07-23 18:32       ` Andi Kleen
  2007-07-23 18:39     ` H. Peter Anvin
  1 sibling, 1 reply; 92+ messages in thread
From: Satyam Sharma @ 2007-07-23 18:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andi Kleen, Andrew Morton

On Mon, 23 Jul 2007, Linus Torvalds wrote:

> On Mon, 23 Jul 2007, Satyam Sharma wrote:
> >
> > * The "I" constraint modifier is applicable only to immediate-value operands,
> >   and combining it with "r" is bogus.
> 
> This is wrong too.
> 
> The whole point of a "Ir" modifier is to say that the instruction takes 
> *either* an "I" or an "r".

Yup, sorry about this one, Andi pointed this out earlier. But the "I"
must still go, I think, for the third reason in that changelog -- it
unnecessarily limits the bit offset to 0..31, but (at least from the
comment up front in that file) we do allow arbitrarily large @nr (up to
255, of course, these instructions won't take anything greater than that).

> Andrew - the ones I've looked at were all wrong. Please don't take this 
> series.

I think I'll rescind the series anyway, a lot of patches turned out to
be wrong -- some due to mis-reading / incorrect gcc docs, others due to
other reasons ... this was just something I did thinking of as a cleanup
anyway, so I don't intend to push or correct this or anything.

Satyam

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__
  2007-07-23 18:07         ` Satyam Sharma
@ 2007-07-23 18:28           ` Jeremy Fitzhardinge
  2007-07-23 20:29             ` Trent Piepho
  0 siblings, 1 reply; 92+ messages in thread
From: Jeremy Fitzhardinge @ 2007-07-23 18:28 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andi Kleen, Andrew Morton, Linus Torvalds

Satyam Sharma wrote:
> The (3) as I had originally written / meant was that multiple
> instructions in a volatile asm would not get _individually_
> interspersed with the rest of the code i.e. be emitted out
> _consecutively_. I don't think we need any such guarantees for
> the non-atomic variants of those operations, so it's good to
> let the compiler have a free hand with what it wants to do,
> and optimize/combine multiple bitops as necessary / possible,
> which was the original intention.
>   

No, a single asm statement is always emitted in one piece.  Gcc doesn't
parse the string other than to do %-substitution to insert arguments, so
it has no way to meaningfully split it up.

    J

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2/8] i386: bitops: Rectify bogus "Ir" constraints
  2007-07-23 18:05         ` H. Peter Anvin
@ 2007-07-23 18:28           ` H. Peter Anvin
  0 siblings, 0 replies; 92+ messages in thread
From: H. Peter Anvin @ 2007-07-23 18:28 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Satyam Sharma, Linux Kernel Mailing List, David Howells,
	Nick Piggin, Andrew Morton, Linus Torvalds, jh

H. Peter Anvin wrote:
> 
> "I" is correct.  The Intel documentation on this is highly confusing
> (and has bugs in it), but it does unambiguously state:
> 
> "Some assemblers support immediate bit offsets larger than 31 by using
> the immediate bit offset field in combination with the displacement
> field in the memory operand ... The processor will ignore the high-order
> bits if they are not zero."  AMD processors might be different for all I
> know.
> 
> So unless gas is capable of doing this transformation (and it's not as
> of binutils-2.17.50.0.6) "I" is what's needed here.
> 

Just tested it on a K8 machine; AMD behaves the same way.  So "I" is
correct, and changing it to "N" would introduce a bug.

The only way to optimize this is by using __builtin_constant_p() and
adjust the offset appropriately.
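
For illustration, roughly what that could look like (a sketch only -- the
helper name is made up, this is not the kernel's code, and the lock prefix
is omitted for brevity):

	static inline void example_constant_set_bit(int nr, volatile unsigned long *addr)
	{
		if (__builtin_constant_p(nr) && nr > 31) {
			addr += nr >> 5;	/* fold the high bits into the
						 * memory operand (32-bit longs)... */
			nr &= 31;		/* ...so the immediate stays in
						 * the 0..31 range of "I" */
		}
		__asm__ __volatile__("btsl %1,%0"
			: "+m" (*addr)
			: "Ir" (nr));
	}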

	-hpa

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2/8] i386: bitops: Rectify bogus "Ir" constraints
  2007-07-23 18:14     ` Satyam Sharma
@ 2007-07-23 18:32       ` Andi Kleen
  0 siblings, 0 replies; 92+ messages in thread
From: Andi Kleen @ 2007-07-23 18:32 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linus Torvalds, Linux Kernel Mailing List, David Howells,
	Nick Piggin, Andrew Morton

On Monday 23 July 2007 20:14:52 Satyam Sharma wrote:
> On Mon, 23 Jul 2007, Linus Torvalds wrote:
> 
> > On Mon, 23 Jul 2007, Satyam Sharma wrote:
> > >
> > > * The "I" constraint modifier is applicable only to immediate-value operands,
> > >   and combining it with "r" is bogus.
> > 
> > This is wrong too.
> > 
> > The whole point of a "Ir" modifier is to say that the instruction takes 
> > *either* an "I" or an "r".
> 
> Yup, sorry about this one, Andi pointed this out earlier. But the "I"
> must still go I think, for the third reason in that changelog -- it
> unnecessarily limits the bit offset to 0..31, but (at least from the
> comment up front in that file) we do allow arbitrarily large @nr (up to
> 255, of course, these instructions won't take anything greater than that).


As HPA pointed out that would risk not being correctly assembled by at 
least some binutils versions

 
> > Andrew - the ones I've looked at were all wrong. Please don't take this 
> > series.
> 
> I think I'll rescind the series anyway, a lot of patches turned out to
> be wrong -- some due to mis-reading / incorrect gcc docs, others due to
> other reasons ... this was just something I did thinking of as a cleanup
> anyway, so I don't intend to push or correct this or anything.

cpumask_t/nodemask_t bitmap optimizations would be useful.

-Andi


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2/8] i386: bitops: Rectify bogus "Ir" constraints
  2007-07-23 17:57   ` Linus Torvalds
  2007-07-23 18:14     ` Satyam Sharma
@ 2007-07-23 18:39     ` H. Peter Anvin
  2007-07-23 18:52       ` Satyam Sharma
  1 sibling, 1 reply; 92+ messages in thread
From: H. Peter Anvin @ 2007-07-23 18:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Satyam Sharma, Linux Kernel Mailing List, David Howells,
	Nick Piggin, Andi Kleen, Andrew Morton

Linus Torvalds wrote:
> 
> On Mon, 23 Jul 2007, Satyam Sharma wrote:
>> * The "I" constraint modifier is applicable only to immediate-value operands,
>>   and combining it with "r" is bogus.
> 
> This is wrong too.
> 
> The whole point of a "Ir" modifier is to say that the instruction takes 
> *either* an "I" or an "r".
> 
> Andrew - the ones I've looked at were all wrong. Please don't take this 
> series.
> 

Incidentally, I just noticed the x86-64 bitops have "dIr" as their
constraint set.  "d" would normally be redundant with "r", and as far as
I know, gcc doesn't prefer one over the other without having "?" or "!"
as part of the constraint, so is "d" a stray or is there some meaning
behind it?

	-hpa

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 2/8] i386: bitops: Rectify bogus "Ir" constraints
  2007-07-23 18:39     ` H. Peter Anvin
@ 2007-07-23 18:52       ` Satyam Sharma
  0 siblings, 0 replies; 92+ messages in thread
From: Satyam Sharma @ 2007-07-23 18:52 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Linux Kernel Mailing List, David Howells,
	Nick Piggin, Andi Kleen, Andrew Morton

On Mon, 23 Jul 2007, H. Peter Anvin wrote:

> Linus Torvalds wrote:
> > 
> > On Mon, 23 Jul 2007, Satyam Sharma wrote:
> >> * The "I" constraint modifier is applicable only to immediate-value operands,
> >>   and combining it with "r" is bogus.
> > 
> > This is wrong too.
> > 
> > The whole point of a "Ir" modifier is to say that the instruction takes 
> > *either* an "I" or an "r".
> > 
> > Andrew - the ones I've looked at were all wrong. Please don't take this 
> > series.
> > 
> 
> Incidentally, I just noticed the x86-64 bitops have "dIr" as their
> constraint set.  "d" would normally be redundant with "r", and as far as
> I know, gcc doesn't prefer one over the other without having "?" or "!"
> as part of the constraint, so is is "d" a stray or is there some meaning
> behind it?


Yup, I had noticed that myself. We would need to use "J" if we want
to limit the offsets to 0..63, but "d" sounds weird / stray indeed,
unless there's yet another undocumented/wrongly-documented mystery
behind this one too ... (I'd like to know even if that's the case,
obviously).

Satyam

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__
  2007-07-23 18:28           ` Jeremy Fitzhardinge
@ 2007-07-23 20:29             ` Trent Piepho
  2007-07-23 20:40               ` Jeremy Fitzhardinge
  2007-07-23 21:30               ` Andi Kleen
  0 siblings, 2 replies; 92+ messages in thread
From: Trent Piepho @ 2007-07-23 20:29 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Satyam Sharma, Linux Kernel Mailing List, David Howells,
	Nick Piggin, Andi Kleen, Andrew Morton, Linus Torvalds

On Mon, 23 Jul 2007, Jeremy Fitzhardinge wrote:
> Satyam Sharma wrote:
> > The (3) as I had originally written / meant was that multiple
> > instructions in a volatile asm would not get _individually_
> > interspersed with the rest of the code i.e. be emitted out
> > _consecutively_. I don't think we need any such guarantees for
> > the non-atomic variants of those operations, so it's good to
> > let the compiler have a free hand with what it wants to do,
> > and optimize/combine multiple bitops as necessary / possible,
> > which was the original intention.
> >
>
> No, a single asm statement is always emitted in one piece.  Gcc doesn't
> parse the string other than to do %-substitution to insert arguments, so
> it has no way to meaningfully split it up.

gcc also tries to count the number of instructions, to guess how large in
bytes the asm block is, as it could make a difference for near vs short
jumps, etc.

I wonder if it also affects the instruction count the inline heuristics
use?

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__
  2007-07-23 20:29             ` Trent Piepho
@ 2007-07-23 20:40               ` Jeremy Fitzhardinge
  2007-07-23 21:06                 ` Trent Piepho
  2007-07-23 21:30               ` Andi Kleen
  1 sibling, 1 reply; 92+ messages in thread
From: Jeremy Fitzhardinge @ 2007-07-23 20:40 UTC (permalink / raw)
  To: Trent Piepho
  Cc: Satyam Sharma, Linux Kernel Mailing List, David Howells,
	Nick Piggin, Andi Kleen, Andrew Morton, Linus Torvalds

Trent Piepho wrote:
> gcc also tries to count the number of instructions, to guess how large in
> bytes the asm block is, as it could make a difference for near vs short
> jumps, etc.
>   

How does it do that?  By looking for \n, ';', etc?

    J

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__
  2007-07-23 20:40               ` Jeremy Fitzhardinge
@ 2007-07-23 21:06                 ` Trent Piepho
  0 siblings, 0 replies; 92+ messages in thread
From: Trent Piepho @ 2007-07-23 21:06 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Satyam Sharma, Linux Kernel Mailing List, David Howells,
	Nick Piggin, Andi Kleen, Andrew Morton, Linus Torvalds

On Mon, 23 Jul 2007, Jeremy Fitzhardinge wrote:
> Trent Piepho wrote:
> > gcc also tries to count the number of instructions, to guess how large in
> > bytes the asm block is, as it could make a difference for near vs short
> > jumps, etc.
> >
>
> How does it do that?  By looking for \n, ';', etc?

Yes:

    Some targets require that GCC track the size of each instruction used in
    order to generate correct code.  Because the final length of an `asm' is
    only known by the assembler, GCC must make an estimate as to how big it
    will be.  The estimate is formed by counting the number of statements in
    the pattern of the `asm' and multiplying that by the length of the longest
    instruction on that processor.  Statements in the `asm' are identified by
    newline characters and whatever statement separator characters are
    supported by the assembler; on most processors this is the ``;''
    character.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__
  2007-07-23 20:29             ` Trent Piepho
  2007-07-23 20:40               ` Jeremy Fitzhardinge
@ 2007-07-23 21:30               ` Andi Kleen
  2007-07-23 21:48                 ` Nicholas Miell
  1 sibling, 1 reply; 92+ messages in thread
From: Andi Kleen @ 2007-07-23 21:30 UTC (permalink / raw)
  To: Trent Piepho
  Cc: Jeremy Fitzhardinge, Satyam Sharma, Linux Kernel Mailing List,
	David Howells, Nick Piggin, Andrew Morton, Linus Torvalds


> gcc also tries to count the number of instructions, to guess how large in
> bytes the asm block is, as it could make a difference for near vs short
> jumps, etc.

Are you sure? I doubt it. It would need a full asm parser to do this
properly, and even then it could be wrong
(e.g. when the sections are switched like Linux does extensively)  

gcc doesn't have such a parser.

Also, on x86 gcc doesn't need to care about long/short anyway because
the assembler takes care of this, and the other instruction that cared
about this (loop) isn't generated anymore.

You're probably confusing it with some other compiler, which sometimes
does this, e.g. the Microsoft inline asm syntax requires assembler parsing
in the compiler.

> I wonder if it also affects the instruction count the inline heuristics
> use?

AFAIK it counts like one operand.

-Andi

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__
  2007-07-23 21:30               ` Andi Kleen
@ 2007-07-23 21:48                 ` Nicholas Miell
  0 siblings, 0 replies; 92+ messages in thread
From: Nicholas Miell @ 2007-07-23 21:48 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Trent Piepho, Jeremy Fitzhardinge, Satyam Sharma,
	Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andrew Morton, Linus Torvalds

On Mon, 2007-07-23 at 23:30 +0200, Andi Kleen wrote:
> > gcc also tries to count the number of instructions, to guess how large in
> > bytes the asm block is, as it could make a difference for near vs short
> > jumps, etc.
> 
> Are you sure? I doubt it. It would need a full asm parser to do this
> properly and then even it could be wrong 
> (e.g. when the sections are switched like Linux does extensively)  
> 
> gcc doesn't have such a parser.
> 
> Also on x86 gcc doesn't need to care about long/short anyways because
> the assembler takes care of this and the other instructions who cared
> about this (loop) isn't generated anymore.
> 
> You're probably confusing it with some other compiler, who sometimes
> do this. e.g. the Microsoft inline asm syntax requires assembler parsing
> in the compiler.
> 
> > I wonder if it also affects the instruction count the inline heuristics
> > use?
> 
> AFAIK it counts like one operand.
> 
> -Andi

GCC counts newlines and semicolons and uses that number as the likely
instruction count.

See asm_insn_count() in gcc/gcc/final.c

-- 
Nicholas Miell <nmiell@comcast.net>


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions
  2007-07-23 16:06 ` [PATCH 8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions Satyam Sharma
@ 2007-07-24  3:53   ` Nick Piggin
  2007-07-24  7:34     ` Satyam Sharma
  0 siblings, 1 reply; 92+ messages in thread
From: Nick Piggin @ 2007-07-24  3:53 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Andi Kleen,
	Andrew Morton, Linus Torvalds

Satyam Sharma wrote:
> From: Satyam Sharma <ssatyam@cse.iitk.ac.in>
> 
> [8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions
> 
> From Documentation/atomic_ops.txt, those archs that require explicit
> memory barriers around clear_bit() must also implement these two interfaces.
> However, for i386, clear_bit() is a strict, locked, atomic and
> un-reorderable operation and includes an implicit memory barrier already.
> 
> But these two functions have been wrongly defined as "barrier()" which is
> a pointless _compiler optimization_ barrier, and only serves to make gcc
> not do legitimate optimizations that it could have otherwise done.
> 
> So let's make these proper no-ops, because that's exactly what we require
> these to be on the i386 platform.

No. clear_bit is not a compiler barrier on i386, thus smp_mb__before/after
must be.
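
For reference, a rough sketch of why (the caller code below is hypothetical):
barrier() is purely a compiler barrier, while clear_bit()'s asm constrains
only the word it touches, so something must stop gcc from moving unrelated
accesses across it:

	#define barrier()	__asm__ __volatile__("" : : : "memory")

	/* hypothetical caller */
	clear_bit(PENDING_BIT, &flags);
	smp_mb__after_clear_bit();	/* must be at least barrier(): the
					 * clear_bit() asm has no "memory"
					 * clobber, so without this gcc could
					 * hoist the load of 'queued' above
					 * the clear_bit() */
	if (queued)
		wake_somebody();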


> 
> Signed-off-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Nick Piggin <nickpiggin@yahoo.com.au>
> 
> ---
> 
> [ A similar optimization needs to be done in the atomic.h also.
>   Will submit that patch shortly. ]
> 
>  include/asm-i386/bitops.h |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
> index 4f1fda5..42999eb 100644
> --- a/include/asm-i386/bitops.h
> +++ b/include/asm-i386/bitops.h
> @@ -106,8 +106,8 @@ static inline void __clear_bit(int nr, unsigned long *addr)
>   * Bit operations are already serializing on x86.
>   * These must still be defined here for API completeness.
>   */
> -#define smp_mb__before_clear_bit()	barrier()
> -#define smp_mb__after_clear_bit()	barrier()
> +#define smp_mb__before_clear_bit()	do {} while (0)
> +#define smp_mb__after_clear_bit()	do {} while (0)
>  
>  /**
>   * __change_bit - Toggle a bit in memory
> 


-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-23 16:05 ` [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily Satyam Sharma
  2007-07-23 16:13   ` Andi Kleen
  2007-07-23 17:55   ` Linus Torvalds
@ 2007-07-24  3:57   ` Nick Piggin
  2007-07-24  6:38     ` Satyam Sharma
  2007-07-24  9:44     ` David Howells
  2 siblings, 2 replies; 92+ messages in thread
From: Nick Piggin @ 2007-07-24  3:57 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Andi Kleen,
	Andrew Morton, Linus Torvalds

Satyam Sharma wrote:
> From: Satyam Sharma <ssatyam@cse.iitk.ac.in>
> 
> [6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
> 
> The goal is to let gcc generate good, beautiful, optimized code.
> 
> But test_and_set_bit, test_and_clear_bit, __test_and_change_bit,
> and test_and_change_bit unnecessarily mark all of memory as clobbered,
> thereby preventing gcc from doing perfectly valid optimizations.
> 
> The case of __test_and_change_bit() is particularly surprising, given
> that it's a variant where we don't make any guarantees at all.

__test_and_change_bit is one that you could remove the memory clobber
from.

> ---
> 
>  include/asm-i386/bitops.h |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/asm-i386/bitops.h b/include/asm-i386/bitops.h
> index 0f5634b..f37b8a2 100644
> --- a/include/asm-i386/bitops.h
> +++ b/include/asm-i386/bitops.h
> @@ -254,7 +254,7 @@ static inline int __test_and_change_bit(int nr, unsigned long *addr)
>  	__asm__ __volatile__(
>  		"btcl %2,%1\n\tsbbl %0,%0"
>  		:"=r" (oldbit),"=m" (*addr)
> -		:"r" (nr) : "memory");
> +		:"r" (nr));
>  	return oldbit;
>  }
>  


-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 4/8] i386: bitops: Kill volatile-casting of memory addresses
  2007-07-23 17:52   ` Linus Torvalds
@ 2007-07-24  4:19     ` Nick Piggin
  2007-07-24  6:23       ` Satyam Sharma
  2007-07-24  9:49     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 92+ messages in thread
From: Nick Piggin @ 2007-07-24  4:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Satyam Sharma, Linux Kernel Mailing List, David Howells,
	Andi Kleen, Andrew Morton

Linus Torvalds wrote:
> 
> On Mon, 23 Jul 2007, Satyam Sharma wrote:
> 
> 
>>[4/8] i386: bitops: Kill volatile-casting of memory addresses
> 
> 
> This is wrong.
> 
> The "const volatile" is so that you can pass an arbitrary pointer. The 
> only kind of arbitrary pointer is "const volatile".
> 
> In other words, the "volatile" has nothing at all to do with whether the 
> memory is volatile or not (the same way "const" has nothing to do with it: 
> it's purely a C type *safety* issue, exactly the same way "const" is a 
> type safety issue.
> 
> A "const" on a pointer doesn't mean that the thing it points to cannot 
> change. When you pass a source pointer to "strlen()", it doesn't have to 
> be constant. But "strlen()" takes a "const" pointer, because it work son 
> constant pointers *too*.
> 
> Same deal here.
> 
> Admittedly this may be mostly historic, but regardless - the "volatiles" 
> are right.
> 
> Using volatile on *data* is generally considered incorrect and bad taste, 
> but using it in situations like this potentially makes sense.
> 
> Of course, if we remove all "volatiles" in data in the kernel (with the 
> possible exception of "jiffies"), we can then remove them from function 
> declarations too, but it should be done in that order.

Well, regardless, it still forces the function to treat the pointer
target as volatile, won't it? It definitely prevents valid optimisations
that would be useful for me in mm/page_alloc.c where page flags are
being set up or torn down or checked with non-atomic bitops.

OK, not the i386 functions as much because they are all in asm anyway,
but in general (btw. why does i386 or any architecture define their own
non-atomic bitops? If the version in asm-generic/bitops/non-atomic.h
is not good enough then surely it is a bug in gcc or that file?)

Anyway by type safety, do you mean it will stop the compiler from
warning if a pointer to a volatile is passed to the bitop? If so, then
why don't we just kill all the volatiles out of here and fix any
warnings that comeup? I doubt there would be many, and of those, some
might show up real synchronisation problems.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 4/8] i386: bitops: Kill volatile-casting of memory addresses
  2007-07-24  4:19     ` Nick Piggin
@ 2007-07-24  6:23       ` Satyam Sharma
  2007-07-24  7:16         ` Nick Piggin
  0 siblings, 1 reply; 92+ messages in thread
From: Satyam Sharma @ 2007-07-24  6:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Linux Kernel Mailing List, David Howells,
	Andi Kleen, Andrew Morton

On Tue, 24 Jul 2007, Nick Piggin wrote:

> Linus Torvalds wrote:
> > 
> > On Mon, 23 Jul 2007, Satyam Sharma wrote:
> > 
> > 
> > > [4/8] i386: bitops: Kill volatile-casting of memory addresses
> > 
> > 
> > This is wrong.
> > 
> > The "const volatile" is so that you can pass an arbitrary pointer. The only
> > kind of arbitrary pointer is "const volatile".
> > 
> > In other words, the "volatile" has nothing at all to do with whether the
> > memory is volatile or not (the same way "const" has nothing to do with it:
> > it's purely a C type *safety* issue, exactly the same way "const" is a type
> > safety issue.
> > 
> > A "const" on a pointer doesn't mean that the thing it points to cannot
> > change. When you pass a source pointer to "strlen()", it doesn't have to be
> > constant. But "strlen()" takes a "const" pointer, because it work son
> > constant pointers *too*.
> > 
> > Same deal here.
> > 
> > Admittedly this may be mostly historic, but regardless - the "volatiles" are
> > right.
> > 
> > Using volatile on *data* is generally considered incorrect and bad taste,
> > but using it in situations like this potentially makes sense.
> > 
> > Of course, if we remove all "volatiles" in data in the kernel (with the
> > possible exception of "jiffies"), we can then remove them from function
> > declarations too, but it should be done in that order.
> 
> Well, regardless, it still forces the function to treat the pointer
> target as volatile, won't it? It definitely prevents valid optimisations
> that would be useful for me in mm/page_alloc.c where page flags are
> being set up or torn down or checked with non-atomic bitops.

Yes, and yes. But I think what he meant there is that we'd need to
audit the kernel for all users of set_bit and friends and see if callers
actually pass in any _data_ that _is_ volatile. So we have to kill them
there first, and then in the function declarations here. I think I'll put
that on my long-term todo list, but see below.

> Anyway by type safety, do you mean it will stop the compiler from
> warning if a pointer to a volatile is passed to the bitop?

The compiler would start warning for all those cases (passing in
a pointer to volatile data, when the bitops have lost the volatile
casting from their function declarations), actually. Something like
"passing argument discards qualifiers from pointer type" ... but
considering I didn't see *any* of those warnings after these patches,
I'm confused as to what exactly Linus meant here ... and what exactly
do we need to do "kill the volatiles".

> If so, then
> why don't we just kill all the volatiles out of here and fix any
> warnings that come up? I doubt there would be many, and of those, some
> might show up real synchronisation problems.

Yes, but see above.

> OK, not the i386 functions as much because they are all in asm anyway,
> but in general (btw. why does i386 or any architecture define their own
> non-atomic bitops? If the version in asm-generic/bitops/non-atomic.h
> is not good enough then surely it is a bug in gcc or that file?)



Satyam

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24  3:57   ` Nick Piggin
@ 2007-07-24  6:38     ` Satyam Sharma
  2007-07-24  7:24       ` Nick Piggin
  2007-07-24  9:44     ` David Howells
  1 sibling, 1 reply; 92+ messages in thread
From: Satyam Sharma @ 2007-07-24  6:38 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Kernel Mailing List, David Howells, Andi Kleen,
	Andrew Morton, Linus Torvalds

On Tue, 24 Jul 2007, Nick Piggin wrote:

> Satyam Sharma wrote:
> > From: Satyam Sharma <ssatyam@cse.iitk.ac.in>
> > 
> > [6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
> > 
> > The goal is to let gcc generate good, beautiful, optimized code.
> > 
> > But test_and_set_bit, test_and_clear_bit, __test_and_change_bit,
> > and test_and_change_bit unnecessarily mark all of memory as clobbered,
> > thereby preventing gcc from doing perfectly valid optimizations.
> > 
> > The case of __test_and_change_bit() is particularly surprising, given
> > that it's a variant where we don't make any guarantees at all.
> 
> __test_and_change_bit is one that you could remove the memory clobber
> from.

Yes, for the atomic versions we don't care if we're asking gcc to
generate trashy code (even though I'd have wanted to only disallow
problematic optimizations -- ones involving the passed bit-string
address -- there, and allow other memory references to be optimized
as and how the compiler feels like it) because the atomic variants
are slow anyway and we probably want to be extra-safe there.

But for the non-atomic variants, it does make sense to remove the
memory clobber (and the unneeded __asm__ __volatile__ that another
patch did -- for the non-atomic variants, again).

OTOH, as per Linus' review it seems we can drop the "memory" clobber
and specify the output operand for the extended asm as "+m". But I
must admit I didn't quite understand that at all.

[ I should probably start reading gcc sources, the docs are said to
  be insufficient/out-of-date, as per the reviews of the patches. ]


Satyam

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 4/8] i386: bitops: Kill volatile-casting of memory addresses
  2007-07-24  6:23       ` Satyam Sharma
@ 2007-07-24  7:16         ` Nick Piggin
  0 siblings, 0 replies; 92+ messages in thread
From: Nick Piggin @ 2007-07-24  7:16 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linus Torvalds, Linux Kernel Mailing List, David Howells,
	Andi Kleen, Andrew Morton

Satyam Sharma wrote:
> On Tue, 24 Jul 2007, Nick Piggin wrote:
> 
> 
>>Linus Torvalds wrote:
>>

>>>Of course, if we remove all "volatiles" in data in the kernel (with the
>>>possible exception of "jiffies"), we can then remove them from function
>>>declarations too, but it should be done in that order.
>>
>>Well, regardless, it still forces the function to treat the pointer
>>target as volatile, won't it? It definitely prevents valid optimisations
>>that would be useful for me in mm/page_alloc.c where page flags are
>>being set up or torn down or checked with non-atomic bitops.
> 
> 
> Yes, and yes. But I think what he meant there is that we'd need to
> audit the kernel for all users of set_bit and friends and see if callers
> actually pass in any _data_ that _is_ volatile. So we have to kill them
> there first, and then in the function declarations here. I think I'll put
> that on my long-term todo list, but see below.

Yeah that is probably what he meant.


>>Anyway by type safety, do you mean it will stop the compiler from
>>warning if a pointer to a volatile is passed to the bitop?
> 
> 
> The compiler would start warning for all those cases (passing in
> a pointer to volatile data, when the bitops have lost the volatile
> casting from their function declarations), actually. Something like
> "passing argument discards qualifiers from pointer type" ... but
> considering I didn't see *any* of those warnings after these patches,
> I'm confused as to what exactly Linus meant here ... and what exactly
> do we need to do "kill the volatiles".

Because even with an allyesconfig, your compile isn't testing the
entire kernel. So given the relatively minor benefit of removing
the volatiles, I suppose we shouldn't risk slipping a bug in. If
you can make gcc throw an error in that case it would be a different
story.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24  6:38     ` Satyam Sharma
@ 2007-07-24  7:24       ` Nick Piggin
  2007-07-24  8:29         ` Satyam Sharma
  2007-07-24  8:38         ` Trent Piepho
  0 siblings, 2 replies; 92+ messages in thread
From: Nick Piggin @ 2007-07-24  7:24 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Andi Kleen,
	Andrew Morton, Linus Torvalds

Satyam Sharma wrote:
> On Tue, 24 Jul 2007, Nick Piggin wrote:
> 
> 
>>Satyam Sharma wrote:
>>
>>>From: Satyam Sharma <ssatyam@cse.iitk.ac.in>
>>>
>>>[6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
>>>
>>>The goal is to let gcc generate good, beautiful, optimized code.
>>>
>>>But test_and_set_bit, test_and_clear_bit, __test_and_change_bit,
>>>and test_and_change_bit unnecessarily mark all of memory as clobbered,
>>>thereby preventing gcc from doing perfectly valid optimizations.
>>>
>>>The case of __test_and_change_bit() is particularly surprising, given
>>>that it's a variant where we don't make any guarantees at all.
>>
>>__test_and_change_bit is one that you could remove the memory clobber
>>from.
> 
> 
> Yes, for the atomic versions we don't care if we're asking gcc to
> generate trashy code (even though I'd have wanted to only disallow
> problematic optimizations -- ones involving the passed bit-string
> address -- there, and allow other memory references to be optimized
> as and how the compiler feels like it) because the atomic variants
> are slow anyway and we probably want to be extra-safe there.
> 
> But for the non-atomic variants, it does make sense to remove the
> memory clobber (and the unneeded __asm__ __volatile__ that another
> patch did -- for the non-atomic variants, again).

No. It has nothing to do with atomicity and everything to do with ordering.
For example test_bit, clear_bit, set_bit, etc. are all atomic, but none
place any restrictions on ordering.

__test_and_change_bit has no restriction on ordering, so as long as
the correct operands are clobbered, a "memory" clobber to enforce a
compiler barrier is not needed.
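
A sketch of what that would look like (illustrative; this uses the "+m" and
"Ir" forms discussed elsewhere in the thread instead of "=m" plus a clobber):

	static inline int __test_and_change_bit(int nr, unsigned long *addr)
	{
		int oldbit;

		/* No LOCK prefix and no "memory" clobber: only *addr is
		 * constrained, so gcc remains free to reorder or combine
		 * unrelated memory accesses around this. */
		__asm__(
			"btcl %2,%1\n\tsbbl %0,%0"
			: "=r" (oldbit), "+m" (*addr)
			: "Ir" (nr));
		return oldbit;
	}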


> OTOH, as per Linus' review it seems we can drop the "memory" clobber
> and specify the output operand for the extended asm as "+m". But I
> must admit I didn't quite understand that at all.

Yes, but that is for cases where "memory" is being used to say that
some otherwise unspecified memory is actually being changed, rather
than to provide a compiler barrier as is the case for test_and_set_bit
and co.


> [ I should probably start reading gcc sources, the docs are said to
>   be insufficient/out-of-date, as per the reviews of the patches. ]

You should read Documentation/atomic_ops.txt and memory-barriers.txt,
which are quite useful and should be uptodate.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions
  2007-07-24  3:53   ` Nick Piggin
@ 2007-07-24  7:34     ` Satyam Sharma
  2007-07-24  7:48       ` Jeremy Fitzhardinge
  2007-07-24  8:20       ` Nick Piggin
  0 siblings, 2 replies; 92+ messages in thread
From: Satyam Sharma @ 2007-07-24  7:34 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Kernel Mailing List, David Howells, Andi Kleen,
	Andrew Morton, Linus Torvalds

On Tue, 24 Jul 2007, Nick Piggin wrote:

> Satyam Sharma wrote:
> > From: Satyam Sharma <ssatyam@cse.iitk.ac.in>
> > 
> > [8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions
> > 
> > From Documentation/atomic_ops.txt, those archs that require explicit
> > memory barriers around clear_bit() must also implement these two interfaces.
> > However, for i386, clear_bit() is a strict, locked, atomic and
> > un-reorderable operation and includes an implicit memory barrier already.
> > 
> > But these two functions have been wrongly defined as "barrier()" which is
> > a pointless _compiler optimization_ barrier, and only serves to make gcc
> > not do legitimate optimizations that it could have otherwise done.
> > 
> > So let's make these proper no-ops, because that's exactly what we require
> > these to be on the i386 platform.
> 
> No. clear_bit is not a compiler barrier on i386,

Obvious.

> thus smp_mb__before/after
> must be.

Not so obvious. Why do we require these to be a full compiler barrier
is precisely the question I raised here.

Consider this (the above two functions exist only for clear_bit(),
the atomic variant, as you already know), the _only_ memory reference
we care about is that of the address of the passed bit-string:

(1) The compiler must not optimize / elide it (i.e. we need to disallow
    compiler optimization for that reference) -- but we've already taken
    care of that with the __asm__ __volatile__ and the constraints on
    the memory "addr" operand there, and,
(2) For the i386, it also includes an implicit memory (CPU) barrier
    already.

So I /think/ it makes sense to let the compiler optimize _other_ memory
references across the call to clear_bit(). There's a difference. I think
we'd be safe even if we do this, because the synchronization in callers
must be based upon the _passed bit-string_, otherwise _they_ are the
ones who're buggy.

[ However, elsewhere Jeremy Fitzhardinge mentioned the case of
  some callers, for instance, doing a memset() on an alias of
  the same bit-string. But again, I think that is dodgy/buggy/
  extremely border-line usage on the caller's side itself ...
  *unless* the caller is doing that inside a higher-level lock
  anyway, in which case he wouldn't be needing to use the
  locked variants either ... ]

[ For those interested, I've been looking at the code generated
  for the test kernel I built with these patches, and I don't
  really see anything gcc did that it shouldn't have -- but ok,
  that doesn't mean other versions/toolchains for other setups
  won't. Also, the test box has been up all night, but I'm only
  running Firefox on it anyway, and don't really know how to
  verify if I've introduced any correctness issues / bugs. ]


Satyam

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions
  2007-07-24  7:34     ` Satyam Sharma
@ 2007-07-24  7:48       ` Jeremy Fitzhardinge
  2007-07-24  8:31         ` Nick Piggin
  2007-07-24  8:20       ` Nick Piggin
  1 sibling, 1 reply; 92+ messages in thread
From: Jeremy Fitzhardinge @ 2007-07-24  7:48 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Nick Piggin, Linux Kernel Mailing List, David Howells,
	Andi Kleen, Andrew Morton, Linus Torvalds

Satyam Sharma wrote:
> Consider this (the above two functions exist only for clear_bit(),
> the atomic variant, as you already know), the _only_ memory reference
> we care about is that of the address of the passed bit-string:
>
> (1) The compiler must not optimize / elide it (i.e. we need to disallow
>     compiler optimization for that reference) -- but we've already taken
>     care of that with the __asm__ __volatile__ and the constraints on
>     the memory "addr" operand there, and,
> (2) For the i386, it also includes an implicit memory (CPU) barrier
>     already.
>
> So I /think/ it makes sense to let the compiler optimize _other_ memory
> references across the call to clear_bit(). There's a difference. I think
> we'd be safe even if we do this, because the synchronization in callers
> must be based upon the _passed bit-string_, otherwise _they_ are the
> ones who're buggy.
>
> [ However, elsewhere Jeremy Fitzhardinge mentioned the case of
>   some callers, for instance, doing a memset() on an alias of
>   the same bit-string. But again, I think that is dodgy/buggy/
>   extremely border-line usage on the caller's side itself ...
>   *unless* the caller is doing that inside a higher-level lock
>   anyway, in which case he wouldn't be needing to use the
>   locked variants either ... ]
>   

You miss my point.  If you have:

	memset(&my_bitmask, 0, sizeof(my_bitmask));
	set_bit(44, &my_bitmask);

Then unless the set_bit has a constraint argument which covers the whole
of the (multiword) bitmask, the compiler may see fit to interleave the
memset writes with the set_bit in bad ways.  In other words, plain "+m"
(*(long *)ptr) won't cut it.  You'd need "+m" (my_bitmask), I think.
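[ A hypothetical way to spell that, sketched here only to make the point
  concrete -- the names and the fixed-size bitmask are made up, and the
  lock prefix is left out for brevity: ]

static unsigned long my_bitmask[2];	/* a 64-bit, multiword bitmask */

static inline void set_bit_in_my_bitmask(int nr)
{
	__asm__ __volatile__(
		"btsl %1,%0"
		/* constrain the whole array, so gcc cannot interleave or
		   defer earlier stores (e.g. a memset) to any word of it */
		: "+m" (my_bitmask)
		: "Ir" (nr));
}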

    J

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions
  2007-07-24  7:34     ` Satyam Sharma
  2007-07-24  7:48       ` Jeremy Fitzhardinge
@ 2007-07-24  8:20       ` Nick Piggin
  2007-07-24  9:21         ` Satyam Sharma
  1 sibling, 1 reply; 92+ messages in thread
From: Nick Piggin @ 2007-07-24  8:20 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Andi Kleen,
	Andrew Morton, Linus Torvalds

Satyam Sharma wrote:
> On Tue, 24 Jul 2007, Nick Piggin wrote:
> 
> 
>>Satyam Sharma wrote:
>>
>>>From: Satyam Sharma <ssatyam@cse.iitk.ac.in>
>>>
>>>[8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions
>>>
>>>
>>>>From Documentation/atomic_ops.txt, those archs that require explicit
>>>
>>>memory barriers around clear_bit() must also implement these two interfaces.
>>>However, for i386, clear_bit() is a strict, locked, atomic and
>>>un-reorderable operation and includes an implicit memory barrier already.
>>>
>>>But these two functions have been wrongly defined as "barrier()" which is
>>>a pointless _compiler optimization_ barrier, and only serves to make gcc
>>>not do legitimate optimizations that it could have otherwise done.
>>>
>>>So let's make these proper no-ops, because that's exactly what we require
>>>these to be on the i386 platform.
>>
>>No. clear_bit is not a compiler barrier on i386,
> 
> 
> Obvious.
> 
> 
>>thus smp_mb__before/after
>>must be.
> 
> 
> Not so obvious. Why do we require these to be a full compiler barrier
> is precisely the question I raised here.
> 
> Consider this (the above two functions exist only for clear_bit(),
> the atomic variant, as you already know), the _only_ memory reference
> we care about is that of the address of the passed bit-string:

No. Memory barriers explicitly extend to all memory references.


> (1) The compiler must not optimize / elide it (i.e. we need to disallow
>     compiler optimization for that reference) -- but we've already taken
>     care of that with the __asm__ __volatile__ and the constraints on
>     the memory "addr" operand there, and,
> (2) For the i386, it also includes an implicit memory (CPU) barrier
>     already.

Repeating what has been said before: A CPU memory barrier is not a
compiler barrier or vice versa. Seeing as we are talking about
the compiler barrier, it is irrelevant as to whether or not the
assembly includes a CPU barrier.


> So I /think/ it makes sense to let the compiler optimize _other_ memory
> references across the call to clear_bit(). There's a difference. I think
> we'd be safe even if we do this, because the synchronization in callers
> must be based upon the _passed bit-string_, otherwise _they_ are the
> ones who're buggy.

Yes it makes sense to let the compiler move memory operations over
clear_bit(), because we have defined the interface to be nice and
relaxed. And this is exactly why we do need to have an additional
barrier there in smp_mb__*_clear_bit().


> [ For those interested, I've been looking at the code generated
>   for the test kernel I built with these patches, and I don't
>   really see anything gcc did that it shouldn't have -- but ok,
>   that doesn't mean other versions/toolchains for other setups
>   won't. Also, the test box has been up all night, but I'm only
>   running Firefox on it anyway, and don't really know how to
>   verify if I've introduced any correctness issues / bugs. ]

correct output != correct input.

Without a barrier there, we _allow_ the compiler to reorder. If it
does not reorder, the missing barrier is still a bug :)

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24  7:24       ` Nick Piggin
@ 2007-07-24  8:29         ` Satyam Sharma
  2007-07-24  8:39           ` Nick Piggin
  2007-07-24  8:38         ` Trent Piepho
  1 sibling, 1 reply; 92+ messages in thread
From: Satyam Sharma @ 2007-07-24  8:29 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Kernel Mailing List, David Howells, Andi Kleen,
	Andrew Morton, Linus Torvalds, Jeremy Fitzhardinge

On Tue, 24 Jul 2007, Nick Piggin wrote:

> > > [...]
> > > 
> > > __test_and_change_bit is one that you could remove the memory clobber
> > > from.
> > 
> > Yes, for the atomic versions we don't care if we're asking gcc to
> > generate trashy code (even though I'd have wanted to only disallow
> > problematic optimizations -- ones involving the passed bit-string
> > address -- there, and allow other memory references to be optimized
> > as and how the compiler feels like it) because the atomic variants
> > are slow anyway and we probably want to be extra-safe there.
> > 
> > But for the non-atomic variants, it does make sense to remove the
> > memory clobber (and the unneeded __asm__ __volatile__ that another
> > patch did -- for the non-atomic variants, again).
> 
> No. It has nothing to do with atomicity and all to do with ordering.

The memory clobber, or the volatile asm? There's more than one variable
here ... but still, I don't think either affects _ordering_ in any
_direct_ way.


> For example test_bit, clear_bit, set_bit, etc are all atomic but none
> place any restrictions on ordering.

In that case we need to update comments in include/asm-i386/bitops.h


> __test_and_change_bit has no restriction on ordering, so as long as
> the correct operands are clobbered, a "memory" clobber to enforce a
> compiler barrier is not needed.

But why even for the other operations? Consider (current code of)
test_and_set_bit():

static inline int test_and_set_bit(int nr, volatile unsigned long * addr)
{
	int oldbit;
	__asm__ __volatile__( LOCK_PREFIX
		"btsl %2,%1\n\tsbbl %0,%0"
		:"=r" (oldbit),"+m" (ADDR)
		:"Ir" (nr) : "memory");

	return oldbit;
}

The only memory reference in there is to the passed address, it will
be modified, yes, but that's been made obvious to gcc in the asm
already. So why are we marking all of memory as clobbered, is the
question. (I just read Jeremy's latest reply, but I don't see how
or why the memory clobber helps that case either -- does a memory
clobber affect how gcc would order / reorder code?)


> > OTOH, as per Linus' review it seems we can drop the "memory" clobber
> > and specify the output operand for the extended asm as "+m". But I
> > must admit I didn't quite understand that at all.
> 
> Yes, but that is for cases where "memory" is being used to say that
> some otherwise unspecified memory is actually being changed, rather
> than to provide a compiler barrier as is the case for test_and_set_bit
> and co.

Again, as I said above.

> > [ I should probably start reading gcc sources, the docs are said to
> >   be insufficient/out-of-date, as per the reviews of the patches. ]
> 
> You should read Documentation/atomic_ops.txt and memory-barriers.txt,
> which are quite useful and should be uptodate.

I have, of course. Will probably read again, thanks.


Satyam

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions
  2007-07-24  7:48       ` Jeremy Fitzhardinge
@ 2007-07-24  8:31         ` Nick Piggin
  0 siblings, 0 replies; 92+ messages in thread
From: Nick Piggin @ 2007-07-24  8:31 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Satyam Sharma, Linux Kernel Mailing List, David Howells,
	Andi Kleen, Andrew Morton, Linus Torvalds

Jeremy Fitzhardinge wrote:
> Satyam Sharma wrote:
> 
>>Consider this (the above two functions exist only for clear_bit(),
>>the atomic variant, as you already know), the _only_ memory reference
>>we care about is that of the address of the passed bit-string:
>>
>>(1) The compiler must not optimize / elide it (i.e. we need to disallow
>>    compiler optimization for that reference) -- but we've already taken
>>    care of that with the __asm__ __volatile__ and the constraints on
>>    the memory "addr" operand there, and,
>>(2) For the i386, it also includes an implicit memory (CPU) barrier
>>    already.
>>
>>So I /think/ it makes sense to let the compiler optimize _other_ memory
>>references across the call to clear_bit(). There's a difference. I think
>>we'd be safe even if we do this, because the synchronization in callers
>>must be based upon the _passed bit-string_, otherwise _they_ are the
>>ones who're buggy.
>>
>>[ However, elsewhere Jeremy Fitzhardinge mentioned the case of
>>  some callers, for instance, doing a memset() on an alias of
>>  the same bit-string. But again, I think that is dodgy/buggy/
>>  extremely border-line usage on the caller's side itself ...
>>  *unless* the caller is doing that inside a higher-level lock
>>  anyway, in which case he wouldn't be needing to use the
>>  locked variants either ... ]
>>  
> 
> 
> You miss my point.  If you have:
> 
> 	memset(&my_bitmask, 0, sizeof(my_bitmask));
> 	set_bit(44, &my_bitmask);
> 
> Then unless the set_bit has a constraint argument which covers the whole
> of the (multiword) bitmask, the compiler may see fit to interleave the
> memset writes with the set_bit in bad ways.  In other words, plain "+m"
> (*(long *)ptr) won't cut it.  You'd need "+m" (my_bitmask), I think.

That's a valid point, and looks like it is a bug in the existing i386
macros, doesn't it? We should be clobbering addr + BITOP_WORD(nr).

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24  7:24       ` Nick Piggin
  2007-07-24  8:29         ` Satyam Sharma
@ 2007-07-24  8:38         ` Trent Piepho
  2007-07-24 19:39           ` Linus Torvalds
  1 sibling, 1 reply; 92+ messages in thread
From: Trent Piepho @ 2007-07-24  8:38 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Satyam Sharma, Linux Kernel Mailing List, David Howells,
	Andi Kleen, Andrew Morton, Linus Torvalds

On Tue, 24 Jul 2007, Nick Piggin wrote:
> Satyam Sharma wrote:
> > But for the non-atomic variants, it does make sense to remove the
> > memory clobber (and the unneeded __asm__ __volatile__ that another
> > patch did -- for the non-atomic variants, again).
>
> No. It has nothing to do with atomicity and all to do with ordering.
> For example test_bit, clear_bit, set_bit, etc are all atomic but none
> place any restrictions on ordering.
>
> __test_and_change_bit has no restriction on ordering, so as long as
> the correct operands are clobbered, a "memory" clobber to enforce a
> compiler barrier is not needed.

Speaking of that, why are all the asm functions in arch/i386/lib/string.c
defined as having a memory clobber, even those which don't modify memory
like strcmp, strchr, strlen and so on?

Why are they all volatile too?  Even those with no side effects.

It seems like any asm which writes to or _reads from_ memory that isn't one
of the operands must have a memory clobber.  And any asm with side effects
other than its operands (and asm("mov $0, (%0)" :  :  "r"(some_pointer))
has side effects) must be volatile.

So something like strlen() needs to have a memory clobber, otherwise a
store to the string could be moved to the other side of the strlen(), but
doesn't need to be volatile, since it has no side effects.
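[ For reference, this is roughly the code in question -- quoting the
  strlen() from arch/i386/lib/string.c from memory, so the exact text may
  differ: ]

size_t strlen(const char *s)
{
	int d0;
	register int __res;
	__asm__ __volatile__(
		"repne\n\t"
		"scasb\n\t"
		"notl %0\n\t"
		"decl %0"
		: "=c" (__res), "=&D" (d0)
		: "1" (s), "a" (0), "0" (0xffffffffu)
		/* the asm reads the bytes at *s, which are not named as an
		   operand anywhere above -- hence the "memory" clobber */
		: "memory");
	return __res;
}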

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24  8:29         ` Satyam Sharma
@ 2007-07-24  8:39           ` Nick Piggin
  0 siblings, 0 replies; 92+ messages in thread
From: Nick Piggin @ 2007-07-24  8:39 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Andi Kleen,
	Andrew Morton, Linus Torvalds, Jeremy Fitzhardinge

Satyam Sharma wrote:
> On Tue, 24 Jul 2007, Nick Piggin wrote:
> 
> 
>>>>[...]
>>>>
>>>>__test_and_change_bit is one that you could remove the memory clobber
>>>>from.
>>>
>>>Yes, for the atomic versions we don't care if we're asking gcc to
>>>generate trashy code (even though I'd have wanted to only disallow
>>>problematic optimizations -- ones involving the passed bit-string
>>>address -- there, and allow other memory references to be optimized
>>>as and how the compiler feels like it) because the atomic variants
>>>are slow anyway and we probably want to be extra-safe there.
>>>
>>>But for the non-atomic variants, it does make sense to remove the
>>>memory clobber (and the unneeded __asm__ __volatile__ that another
>>>patch did -- for the non-atomic variants, again).
>>
>>No. It has nothing to do with atomicity and all to do with ordering.
> 
> 
> The memory clobber, or the volatile asm? There's more than one variable
> here ... but still, I don't think either affects _ordering_ in any
> _direct_ way.

The clobber which you remove with this patch.


>>For example test_bit, clear_bit, set_bit, etc are all atomic but none
>>place any restrictions on ordering.
> 
> 
> In that case we need to update comments in include/asm-i386/bitops.h

Hmm... yeah it looks like they could be reordered. I think?


>>__test_and_change_bit has no restriction on ordering, so as long as
>>the correct operands are clobbered, a "memory" clobber to enforce a
>>compiler barrier is not needed.
> 
> 
> But why even for the other operations? Consider (current code of)
> test_and_set_bit():
> 
> static inline int test_and_set_bit(int nr, volatile unsigned long * addr)
> {
> 	int oldbit;
> 	__asm__ __volatile__( LOCK_PREFIX
> 		"btsl %2,%1\n\tsbbl %0,%0"
> 		:"=r" (oldbit),"+m" (ADDR)
> 		:"Ir" (nr) : "memory");
> 
> 	return oldbit;
> }
> 
> The only memory reference in there is to the passed address, it will
> be modified, yes, but that's been made obvious to gcc in the asm
> already. So why are we marking all of memory as clobbered, is the
> question. (I just read Jeremy's latest reply, but I don't see how
> or why the memory clobber helps that case either -- does a memory
> clobber affect how gcc would order / reorder code?)

Of course, because then the compiler can't assume anything about
the contents of memory after the operation.

   #define barrier() __asm__ __volatile__("": : :"memory")

A memory clobber is equivalent to a compiler barrier.
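[ A toy illustration, assuming two otherwise unrelated global variables: ]

extern int shared, ready;

void publish(void)
{
	shared = 42;
	/* with the "memory" clobber gcc must complete the store to
	   'shared' before the asm and may not hoist 'ready = 1' above it;
	   without the clobber it would be free to reorder either store */
	__asm__ __volatile__("" : : : "memory");	/* barrier() */
	ready = 1;
}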

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions
  2007-07-24  8:20       ` Nick Piggin
@ 2007-07-24  9:21         ` Satyam Sharma
  2007-07-24 10:25           ` Nick Piggin
  0 siblings, 1 reply; 92+ messages in thread
From: Satyam Sharma @ 2007-07-24  9:21 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Kernel Mailing List, David Howells, Andi Kleen,
	Andrew Morton, Linus Torvalds

On Tue, 24 Jul 2007, Nick Piggin wrote:

> Satyam Sharma wrote:
> > On Tue, 24 Jul 2007, Nick Piggin wrote:
> > > Satyam Sharma wrote:
> > > [...]
> > > > So let's make these proper no-ops, because that's exactly what we
> > > > require
> > > > these to be on the i386 platform.
> > > 
> > > No. clear_bit is not a compiler barrier on i386,
> > 
> > Obvious.
> > 
> > > thus smp_mb__before/after
> > > must be.

> > Not so obvious. Why do we require these to be a full compiler barrier
> > is precisely the question I raised here.

> > Consider this (the above two functions exist only for clear_bit(),
> > the atomic variant, as you already know), the _only_ memory reference
> > we care about is that of the address of the passed bit-string:
> 
> No. Memory barriers explicitly extend to all memory references.

[ Compiler barrier, you mean, that's not true of CPU barriers. ]

In any case, I know that, obviously. I asked "why" not "what" :-) i.e.
why should we care about other addresses / why do we want to extend
the compiler barrier to all memory references -- but Jeremy seems to
have answered that ...

> > (1) The compiler must not optimize / elide it (i.e. we need to disallow
> >     compiler optimization for that reference) -- but we've already taken
> >     care of that with the __asm__ __volatile__ and the constraints on
> >     the memory "addr" operand there, and,
> > (2) For the i386, it also includes an implicit memory (CPU) barrier
> >     already.
> 
> Repeating what has been said before: A CPU memory barrier is not a
> compiler barrier or vice versa. Seeing as we are talking about
> the compiler barrier, it is irrelevant as to whether or not the
> assembly includes a CPU barrier.

I think it is quite relevant, in fact. From Documentation/atomic_ops.txt,
smp_mb__{before,after}_clear_bit(), as the name itself suggests, must
be _CPU barriers_ for those arch's that don't have an implicit
_CPU barrier_ in the clear_bit() itself [ which i386 does have already ].

As for a compiler barrier, the asm there already guarantees the compiler
will not optimize references to _that_ address, but there could still be
the memset()/set{clear}_bit() interspersing pitfalls for example, so yeah
the memory clobber would probably protect us there.


Satyam

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 3/8] i386: bitops: Rectify bogus "+m" constraints
  2007-07-23 16:05 ` [PATCH 3/8] i386: bitops: Rectify bogus "+m" constraints Satyam Sharma
  2007-07-23 16:37   ` Andi Kleen
  2007-07-23 17:46   ` Linus Torvalds
@ 2007-07-24  9:22   ` David Howells
  2 siblings, 0 replies; 92+ messages in thread
From: David Howells @ 2007-07-24  9:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Satyam Sharma, Linux Kernel Mailing List, Nick Piggin,
	Andi Kleen, Andrew Morton

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > So, using the "+" constraint modifier for memory, like "+m" is bogus.
> > We must simply specify "=m" which handles the case correctly.
> 
> No. This is wrong.

Agreed.

> "=m" means that the new value of the memory location is *set*.
> 
> Which means that gcc will potentially optimize away any previous stores to 
> that memory location.
> 
> And yes, it happens, and yes, we've seen it.

I had a lot of "fun" with this when mucking around with the asm-optimised R/W
semaphores.

David

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24  3:57   ` Nick Piggin
  2007-07-24  6:38     ` Satyam Sharma
@ 2007-07-24  9:44     ` David Howells
  2007-07-24 10:02       ` Satyam Sharma
  1 sibling, 1 reply; 92+ messages in thread
From: David Howells @ 2007-07-24  9:44 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Nick Piggin, Linux Kernel Mailing List, Andi Kleen,
	Andrew Morton, Linus Torvalds

Satyam Sharma <ssatyam@cse.iitk.ac.in> wrote:

> OTOH, as per Linus' review it seems we can drop the "memory" clobber
> and specify the output operand for the extended asm as "+m". But I
> must admit I didn't quite understand that at all.

As I understand it, the "+m" indicates to the compiler a restriction on the
ordering of things that access that particular memory location, whereas a
"memory" indicates a restriction on the orderings of all accesses to memory -
precisely what you need to produce a lock.

There are a number of things that use test_and_set_bit() and co to implement a
lock or other synchronisation.  This means that they must exhibit LOCK-class
barrier effects or better.  LOCK-class barrier effects mean, more or less,
that all memory accesses issued before the lock must happen before all memory
accesses issued after the lock.  But it must happen at both CPU-level and
compiler-level.  The "memory" constraint instructs the compiler in this
regard.

Remember also that this is gcc black magic and so some of it has had to be
worked out empirically - possibly after sacrificing a goat under a full moon.

David

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 4/8] i386: bitops: Kill volatile-casting of memory addresses
  2007-07-23 17:52   ` Linus Torvalds
  2007-07-24  4:19     ` Nick Piggin
@ 2007-07-24  9:49     ` Benjamin Herrenschmidt
  2007-07-24 17:20       ` Linus Torvalds
  1 sibling, 1 reply; 92+ messages in thread
From: Benjamin Herrenschmidt @ 2007-07-24  9:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Satyam Sharma, Linux Kernel Mailing List, David Howells,
	Nick Piggin, Andi Kleen, Andrew Morton


> The "const volatile" is so that you can pass an arbitrary pointer. The 
> only kind of arbitrary pointer is "const volatile".
> 
> In other words, the "volatile" has nothing at all to do with whether the 
> memory is volatile or not (the same way "const" has nothing to do with it: 
> it's purely a C type *safety* issue, exactly the same way "const" is a 
> type safety issue.)
> 
> A "const" on a pointer doesn't mean that the thing it points to cannot 
> change. When you pass a source pointer to "strlen()", it doesn't have to 
> be constant. But "strlen()" takes a "const" pointer, because it works on
> constant pointers *too*.

However... What about that:

 - This "volatile" will allow to pass pointers to volatile data to the
bitops.

 - Most users of "volatile" in the kernel (except maybe jiffies) are
bogus

 - Thus let's remove it -as a type safety thing- to catch more of those
stupid volatiles that shouldn't be there? :-)

Besides, as Nick pointed out, it prevents some valid optimizations.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-23 17:55   ` Linus Torvalds
@ 2007-07-24  9:52     ` Benjamin Herrenschmidt
  2007-07-24 17:24       ` Linus Torvalds
  0 siblings, 1 reply; 92+ messages in thread
From: Benjamin Herrenschmidt @ 2007-07-24  9:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Satyam Sharma, Linux Kernel Mailing List, David Howells,
	Nick Piggin, Andi Kleen, Andrew Morton

On Mon, 2007-07-23 at 10:55 -0700, Linus Torvalds wrote:
> > The goal is to let gcc generate good, beautiful, optimized code.
> 
> No. The point is to let gcc generate *correct* code.
> 
> It's either "=m" together with "memory", or it's "+m".

In fact, it's more than that... the bitops that return a value are often
used to have hand-made spinlock semantics. I'm sure we would get funky
bugs if loads or stores leaked out of the locked region. I think a full
"memory" clobber should be kept around for those cases.

(That's also why on ppc, we give them a few more barriers)

Ben.


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24  9:44     ` David Howells
@ 2007-07-24 10:02       ` Satyam Sharma
  0 siblings, 0 replies; 92+ messages in thread
From: Satyam Sharma @ 2007-07-24 10:02 UTC (permalink / raw)
  To: David Howells
  Cc: Nick Piggin, Linux Kernel Mailing List, Andi Kleen,
	Andrew Morton, Linus Torvalds

Hi David,

On Tue, 24 Jul 2007, David Howells wrote:

> Satyam Sharma <ssatyam@cse.iitk.ac.in> wrote:
> 
> > OTOH, as per Linus' review it seems we can drop the "memory" clobber
> > and specify the output operand for the extended asm as "+m". But I
> > must admit I didn't quite understand that at all.
> 
> As I understand it, the "+m" indicates to the compiler a restriction on the
> ordering of things that access that particular memory location, whereas a
> "memory" indicates a restriction on the orderings of all accesses to memory -
> precisely what you need to produce a lock.

Ok, thanks -- I didn't know gcc's behaviour w.r.t. "+m" at all, but in my
defense I'd add all this was quite poorly/wrongly documented in the docs.

> There are a number of things that use test_and_set_bit() and co to implement a
> lock or other synchronisation.  This means that they must exhibit LOCK-class
> barrier effects or better.  LOCK-class barrier effects mean, more or less,
> that all memory accesses issued before the lock must happen before all memory
> accesses issued after the lock.  But it most happen at both CPU-level and
> compiler-level.  The "memory" constraint instructs the compiler in this
> regard.

Yes, thanks for laying this out so clearly, again. So combined with what
you explained above, I think I now fully understand why most of this
series was incorrect ...

> Remember also that this is gcc black magic and so some of it has had to be
> worked out empirically - possibly after sacrificing a goat under a full moon.

:-)


Satyam

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions
  2007-07-24  9:21         ` Satyam Sharma
@ 2007-07-24 10:25           ` Nick Piggin
  2007-07-24 11:10             ` Satyam Sharma
  0 siblings, 1 reply; 92+ messages in thread
From: Nick Piggin @ 2007-07-24 10:25 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Andi Kleen,
	Andrew Morton, Linus Torvalds

--- Satyam Sharma <ssatyam@cse.iitk.ac.in> wrote:

> On Tue, 24 Jul 2007, Nick Piggin wrote:
> 
> > Satyam Sharma wrote:

> > > Consider this (the above two functions exist only for clear_bit(),
> > > the atomic variant, as you already know), the _only_ memory reference
> > > we care about is that of the address of the passed bit-string:
> > 
> > No. Memory barriers explicitly extend to all memory references.
> 
> [ Compiler barrier, you mean, that's not true of CPU barriers. ]

For the purpose of this discussion (Linux memory
barrier semantics, on WB memory), it is true of CPU
and compiler barriers.

 
> In any case, I know that, obviously. I asked "why"
> not "what" :-) i.e.
> why should we care about other addresses / why do we
> want to extend
> the compiler barrier to all memory references -- but
> Jeremy seems to
> have answered that ...

Obviously because we want some kind of ordering
guarantee at a given point. All the CPU barriers
in the world are useless if the compiler can reorder
access over them.


> > Repeating what has been said before: A CPU memory
> barrier is not a
> > compiler barrier or vice versa. Seeing as we are
> talking about
> > the compiler barrier, it is irrelevant as to
> whether or not the
> > assembly includes a CPU barrier.
> 
> I think it is quite relevant, in fact. From
> Documentation/atomic_ops.txt,
> smp_mb__{before,after}_clear_bit(), as the name
> itself suggests, must
> be _CPU barriers_ for those arch's that don't have
> an implicit
> _CPU barrier_ in the clear_bit() itself [ which i386
> does have already ].
> 
> As for a compiler barrier, the asm there already
> guarantees the compiler
> will not optimize references to _that_ address

One or both of us still fails to understand the other.

bit_spin_lock(LOCK_NR, &word);
var++;
/* this is bit_spin_unlock(LOCK_NR, &word); */
smp_mb__before_clear_bit();
clear_bit(LOCK_NR, &word);

Are you saying that it is OK for the store to var to
be reordered below the clear_bit? If not, what are you
saying?



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions
  2007-07-24 10:25           ` Nick Piggin
@ 2007-07-24 11:10             ` Satyam Sharma
  2007-07-24 11:32               ` Nick Piggin
  0 siblings, 1 reply; 92+ messages in thread
From: Satyam Sharma @ 2007-07-24 11:10 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Kernel Mailing List, David Howells, Andi Kleen,
	Andrew Morton, Linus Torvalds

On Tue, 24 Jul 2007, Nick Piggin wrote:

> > > Satyam Sharma wrote:
> 
> > > > Consider this (the above two functions exist only for clear_bit(),
> > > > the atomic variant, as you already know), the _only_ memory reference
> > > > we care about is that of the address of the passed bit-string:
> > > 
> > > No. Memory barriers explicitly extend to all memory references.
> > 
> > [ Compiler barrier, you mean, that's not true of CPU barriers. ]
> 
> For the purpose of this discussion (Linux memory
> barrier semantics, on WB memory), it is true of CPU
> and compiler barriers.

On later Intel processors, if the memory address range being referenced
(and say written to) by the (locked) instruction is in the cache of a
CPU, then it will not assert the LOCK signal on the system bus --
thus not assume exclusive use of shared memory. So other CPUs are free
to modify (other) memory at that point. Cache coherency will still
ensure _that_ (locked) memory area is still updated atomically, though.

> Obviously because we want some kind of ordering
> guarantee at a given point. All the CPU barriers
> in the world are useless if the compiler can reorder
> access over them.

Yes ...

> > As for a compiler barrier, the asm there already
> > guarantees the compiler
> > will not optimize references to _that_ address
> 
> One or both of us still fails to understand the other.

... I think the culprit is me ...

> bit_spin_lock(LOCK_NR, &word);
> var++;
> /* this is bit_spin_unlock(LOCK_NR, &word); */
> smp_mb__before_clear_bit();
> clear_bit(LOCK_NR, &word);

Yup, David has laid this out clearly, as well.

> Are you saying that it is OK for the store to var to
> be reordered below the clear_bit? If not, what are you
> saying?

I might be making a radical turn-around here, but all of a
sudden I think it's actually a good idea to put a complete
memory clobber in set_bit/clear_bit and friends themselves,
and leave the "__" variants as they are.


Satyam

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions
  2007-07-24 11:10             ` Satyam Sharma
@ 2007-07-24 11:32               ` Nick Piggin
  2007-07-24 11:45                 ` Satyam Sharma
  0 siblings, 1 reply; 92+ messages in thread
From: Nick Piggin @ 2007-07-24 11:32 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Andi Kleen,
	Andrew Morton, Linus Torvalds

Satyam Sharma wrote:
> On Tue, 24 Jul 2007, Nick Piggin wrote:
> 

>>For the purpose of this discussion (Linux memory
>>barrier semantics, on WB memory), it is true of CPU
>>and compiler barriers.
> 
> 
> On later Intel processors, if the memory address range being referenced
> (and say written to) by the (locked) instruction is in the cache of a
> CPU, then it will not assert the LOCK signal on the system bus --
> thus not assume exclusive use of shared memory. So other CPUs are free
> to modify (other) memory at that point. Cache coherency will still
> ensure _that_ (locked) memory area is still updated atomically, though.

The system bus does not need to be serialised because the CPU already
holds the cacheline in exclusive state. That *is* the cache coherency
protocol.

The memory ordering is enforced by the CPU effectively preventing
speculative loads to pass the locked instruction and ensuring all
stores reach the cache before it is executed. (I say effectively
because the CPU might do clever tricks without you knowing).


>>Are you saying that it is OK for the store to var to
>>be reordered below the clear_bit? If not, what are you
>>saying?
> 
> 
> I might be making a radical turn-around here, but all of a
> sudden I think it's actually a good idea to put a complete
> memory clobber in set_bit/clear_bit and friends themselves,
> and leave the "__" variants as they are.

Why?

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions
  2007-07-24 11:32               ` Nick Piggin
@ 2007-07-24 11:45                 ` Satyam Sharma
  2007-07-24 12:01                   ` Nick Piggin
  2007-07-24 17:12                   ` Linus Torvalds
  0 siblings, 2 replies; 92+ messages in thread
From: Satyam Sharma @ 2007-07-24 11:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linux Kernel Mailing List, David Howells, Andi Kleen,
	Andrew Morton, Linus Torvalds

On Tue, 24 Jul 2007, Nick Piggin wrote:

> > > For the purpose of this discussion (Linux memory
> > > barrier semantics, on WB memory), it is true of CPU
> > > and compiler barriers.
> > 
> > On later Intel processors, if the memory address range being referenced
> > (and say written to) by the (locked) instruction is in the cache of a
> > CPU, then it will not assert the LOCK signal on the system bus --
> > thus not assume exclusive use of shared memory. So other CPUs are free
> > to modify (other) memory at that point. Cache coherency will still
> > ensure _that_ (locked) memory area is still updated atomically, though.
> 
> The system bus does not need to be serialised because the CPU already
> holds the cacheline in exclusive state. That *is* the cache coherency
> protocol.
> 
> The memory ordering is enforced by the CPU effectively preventing
> speculative loads to pass the locked instruction and ensuring all
> stores reach the cache before it is executed. (I say effectively
> because the CPU might do clever tricks without you knowing).

Looks like when you said "CPU memory barrier extends to all memory
references" you were probably referring to a _given_ CPU ... yes,
that statement is correct in that case.

> > > Are you saying that it is OK for the store to var to
> > > be reordered below the clear_bit? If not, what are you
> > > saying?
> > 
> > 
> > I might be making a radical turn-around here, but all of a
> > sudden I think it's actually a good idea to put a complete
> > memory clobber in set_bit/clear_bit and friends themselves,
> > and leave the "__" variants as they are.
> 
> Why?

Well, why not. Callers who don't want/need any guarantees whatsoever
can still use __foo() -- for others, it makes sense to just use
foo() and get *both* the compiler and CPU barrier semantics -- I think
that's the behaviour most callers would want anyway.


Satyam

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions
  2007-07-24 11:45                 ` Satyam Sharma
@ 2007-07-24 12:01                   ` Nick Piggin
  2007-07-24 17:12                   ` Linus Torvalds
  1 sibling, 0 replies; 92+ messages in thread
From: Nick Piggin @ 2007-07-24 12:01 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Andi Kleen,
	Andrew Morton, Linus Torvalds

Satyam Sharma wrote:
> On Tue, 24 Jul 2007, Nick Piggin wrote:

>>>>Are you saying that it is OK for the store to var to
>>>>be reordered below the clear_bit? If not, what are you
>>>>saying?
>>>
>>>
>>>I might be making a radical turn-around here, but all of a
>>>sudden I think it's actually a good idea to put a complete
>>>memory clobber in set_bit/clear_bit and friends themselves,
>>>and leave the "__" variants as they are.
>>
>>Why?
> 
> 
> Well, why not. Callers who don't want/need any guarantees whatsoever
> can still use __foo() -- for others, it makes sense to just use
> foo() and get *both* the compiler and CPU barrier semantics -- I think
> that's the behaviour most callers would want anyway.

Firstly, __foo() is non-atomic, secondly set_bit/clear_bit/etc do not
provide any CPU or compiler semantics.

It was set up this way because other CPU ISAs don't couple atomicity
with memory ordering, and with many implementations of those, the cost
to order memory is quite high.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions
  2007-07-24 11:45                 ` Satyam Sharma
  2007-07-24 12:01                   ` Nick Piggin
@ 2007-07-24 17:12                   ` Linus Torvalds
  2007-07-24 19:01                     ` Satyam Sharma
  1 sibling, 1 reply; 92+ messages in thread
From: Linus Torvalds @ 2007-07-24 17:12 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Nick Piggin, Linux Kernel Mailing List, David Howells,
	Andi Kleen, Andrew Morton



On Tue, 24 Jul 2007, Satyam Sharma wrote:
> 
> Looks like when you said "CPU memory barrier extends to all memory
> references" you were probably referring to a _given_ CPU ... yes,
> that statement is correct in that case.

No. CPU memory barriers extend to all CPU's. End of discussion.

It's not about "that cacheline". The whole *point* of a CPU memory barrier 
is that it's about independent memory accesses.

Yes, for a memory barrier to be effective, all CPU's involved in the 
transaction have to have the barriers - the same way a lock needs to be 
taken by everybody in order for it to make sense - but the point is, CPU 
barriers are about *global* behaviour, not local ones.

So there's a *huge* difference between

	clear_bit(x,y);

and

	clear_bit(x,y);
	smp_mb__before_after_clear_bit();

and it has absolutely nothing to do with the particular cacheline that "y" 
is in, it's about the *global* memory ordering.

Any write you do after that "smp_mb__before_after_clear_bit()" will be 
guaranteed to be visible to _other_ CPU's *after* they have seen the bit 
being cleared. Yes, those other CPU's need to have a read barrier between 
reading the bit and reading some other thing, but the point is, this has 
*nothing* to do with cache coherency, and the particular cache line that 
"y" is in.

And no, "smp_mb__before/after_clear_bit()" must *not* be just an empty "do 
{} while (0)". It needs to be a compiler barrier even when it has no 
actual CPU meaning, unless clear_bit() itself is guaranteed to be a 
compiler barrier (which it isn't, although the "volatile" on the asm in 
practice makes it something *close* to that).
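[ For reference, the definitions under discussion -- the ones patch 8/8
  wanted to turn into no-ops -- are currently, roughly:

	#define smp_mb__before_clear_bit()	barrier()
	#define smp_mb__after_clear_bit()	barrier()
]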

Why? Think of the sequence like this:

	clear_bit(x,y);
	smp_mb__after_clear_bit();
	other_variable = 10;

the whole *point* of this sequence is that if another CPU does

	x = other_variable;
	smp_rmb();
	bit = test_bit(x,y)

then if it sees "x" being 10, then the bit *has* to be clear.

And this is why the compiler barrier in "smp_mb__after_clear_bit()" needs 
to be a compiler barrier:

 - it doesn't matter for the action of the "clear_bit()" itself: that one 
   is locked, and on x86 it thus also happens to be a serializing 
   instruction, and the cache coherency and lock obviously means that the 
   bit clearing *itself* is safe!

 - but it *does* matter for the compiler scheduling. If the compiler were 
   to decide that "y" and "other_variable" are totally independent, it 
   might otherwise decide to move the "other_variable = 10" assignment to 
   *before* the clear_bit(), which would make the whole code pointless!

See? We have two totally independent issues:

 - the CPU itself can re-order the visibility of accesses. x86 doesn't do 
   this very much, and doesn't do it at all across a locked instruction, 
   but it's still a real issue, even if it tends to be much easier to see 
   on other architectures.

 - the compiler doesn't care about rules of "locked instruction" at all, 
   because it has no clue. It has *different* rules about how it can 
   re-order instructions and accesses, and maybe the "asm volatile" will 
   guarantee that the compiler won't re-order things around the 
   clear_bit(), and maybe it won't. But making it a compiler barrier (by 
   using the "memory clobber" thing, *guarantees* that gcc cannot reorder 
   memory writes or reads.

See? Two different - and _totally_ independent - levels of ordering, and 
we need to make sure that both are valid.

		Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 4/8] i386: bitops: Kill volatile-casting of memory addresses
  2007-07-24  9:49     ` Benjamin Herrenschmidt
@ 2007-07-24 17:20       ` Linus Torvalds
  2007-07-24 17:39         ` Jeff Garzik
  2007-07-25  4:54         ` Nick Piggin
  0 siblings, 2 replies; 92+ messages in thread
From: Linus Torvalds @ 2007-07-24 17:20 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Satyam Sharma, Linux Kernel Mailing List, David Howells,
	Nick Piggin, Andi Kleen, Andrew Morton



On Tue, 24 Jul 2007, Benjamin Herrenschmidt wrote:
> 
>  - This "volatile" will allow to pass pointers to volatile data to the
> bitops.
> 
>  - Most users of "volatile" in the kenrel (except maybe jiffies) are
> bogus
> 
>  - Thus let's remove it -as a type safety thing- to catch more of those
> stupid volatile that shouldn't be ? :-)

Quite frankly, I'd really really rather kill the "volatile" data
structures independently. Once that is done, it doesn't really matter any
more.

> Besides, as Nick pointed out, it prevents some valid optimizations.

No it doesn't. Not the ones on the functions that just do an inline asm.

The only valid optimization it might break is for "constant_test_bit()", 
which isn't even using inline asm.

And before we remove that "volatile", we'd better make damn sure that 
there isn't any driver that does

	/* Wait for the command queue to be cleared by DMA */
	while (test_bit(...))
		;

or similar.

Yes, it's annoying, but this is a scary and subtle area. And we sadly 
_have_ had code that does things like that.

		Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24  9:52     ` Benjamin Herrenschmidt
@ 2007-07-24 17:24       ` Linus Torvalds
  2007-07-24 17:42         ` Trond Myklebust
  2007-07-24 21:36         ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 92+ messages in thread
From: Linus Torvalds @ 2007-07-24 17:24 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Satyam Sharma, Linux Kernel Mailing List, David Howells,
	Nick Piggin, Andi Kleen, Andrew Morton



On Tue, 24 Jul 2007, Benjamin Herrenschmidt wrote:
> 
> In fact, it's more than that... the bitops that return a value are often
> used to have hand-made spinlock semantics. I'm sure we would get funky
> bugs if loads or stores leaked out of the locked region. I think a full
> "memory" clobber should be kept around for those cases.

Not helpful.

The CPU ordering constraints for "test_and_set_bit()" and friends are weak 
enough that even if you have a full memory clobber, it still wouldn't work 
as a lock.

That's exactly why we have smp_mb__after_set_bit() and friends. On some 
architectures (arm, mips, parisc, powerpc), *that* is where the CPU memory 
barrier is, because the "test_and_set_bit()" itself is just a 
cache-coherent operation, not an actual barrier.

		Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 4/8] i386: bitops: Kill volatile-casting of memory addresses
  2007-07-24 17:20       ` Linus Torvalds
@ 2007-07-24 17:39         ` Jeff Garzik
  2007-07-25  4:54         ` Nick Piggin
  1 sibling, 0 replies; 92+ messages in thread
From: Jeff Garzik @ 2007-07-24 17:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Satyam Sharma, Linux Kernel Mailing List,
	David Howells, Nick Piggin, Andi Kleen, Andrew Morton

Linus Torvalds wrote:
> And before we remove that "volatile", we'd better make damn sure that 
> there isn't any driver that does
> 
> 	/* Wait for the command queue to be cleared by DMA */
> 	while (test_bit(...))
> 		;
> 
> or similar.
> 
> Yes, it's annoying, but this is a scary and subtle area. And we sadly 
> _have_ had code that does things like that.


I certainly cannot speak for all the grotty drivers out there, but I've 
never ever seen anything like the above.  I would consider anyone using 
kernel bit operations on DMA memory to be more than a little crazy.  But 
that's just me :)

Usually you will see

	while (1) {
		rmb();
		if (software_index == hardware_index_in_DMA_memory)
			break;
		
		... handle a unit of work ...
	}

Though ISTR being told that even rmb() was not sufficient in all cases 
[nonetheless, that's what I use in net and SATA drivers and email 
recommendations, and people seem happy]

	Jeff



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24 17:24       ` Linus Torvalds
@ 2007-07-24 17:42         ` Trond Myklebust
  2007-07-24 18:13           ` Linus Torvalds
  2007-07-24 21:36         ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 92+ messages in thread
From: Trond Myklebust @ 2007-07-24 17:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Satyam Sharma, Linux Kernel Mailing List,
	David Howells, Nick Piggin, Andi Kleen, Andrew Morton

On Tue, 2007-07-24 at 10:24 -0700, Linus Torvalds wrote:
> 
> On Tue, 24 Jul 2007, Benjamin Herrenschmidt wrote:
> > 
> > In fact, it's more than that... the bitops that return a value are often
> > used to have hand-made spinlock semantics. I'm sure we would get funky
> > bugs if loads or stores leaked out of the locked region. I think a full
> > "memory" clobber should be kept around for those cases.
> 
> Not helpful.
> 
> The CPU ordering constraints for "test_and_set_bit()" and friends are weak 
> enough that even if you have a full memory clobber, it still wouldn't work 
> as a lock.
> 
> That's exactly why we have smp_mb__after_set_bit() and friends. On some 
> architectures (arm, mips, parisc, powerpc), *that* is where the CPU memory 
> barrier is, because the "test_and_set_bit()" itself is just a 
> cache-coherent operation, not an actual barrier.

That's not what the Documentation/memory-barriers.txt states:

        Any atomic operation that modifies some state in memory and returns information
        about the state (old or new) implies an SMP-conditional general memory barrier
        (smp_mb()) on each side of the actual operation.  These include:
        
        .....
        
        test_and_set_bit();
        test_and_clear_bit();
        test_and_change_bit();
...

Trond


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24 17:42         ` Trond Myklebust
@ 2007-07-24 18:13           ` Linus Torvalds
  2007-07-24 18:28             ` Trond Myklebust
  2007-07-24 21:37             ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 92+ messages in thread
From: Linus Torvalds @ 2007-07-24 18:13 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Benjamin Herrenschmidt, Satyam Sharma, Linux Kernel Mailing List,
	David Howells, Nick Piggin, Andi Kleen, Andrew Morton



On Tue, 24 Jul 2007, Trond Myklebust wrote:
> 
> That's not what the Documentation/memory-barriers.txt states:

Hmm.. You're right. We only actually need it for the unconditional bitops, 
like the *unlock* side.

IOW, if you do a spinlock with the bitops, the locking side should be able 
to use a "test_and_set_bit()" on its own, but the unlocking side should be

	smp_mb__before_clear_bit();
	clear_bit();

because the ones that don't return a value also don't imply a memory 
barrier.

		Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24 18:13           ` Linus Torvalds
@ 2007-07-24 18:28             ` Trond Myklebust
  2007-07-24 21:37             ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 92+ messages in thread
From: Trond Myklebust @ 2007-07-24 18:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Satyam Sharma, Linux Kernel Mailing List,
	David Howells, Nick Piggin, Andi Kleen, Andrew Morton

On Tue, 2007-07-24 at 11:13 -0700, Linus Torvalds wrote:
> 
> On Tue, 24 Jul 2007, Trond Myklebust wrote:
> > 
> > That's not what the Documentation/memory-barriers.txt states:
> 
> Hmm.. You're right. We only actually need it for the unconditional bitops, 
> like the *unlock* side.
> 
> IOW, if you do a spinlock with the bitops, the locking side should be able 
> to use a "test_and_set_bit()" on its own, but the unlocking side should be
> 
> 	smp_mb__before_clear_bit();
> 	clear_bit();
> 
> because the ones that don't return a value also don't imply a memory 
> barrier.

Yup, and this is exactly what we currently do in bit_spin_unlock().
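[ Which, quoting include/linux/bit_spinlock.h from memory -- so the details
  may differ slightly -- looks something like: ]

static inline void bit_spin_unlock(int bitnum, unsigned long *addr)
{
#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
	BUG_ON(!test_bit(bitnum, addr));
	smp_mb__before_clear_bit();	/* order the critical section ... */
	clear_bit(bitnum, addr);	/* ... before the unlocking clear  */
#endif
	preempt_enable();
	__release(bitlock);
}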

Trond


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions
  2007-07-24 17:12                   ` Linus Torvalds
@ 2007-07-24 19:01                     ` Satyam Sharma
  0 siblings, 0 replies; 92+ messages in thread
From: Satyam Sharma @ 2007-07-24 19:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Linux Kernel Mailing List, David Howells,
	Andi Kleen, Andrew Morton


On Tue, 24 Jul 2007, Linus Torvalds wrote:

> On Tue, 24 Jul 2007, Satyam Sharma wrote:
> > 
> > Looks like when you said "CPU memory barrier extends to all memory
> > references" you were probably referring to a _given_ CPU ... yes,
> > that statement is correct in that case.
> 
> No. CPU memory barriers extend to all CPU's. End of discussion.
> 
> It's not about "that cacheline". The whole *point* of a CPU memory barrier 
> is that it's about independent memory accesses.
> 
> Yes, for a memory barrier to be effective, all CPU's involved in the 
> transaction have to have the barriers - the same way a lock needs to be 
> taken by everybody in order for it to make sense - but the point is, CPU 
> barriers are about *global* behaviour, not local ones.
> 
> So there's a *huge* difference between
> 
> 	clear_bit(x,y);
> 
> and
> 
> 	clear_bit(x,y);
> 	smp_mb__before_after_clear_bit();
> 
> and it has absolutely nothing to do with the particular cacheline that "y" 
> is in, it's about the *global* memory ordering.
> 
> Any write you do after that "smp_mb__before_after_clear_bit()" will be 
> guaranteed to be visible to _other_ CPU's *after* they have seen the bit 
> being cleared. Yes, those other CPU's need to have a read barrier between 
> reading the bit and reading some other thing, but the point is, this has 
> *nothing* to do with cache coherency, and the particular cache line that 
> "y" is in.
> 
> And no, "smp_mb__before/after_clear_bit()" must *not* be just an empty "do 
> {} while (0)". It needs to be a compiler barrier even when it has no 
> actual CPU meaning, unless clear_bit() itself is guaranteed to be a 
> compiler barrier (which it isn't, although the "volatile" on the asm in 
> practice makes it something *close* to that).
> 
> Why? Think of the sequence like this:
> 
> 	clear_bit(x,y);
> 	smp_mb__after_clear_bit();
> 	other_variable = 10;
> 
> the whole *point* of this sequence is that if another CPU does
> 
> 	x = other_variable;
> 	smp_rmb();
> 	bit = test_bit(x,y)
> 
> then if it sees "x" being 10, then the bit *has* to be clear.
> 
> And this is why the compiler barrier in "smp_mb__after_clear_bit()" needs 
> to be a compiler barrier:
> 
>  - it doesn't matter for the action of the "clear_bit()" itself: that one 
>    is locked, and on x86 it thus also happens to be a serializing 
>    instruction, and the cache coherency and lock obviously means that the 
>    bit clearing *itself* is safe!
> 
>  - but it *does* matter for the compiler scheduling. If the compiler were 
>    to decide that "y" and "other_variable" are totally independent, it 
>    might otherwise decide to move the "other_variable = 10" assignment to 
>    *before* the clear_bit(), which would make the whole code pointless!
> 
> See? We have two totally independent issues:
> 
>  - the CPU itself can re-order the visibility of accesses. x86 doesn't do 
>    this very much, and doesn't do it at all across a locked instruction, 
>    but it's still a real issue, even if it tends to be much easier to see 
>    on other architectures.
> 
>  - the compiler doesn't care about rules of "locked instruction" at all, 
>    because it has no clue. It has *different* rules about how it can 
>    re-order instructions and accesses, and maybe the "asm volatile" will 
>    guarantee that the compiler won't re-order things around the 
>    clear_bit(), and maybe it won't. But making it a compiler barrier (by 
>    using the "memory clobber" thing) *guarantees* that gcc cannot reorder 
>    memory writes or reads.
> 
> See? Two different - and _totally_ independent - levels of ordering, and 
> we need to make sure that both are valid.

Ok, thanks much (David and Nick too!) for taking the time to explain this
out clearly. It does look amazingly obvious now that I see it, with
callers using bitops to implement locking schemes who would completely
break otherwise -- in fact 6 and 8 in this series look amazingly obtuse
now ...


Satyam

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24  8:38         ` Trent Piepho
@ 2007-07-24 19:39           ` Linus Torvalds
  2007-07-24 20:37             ` Andi Kleen
  2007-07-26  1:07             ` Trent Piepho
  0 siblings, 2 replies; 92+ messages in thread
From: Linus Torvalds @ 2007-07-24 19:39 UTC (permalink / raw)
  To: Trent Piepho
  Cc: Nick Piggin, Satyam Sharma, Linux Kernel Mailing List,
	David Howells, Andi Kleen, Andrew Morton



On Tue, 24 Jul 2007, Trent Piepho wrote:
> 
> Speaking of that, why are all the asm functions in arch/i386/lib/string.c
> defined as having a memory clobber, even those which don't modify memory
> like strcmp, strchr, strlen and so on?

That's because the memory clobber will serialize the inline asm with 
anything else that reads or writes memory.

So even if we don't actually change any memory, if we cannot describe what 
we *read*, then we need to tell gcc to not re-order us wrt things that 
could *write*. And the simplest way to do that is to say that you clobber 
memory, even if you don't.

So as a more concrete example: imagine that we're doing a "strlen()" on 
some local variable. We need to tell gcc that that variable has to be 
updated in memory before it schedules the asm. The memory clobber does 
that.
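
Concretely, the kind of asm being talked about looks roughly like this (a
sketch reconstructed from memory, not a verbatim copy of
arch/i386/lib/string.c):

static inline unsigned long sketch_strlen(const char *s)
{
	unsigned long res;
	int d0;

	__asm__ __volatile__(
		"repne\n\t"
		"scasb\n\t"
		"notl %0\n\t"
		"decl %0"
		: "=c" (res), "=&D" (d0)
		: "1" (s), "a" (0), "0" (0xffffffffu)
		: "memory");	/* only reads *s, but see the explanation above */
	return res;
}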

(Yes, the "asm volatile" may do so too, but it's very unclear what the 
"volatile" on the asm actually does, so ..)

		Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24 20:37             ` Andi Kleen
@ 2007-07-24 20:08               ` Linus Torvalds
  2007-07-24 21:31                 ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 92+ messages in thread
From: Linus Torvalds @ 2007-07-24 20:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Trent Piepho, Nick Piggin, Satyam Sharma,
	Linux Kernel Mailing List, David Howells, Andrew Morton



On Tue, 24 Jul 2007, Andi Kleen wrote:
>
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> > 
> > (Yes, the "asm volatile" may do so too, but it's very unclear what the 
> > "volatile" on the asm actually does, so ..)
> 
> Without the volatile they get completely optimized away :/
> [tried that, cost a lot of debugging time -- empty string functions
> cause a lot of strange side effects]

Sure, that's *one* thing that "volatile" guarantees (it guarantees that 
gcc won't optimize away things where the end result isn't actually visibly 
used).

But gcc docs also talk about the other things volatile means, including 
"not significantly moved".

Is that "not significantly moved" already equivalent to "not moved past 
something that can change memory"? Probably not. Which is why it's a good 
idea to have the "memory" clobber there too, because the semantics of the 
"volatile" just aren't very well defined from a code movement standpoint 
(while a general memory clobber is *very* clear: if gcc moves the inline 
asm across another thing that uses/changes memory, that's a gcc bug, plain 
and simple (*)).

		Linus

(*) Other than pure internal gcc spills/reloads. Spilling/reloading local 
registers that haven't had their address taken obviously cannot be seen as 
memory accesses.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24 19:39           ` Linus Torvalds
@ 2007-07-24 20:37             ` Andi Kleen
  2007-07-24 20:08               ` Linus Torvalds
  2007-07-26  1:07             ` Trent Piepho
  1 sibling, 1 reply; 92+ messages in thread
From: Andi Kleen @ 2007-07-24 20:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Trent Piepho, Nick Piggin, Satyam Sharma,
	Linux Kernel Mailing List, David Howells, Andrew Morton

Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
> (Yes, the "asm volatile" may do so too, but it's very unclear what the 
> "volatile" on the asm actually does, so ..)

Without the volatile they get completely optimized away :/
[tried that, cost a lot of debugging time -- empty string functions
cause a lot of strange side effects]

-Andi

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24 20:08               ` Linus Torvalds
@ 2007-07-24 21:31                 ` Jeremy Fitzhardinge
  2007-07-24 21:46                   ` Linus Torvalds
  0 siblings, 1 reply; 92+ messages in thread
From: Jeremy Fitzhardinge @ 2007-07-24 21:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Trent Piepho, Nick Piggin, Satyam Sharma,
	Linux Kernel Mailing List, David Howells, Andrew Morton

Linus Torvalds wrote:
> Sure, that's *one* thing that "volatile" guarantees (it guarantees that 
> gcc won't optimize away things where the end result isn't actually visibly 
> used).
>
> But gcc docs also talk about the other things volatile means, including 
> "not significantly moved".
>   

Actually, it doesn't.  In fact it goes out of its way to say that "asm
volatile" statements can be moved quite a bit, with respect to other
asms, other code, jumps, basic blocks, etc.  The only reliable way to
control the placement of an asm is have the right dependencies.

    The `volatile' keyword indicates that the instruction has important
    side-effects.  GCC will not delete a volatile `asm' if it is reachable.
    (The instruction can still be deleted if GCC can prove that
    control-flow will never reach the location of the instruction.)  Note
    that even a volatile `asm' instruction can be moved relative to other
    code, including across jump instructions.

also:

    An `asm' instruction without any output operands will be treated
    identically to a volatile `asm' instruction.

So there isn't anything very special about "asm volatile".  It's purely
to stop apparently useless code from being removed.
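
A small sketch of the two statements quoted above (the instructions and the
helper are purely illustrative):

static inline void sketch_asm_volatile_rules(int x)
{
	int unused;

	/* No output operands: treated as volatile, kept if reachable. */
	asm("nop");

	/* Has an output, but nothing uses it and there is no volatile:
	 * gcc is free to delete the whole asm. */
	asm("movl %1, %0" : "=r" (unused) : "r" (x));
}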

    J

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24 17:24       ` Linus Torvalds
  2007-07-24 17:42         ` Trond Myklebust
@ 2007-07-24 21:36         ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 92+ messages in thread
From: Benjamin Herrenschmidt @ 2007-07-24 21:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Satyam Sharma, Linux Kernel Mailing List, David Howells,
	Nick Piggin, Andi Kleen, Andrew Morton

On Tue, 2007-07-24 at 10:24 -0700, Linus Torvalds wrote:
> 
> On Tue, 24 Jul 2007, Benjamin Herrenschmidt wrote:
> > 
> > In fact, it's more than that... the bitops that return a value are often
> > used to have hand-made spinlock semantics. I'm sure we would get funky
> > bugs if loads or stores leaked out of the locked region. I think a full
> > "memory" clobber should be kept around for those cases.
> 
> Not helpful.
> 
> The CPU ordering constraints for "test_and_set_bit()" and friends are weak 
> enough that even if you have a full memory clobber, it still wouldn't work 
> as a lock.
>
> That's exactly why we have smp_mb__after_set_bit() and friends. On some 
>    architectures (arm, mips, parisc, powerpc), *that* is where the CPU memory 
> barrier is, because the "test_and_set_bit()" itself is just a 
> cache-coherent operation, not an actual barrier.

Well, as I said, our test_and_set_bit() asm (and in general, the asm for
all the atomic ops that -return- a value) has at least some level of
barriers in it because of that. We do that because people are abusing
those ops as locks. The smp_mb__after_set_bit() I never quite grokked. We do
an mb in there but I suspect we don't need it if it's only ever used after
test_and_set_bit(), because of the above. The smp_mb__before_clear_bit()
makes more sense as it's supposed to give clear_bit() a spin_unlock
semantic.

But we do need the "memory" clobber as well.

That's one reason why I like Nick's bitop locks patches, providing
-explicit- test_and_set_bit_lock() and clear_bit_unlock(), we can fix a
whole lot of things and make sure they have the right barriers and not
more. (We save a few useless barriers on POWER that way in the page lock
path and it's measurable in his benchmark).

Cheers,
Ben.




^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24 18:13           ` Linus Torvalds
  2007-07-24 18:28             ` Trond Myklebust
@ 2007-07-24 21:37             ` Benjamin Herrenschmidt
  2007-07-24 21:55               ` Trond Myklebust
  1 sibling, 1 reply; 92+ messages in thread
From: Benjamin Herrenschmidt @ 2007-07-24 21:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Trond Myklebust, Satyam Sharma, Linux Kernel Mailing List,
	David Howells, Nick Piggin, Andi Kleen, Andrew Morton

On Tue, 2007-07-24 at 11:13 -0700, Linus Torvalds wrote:
> 
> IOW, if you do a spinlock with the bitops, the locking side should be
> able 
> to use a "test_and_set_bit()" on its own, but the unlocking side
> should be
> 
>         smp_mb__before_clear_bit();
>         clear_bit();
> 
> because the ones that don't return a value also don't imply a memory 
> barrier.

Yup. But I much prefer Nick's clear_bit_unlock() :-)
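
For completeness, the open-coded pattern described above comes out roughly
as follows (a sketch only; new code should rather use the lock bitops or
<linux/bit_spinlock.h>):

static inline void sketch_bit_lock(int nr, unsigned long *addr)
{
	/* the value-returning atomic op is enough on the lock side */
	while (test_and_set_bit(nr, addr))
		cpu_relax();
}

static inline void sketch_bit_unlock(int nr, unsigned long *addr)
{
	smp_mb__before_clear_bit();	/* clear_bit() implies no barrier itself */
	clear_bit(nr, addr);
}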

Ben.



^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24 21:31                 ` Jeremy Fitzhardinge
@ 2007-07-24 21:46                   ` Linus Torvalds
  0 siblings, 0 replies; 92+ messages in thread
From: Linus Torvalds @ 2007-07-24 21:46 UTC (permalink / raw)
  To: Jeremy Fitzhardinge
  Cc: Andi Kleen, Trent Piepho, Nick Piggin, Satyam Sharma,
	Linux Kernel Mailing List, David Howells, Andrew Morton



On Tue, 24 Jul 2007, Jeremy Fitzhardinge wrote:
> >
> > But gcc docs also talk about the other things volatile means, including 
> > "not significantly moved".
> 
> Actually, it doesn't.  In fact it goes out of its way to say that "asm
> volatile" statements can be moved quite a bit, with respect to other
> asms, other code, jumps, basic blocks, etc.

Ahh. That's newer.

Historically, gcc manuals used to say "may not be deleted or significantly 
reordered".

So they've weakened what it means, probably exactly because it wasn't 
well-defined before either.

		Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24 21:37             ` Benjamin Herrenschmidt
@ 2007-07-24 21:55               ` Trond Myklebust
  2007-07-24 22:32                 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 92+ messages in thread
From: Trond Myklebust @ 2007-07-24 21:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, Satyam Sharma, Linux Kernel Mailing List,
	David Howells, Nick Piggin, Andi Kleen, Andrew Morton

On Wed, 2007-07-25 at 07:37 +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2007-07-24 at 11:13 -0700, Linus Torvalds wrote:
> > 
> > IOW, if you do a spinlock with the bitops, the locking side should be
> > able 
> > to use a "test_and_set_bit()" on its own, but the unlocking side
> > should be
> > 
> >         smp_mb__before_clear_bit();
> >         clear_bit();
> > 
> > because the ones that don't return a value also don't imply a memory 
> > barrier.
> 
> Yup. But I much prefer Nick's clear_bit_unlock() :-)
> 
> Ben

If you want to use bitops as spinlocks you should rather be using
<linux/bit_spinlock.h>. That also does the right thing w.r.t.
pre-emption and sparse locking annotations.
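
A usage sketch (the caller and the lock bit number are hypothetical, purely
to show the API):

#include <linux/bit_spinlock.h>

#define SKETCH_LOCK_BIT	0	/* hypothetical lock bit in some flags word */

static void sketch_critical_section(unsigned long *flags)
{
	bit_spin_lock(SKETCH_LOCK_BIT, flags);	/* disables preemption, then spins */
	/* ... critical section ... */
	bit_spin_unlock(SKETCH_LOCK_BIT, flags);
}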

Trond


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24 21:55               ` Trond Myklebust
@ 2007-07-24 22:32                 ` Benjamin Herrenschmidt
  2007-07-25  4:10                   ` Nick Piggin
  0 siblings, 1 reply; 92+ messages in thread
From: Benjamin Herrenschmidt @ 2007-07-24 22:32 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Linus Torvalds, Satyam Sharma, Linux Kernel Mailing List,
	David Howells, Nick Piggin, Andi Kleen, Andrew Morton

On Tue, 2007-07-24 at 17:55 -0400, Trond Myklebust wrote:
> 
> If you want to use bitops as spinlocks you should rather be using
> <linux/bit_spinlock.h>. That also does the right thing w.r.t.
> pre-emption and sparse locking annotations.

Heh, I didn't know about those... A bit annoying that I can't override
them in the arch, I might be able to save a barrier or two here. Our
test_and_set_bit() contains both barriers for lock and unlock semantics
to cope with all kinds of abuses, but your bit_spinlock obviously doesn't
need that.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24 22:32                 ` Benjamin Herrenschmidt
@ 2007-07-25  4:10                   ` Nick Piggin
  0 siblings, 0 replies; 92+ messages in thread
From: Nick Piggin @ 2007-07-25  4:10 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Trond Myklebust, Linus Torvalds, Satyam Sharma,
	Linux Kernel Mailing List, David Howells, Andi Kleen,
	Andrew Morton

Benjamin Herrenschmidt wrote:
> On Tue, 2007-07-24 at 17:55 -0400, Trond Myklebust wrote:
> 
>>If you want to use bitops as spinlocks you should rather be using
>><linux/bit_spinlock.h>. That also does the right thing w.r.t.
>>pre-emption and sparse locking annotations.
> 
> 
> Heh, I didn't know about those... A bit annoying that I can't override
> them in the arch, I might be able to save a barrier or two here. Our

I guess the test_and_set_bit_lock / clear_bit_unlock will allow you to
override them in a way.

The big performance problem I see on my powerpc system is not the bit
spinlocks (open-coded or not), but the bit sleep locks.

Anyway, I'll finally send out the lock bitops patches again today...

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 4/8] i386: bitops: Kill volatile-casting of memory addresses
  2007-07-24 17:20       ` Linus Torvalds
  2007-07-24 17:39         ` Jeff Garzik
@ 2007-07-25  4:54         ` Nick Piggin
  1 sibling, 0 replies; 92+ messages in thread
From: Nick Piggin @ 2007-07-25  4:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Satyam Sharma, Linux Kernel Mailing List,
	David Howells, Andi Kleen, Andrew Morton

Linus Torvalds wrote:
> 
> On Tue, 24 Jul 2007, Benjamin Herrenschmidt wrote:

>>Besides, as Nick pointed out, it prevents some valid optimizations.
> 
> 
> No it doesn't. Not the ones on the functions that just do an inline asm.
> 
> The only valid optimization it might break is for "constant_test_bit()", 
> which isn't even using inline asm.

The constant case is probably the most used (at least for page flags), and
is the most important one for me. constant_test_bit may not be using inline
asm, but the volatile pointer target means that gcc reloads the value every
time and can't do much optimisation over it.
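
For reference, the function in question is roughly this (a sketch from
memory, shown here with the volatile qualifier on the pointer already
dropped):

static inline int sketch_constant_test_bit(int nr, const unsigned long *addr)
{
	/* without volatile, gcc may keep addr[nr >> 5] cached in a register
	 * across several constant-bit tests of the same word */
	return ((1UL << (nr & 31)) & addr[nr >> 5]) != 0;
}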

BTW. once volatile goes away, i386 really should start using the C
versions of __set_bit and __clear_bit as well IMO. (at least for the
constant bitnr case), so gcc can potentially optimise better.
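
Something along the lines of the generic C version (a sketch from memory,
for the 32-bit case):

static inline void sketch___set_bit(int nr, unsigned long *addr)
{
	unsigned long mask = 1UL << (nr & 31);
	unsigned long *p = addr + (nr >> 5);

	/* non-atomic; with a constant nr, gcc can fold mask and offset and
	 * combine adjacent flag updates into one read-modify-write */
	*p |= mask;
}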

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-24 19:39           ` Linus Torvalds
  2007-07-24 20:37             ` Andi Kleen
@ 2007-07-26  1:07             ` Trent Piepho
  2007-07-26  1:18               ` Linus Torvalds
  1 sibling, 1 reply; 92+ messages in thread
From: Trent Piepho @ 2007-07-26  1:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Satyam Sharma, Linux Kernel Mailing List,
	David Howells, Andi Kleen, Andrew Morton

On Tue, 24 Jul 2007, Linus Torvalds wrote:
> On Tue, 24 Jul 2007, Trent Piepho wrote:
> >
> > Speaking of that, why are all the asm functions in arch/i386/lib/string.c
> > defined as having a memory clobber, even those which don't modify memory
> > like strcmp, strchr, strlen and so on?
>
> That's because the memory clobber will serialize the inline asm with
> anything else that reads or writes memory.
>
> So even if we don't actually change any memory, if we cannot describe what
> we *read*, then we need to tell gcc to not re-order us wrt things that
> could *write*. And the simplest way to do that is to say that you clobber
> memory, even if you don't.

I went and made a test suite to see what really happened, and this isn't how
it works.  It appears that a memory clobber only tells gcc that the asm
writes to memory.  It does _not_ tell gcc that the asm reads from memory.

It's at http://www.speakeasy.org/~xyzzy/download/opttest.tar.gz
It's only 3k, but there are 16 files so I'm not inlining it.

The suite has a few test C files, each compiled against several different
function variants: plain inline asm, inline volatile asm, inline asm with a
memory clobber, a normal function, a __attribute__((const)) function, and so on.

They are compiled to asm files, and then a perl script scans the asm files to
figure out what optimizations gcc made.

"make test" will compile all the tests and run them through the perl scripts.
"make test1" will just run test1, etc.

It appears that a normal asm, one without volatile or a memory clobber, is
treated like a const function, which returns the output via a struct (not
using pass-by-address).  It has no side-effects, can't read or write global
variables, and can't dereference pointer arguments.

Adding volatile tells gcc the asm has some hidden side-effect.  It still can't
r/w globals or dereference inputs.  But it won't get elided when none of its
outputs are used, won't be merged as a common subexpression, and won't be
treated as a loop invariant.

Adding a memory clobber tells gcc that the asm modifies memory.  It doesn't
modify un-aliased local variables in registers.  It does modify aliased local
variables.  It does not read from memory.  gcc will move or elide a memory
write before an asm with a memory clobber if nothing else (besides the asm)
could see the write.  A memory clobber doesn't count as a side-effect either,
a non-volatile asm without unused outputs will be elided, even if has a memory
clobber.

Specifically, check test6_memasm.s.  The C code looks like this:

extern int a; /* keep asm from being elided for having no used output */
static inline void bar(void) { asm("call bar" : "=m"(a) : : "memory"); }
/* float x can't alias asm's output int a */
void foo(float *x) { x[20] = 1; bar(); x[20] = 2; }

The asm code ends up like this:
foo:
        call bar
        movl    4(%esp), %eax   # x, x
        movl    $0x40000000, 80(%eax)   #,
        ret

Notice that the first write to x[20] was NOT done.  It's also not done for a
volatile asm without a memory clobber.  But if you combine both volatile and a
memory clobber, then it is!  How to explain that?

The difference between test2_volasm.s and test2_normasm.s is hard to explain
too.  It seems like some times gcc forgets that imull is commutative.  It will
emit "imull %edx, %eax" in some cases, but change an asm slightly and it will
decide it must do "imull %eax, %edx ; movl %edx, %eax" for no apparent reason.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-26  1:07             ` Trent Piepho
@ 2007-07-26  1:18               ` Linus Torvalds
  2007-07-26  1:22                 ` Linus Torvalds
  0 siblings, 1 reply; 92+ messages in thread
From: Linus Torvalds @ 2007-07-26  1:18 UTC (permalink / raw)
  To: Trent Piepho
  Cc: Nick Piggin, Satyam Sharma, Linux Kernel Mailing List,
	David Howells, Andi Kleen, Andrew Morton



On Wed, 25 Jul 2007, Trent Piepho wrote:
> 
> Specifically, check test6_memasm.s.  The C code looks like this:
> 
> extern int a; /* keep asm from being elided for having no used output */
> static inline void bar(void) { asm("call bar" : "=m"(a) : : "memory"); }
> /* float x can't alias asm's output int a */
> void foo(float *x) { x[20] = 1; bar(); x[20] = 2; }
> 
> The asm code ends up like this:
> foo:
>         call bar
>         movl    4(%esp), %eax   # x, x
>         movl    $0x40000000, 80(%eax)   #,
>         ret

Hmm. I really think you should take this up with the gcc people. That 
looks like a gcc bug - because there really is nothing that guarantees 
that the asm doesn't change the array that "x" points to, and the asm 
clearly talks about clobbering memory.

> Notice that the first write to x[20] was NOT done.  It's also not done for a
> volatile asm without a memory clobber.  But if you combine both volatile and a
> memory clobber, then it is!  How to explain that?

I can't explain it. I do think you've found a gcc bug.

That said, the kernel mostly uses "asm volatile()" _together_ with a 
memory clobber for these kinds of things, so it sounds like the kernel 
wouldn't be impacted. But you're definitely right - the above report makes 
me worry.

> The difference between test2_volasm.s and test2_normasm.s is hard to explain
> too.  It seems like sometimes gcc forgets that imull is commutative.  It will
> emit "imull %edx, %eax" in some cases, but change an asm slightly and it will
> decide it must do "imull %eax, %edx ; movl %edx, %eax" for no apparent reason.

Well, that's likely just a subtle register allocation issue, and 
understandable. Generating perfect code is impossible, you want to 
generate good code on average.

		Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily
  2007-07-26  1:18               ` Linus Torvalds
@ 2007-07-26  1:22                 ` Linus Torvalds
  0 siblings, 0 replies; 92+ messages in thread
From: Linus Torvalds @ 2007-07-26  1:22 UTC (permalink / raw)
  To: Trent Piepho
  Cc: Nick Piggin, Satyam Sharma, Linux Kernel Mailing List,
	David Howells, Andi Kleen, Andrew Morton



On Wed, 25 Jul 2007, Linus Torvalds wrote:
> 
> Hmm. I really think you should take this up with the gcc people. That 
> looks like a gcc bug - because there really is nothing that guarantees 
> that the asm doesn't change the array that "x" points to, and the asm 
> clearly talks about clobbering memory.

Actually, I take that back. I think gcc does the right thing, and yes, 
it's explained by the memory clobber being just a blind "write to memory" 
rather than read memory. My bad.

It does leave us with very few ways of saying that an asm can *read* 
memory, and so it might be good to have it clarified that "volatile" 
implies that (at least with the memory clobber).

Your examples are good, I think.
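
(One possible way, sketched purely as an illustration and assuming gcc
accepts an array-typed dummy memory input: describe the read explicitly
instead of claiming a blind clobber.)

/* Tell gcc that the asm reads the 16 bytes at p (the size is chosen only
 * for illustration), instead of using a "memory" clobber. */
static inline void sketch_read_hint(const char *p)
{
	asm("" : : "m" (*(const char (*)[16])p));
}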

		Linus

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 0/8] i386: bitops: Cleanup, sanitize, optimize
  2007-07-23 16:05 [PATCH 0/8] i386: bitops: Cleanup, sanitize, optimize Satyam Sharma
                   ` (7 preceding siblings ...)
  2007-07-23 16:06 ` [PATCH 8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions Satyam Sharma
@ 2007-07-30 17:57 ` Denis Vlasenko
  2007-07-31  1:07   ` Satyam Sharma
  8 siblings, 1 reply; 92+ messages in thread
From: Denis Vlasenko @ 2007-07-30 17:57 UTC (permalink / raw)
  To: Satyam Sharma
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andi Kleen, Andrew Morton, Linus Torvalds

Hi Satyam,

On Monday 23 July 2007 17:05, Satyam Sharma wrote:
> There was a lot of bogus stuff that include/asm-i386/bitops.h was doing,
> that was unnecessary and not required for the correctness of those APIs.
> All that superfluous stuff was also unnecessarily disallowing compiler
> optimization possibilities, and making gcc generate code that wasn't as
> beautiful as it could otherwise have been. Hence the following series
> of cleanups (some trivial, some surprising, in no particular order):

[I did read entire thread]

Welcome to the minefield.

This bitops and barrier stuff is complicated. It's very easy to
introduce bugs which are hard to trigger, or happen only with some
specific gcc versions, or only on massively parallel SMP boxes.

You can also make technically correct changes which relax needlessly
strict barrier semantics of some bitops and trigger latent bugs
in code which was unknowingly depending on it.

How you can proceed:

Make a change which you believe is right. Recompile an allyesconfig
kernel with and without this change. Find a piece of assembly code
which becomes different. Check that the new code is correct (and smaller
and/or faster). Post your patch together with example(s) of code
fragments that got better. Be as verbose as needed.

Repeat for each change separately.

This can be painfully slow, but it is less likely to be rejected outright
for fear of introducing difficult bugs.

> * Marking "memory" as clobbered for no good reason

I vaguely remember that "memory" clobbers are needed in some rather
obscure, non-obvious situations. Google for it - Linus wrote about it
to lkml (a few years ago IIRC).

> * Volatile-casting of memory addresses
>   (wholly unnecessary, makes gcc generate bad code)

Do you know any code difference resulting from this patch?

> * Unwarranted use of __asm__ __volatile__ even when those semantics
>   are not required

ditto

> * Unnecessarily harsh definitions of smp_mb__{before, after}_clear_bit()
>   (again, this was like *asking* gcc to generate bad code)

ditto

> My testbox boots/works fine with all these patches (uptime half an hour)

For this kind of thing, you really need something more stressful.
Try to find big-SMP people who are willing to give it a whirl.

> and the compressed bzImage is smaller by about ~2 KB for my .config --

At least it proves that _something_ changed.
--
vda

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [PATCH 0/8] i386: bitops: Cleanup, sanitize, optimize
  2007-07-30 17:57 ` [PATCH 0/8] i386: bitops: Cleanup, sanitize, optimize Denis Vlasenko
@ 2007-07-31  1:07   ` Satyam Sharma
  0 siblings, 0 replies; 92+ messages in thread
From: Satyam Sharma @ 2007-07-31  1:07 UTC (permalink / raw)
  To: Denis Vlasenko
  Cc: Linux Kernel Mailing List, David Howells, Nick Piggin,
	Andi Kleen, Andrew Morton, Linus Torvalds

Hi Denis,


On Mon, 30 Jul 2007, Denis Vlasenko wrote:

> Hi Satyam,
> 
> On Monday 23 July 2007 17:05, Satyam Sharma wrote:
> > There was a lot of bogus stuff that include/asm-i386/bitops.h was doing,
> > that was unnecessary and not required for the correctness of those APIs.
> > All that superfluous stuff was also unnecessarily disallowing compiler
> > optimization possibilities, and making gcc generate code that wasn't as
> > beautiful as it could otherwise have been. Hence the following series
> > of cleanups (some trivial, some surprising, in no particular order):
> 
> [I did read entire thread]
> 
> Welcome to the minefield.
> 
> This bitops and barrier stuff is complicated. It's very easy to
> introduce bugs which are hard to trigger, or happen only with some
> specific gcc versions, or only on massively parallel SMP boxes.
> 
> You can also make technically correct changes which relax needlessly
> strict barrier semantics of some bitops and trigger latent bugs
> in code which was unknowingly depending on it.
> 
> How you can proceed:
> 
> Make a change which you believe is right. Recompile an allyesconfig
> kernel with and without this change. Find a piece of assembly code
> which becomes different. Check that the new code is correct (and smaller
> and/or faster). Post your patch together with example(s) of code
> fragments that got better. Be as verbose as needed.
> 
> Repeat for each change separately.
> 
> This can be painfully slow, but it is less likely to be rejected outright
> for fear of introducing difficult bugs.

Thanks for the tips, I'll remember these in the future :-)


> > * Marking "memory" as clobbered for no good reason
> 
> I vaguely remember that "memory" clobbers are needed in some rather
> obscure, non-obvious situations. Google for it - Linus wrote about it
> to lkml (a few years ago IIRC).
> 
> > * Volatile-casting of memory addresses
> >   (wholly unnecessary, makes gcc generate bad code)
> 
> Do you know any code difference resulting from this patch?
> 
> > * Unwarranted use of __asm__ __volatile__ even when those semantics
> >   are not required
> 
> ditto
> 
> > * Unnecessarily harsh definitions of smp_mb__{before, after}_clear_bit()
> >   (again, this was like *asking* gcc to generate bad code)
> 
> ditto

The problem with most of the patches in this series (other than the two
related to vague gcc docs), as you know, turned out to be the fact that
a lot of code/callers out there rely on the bitops as also being
lock/unlock functions -- and hence the need to disallow both
compiler and CPU reordering / optimizations.

There were a few valid changes in the patchset too, and I'll probably try
to look at them the way you described above (or perhaps it also makes
some sense to keep the bitops as-is -- "it works" today, after all, and
there are other avenues for optimizations/cleanups too).


Thanks again,
Satyam

^ permalink raw reply	[flat|nested] 92+ messages in thread

end of thread, other threads:[~2007-07-31  0:50 UTC | newest]

Thread overview: 92+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-07-23 16:05 [PATCH 0/8] i386: bitops: Cleanup, sanitize, optimize Satyam Sharma
2007-07-23 16:05 ` [PATCH 1/8] i386: bitops: Update/correct comments Satyam Sharma
2007-07-23 16:05 ` [PATCH 2/8] i386: bitops: Rectify bogus "Ir" constraints Satyam Sharma
2007-07-23 16:10   ` Andi Kleen
2007-07-23 16:21     ` Satyam Sharma
2007-07-23 16:30       ` Andi Kleen
2007-07-23 16:36         ` Jan Hubicka
2007-07-23 18:05         ` H. Peter Anvin
2007-07-23 18:28           ` H. Peter Anvin
2007-07-23 17:57   ` Linus Torvalds
2007-07-23 18:14     ` Satyam Sharma
2007-07-23 18:32       ` Andi Kleen
2007-07-23 18:39     ` H. Peter Anvin
2007-07-23 18:52       ` Satyam Sharma
2007-07-23 16:05 ` [PATCH 3/8] i386: bitops: Rectify bogus "+m" constraints Satyam Sharma
2007-07-23 16:37   ` Andi Kleen
2007-07-23 17:15     ` Satyam Sharma
2007-07-23 17:46   ` Linus Torvalds
2007-07-24  9:22   ` David Howells
2007-07-23 16:05 ` [PATCH 4/8] i386: bitops: Kill volatile-casting of memory addresses Satyam Sharma
2007-07-23 17:52   ` Linus Torvalds
2007-07-24  4:19     ` Nick Piggin
2007-07-24  6:23       ` Satyam Sharma
2007-07-24  7:16         ` Nick Piggin
2007-07-24  9:49     ` Benjamin Herrenschmidt
2007-07-24 17:20       ` Linus Torvalds
2007-07-24 17:39         ` Jeff Garzik
2007-07-25  4:54         ` Nick Piggin
2007-07-23 16:05 ` [PATCH 5/8] i386: bitops: Contain warnings fallout from the death of volatiles Satyam Sharma
2007-07-23 16:05 ` [PATCH 6/8] i386: bitops: Don't mark memory as clobbered unnecessarily Satyam Sharma
2007-07-23 16:13   ` Andi Kleen
2007-07-23 16:26     ` Satyam Sharma
2007-07-23 16:33       ` Andi Kleen
2007-07-23 17:12         ` Satyam Sharma
2007-07-23 17:49           ` Jeremy Fitzhardinge
2007-07-23 17:55   ` Linus Torvalds
2007-07-24  9:52     ` Benjamin Herrenschmidt
2007-07-24 17:24       ` Linus Torvalds
2007-07-24 17:42         ` Trond Myklebust
2007-07-24 18:13           ` Linus Torvalds
2007-07-24 18:28             ` Trond Myklebust
2007-07-24 21:37             ` Benjamin Herrenschmidt
2007-07-24 21:55               ` Trond Myklebust
2007-07-24 22:32                 ` Benjamin Herrenschmidt
2007-07-25  4:10                   ` Nick Piggin
2007-07-24 21:36         ` Benjamin Herrenschmidt
2007-07-24  3:57   ` Nick Piggin
2007-07-24  6:38     ` Satyam Sharma
2007-07-24  7:24       ` Nick Piggin
2007-07-24  8:29         ` Satyam Sharma
2007-07-24  8:39           ` Nick Piggin
2007-07-24  8:38         ` Trent Piepho
2007-07-24 19:39           ` Linus Torvalds
2007-07-24 20:37             ` Andi Kleen
2007-07-24 20:08               ` Linus Torvalds
2007-07-24 21:31                 ` Jeremy Fitzhardinge
2007-07-24 21:46                   ` Linus Torvalds
2007-07-26  1:07             ` Trent Piepho
2007-07-26  1:18               ` Linus Torvalds
2007-07-26  1:22                 ` Linus Torvalds
2007-07-24  9:44     ` David Howells
2007-07-24 10:02       ` Satyam Sharma
2007-07-23 16:06 ` [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__ Satyam Sharma
2007-07-23 16:18   ` Andi Kleen
2007-07-23 16:22     ` [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__ II Andi Kleen
2007-07-23 16:32     ` [PATCH 7/8] i386: bitops: Kill needless usage of __asm__ __volatile__ Satyam Sharma
2007-07-23 16:23   ` Jeremy Fitzhardinge
2007-07-23 16:43     ` Satyam Sharma
2007-07-23 17:39       ` Jeremy Fitzhardinge
2007-07-23 18:07         ` Satyam Sharma
2007-07-23 18:28           ` Jeremy Fitzhardinge
2007-07-23 20:29             ` Trent Piepho
2007-07-23 20:40               ` Jeremy Fitzhardinge
2007-07-23 21:06                 ` Trent Piepho
2007-07-23 21:30               ` Andi Kleen
2007-07-23 21:48                 ` Nicholas Miell
2007-07-23 16:06 ` [PATCH 8/8] i386: bitops: smp_mb__{before, after}_clear_bit() definitions Satyam Sharma
2007-07-24  3:53   ` Nick Piggin
2007-07-24  7:34     ` Satyam Sharma
2007-07-24  7:48       ` Jeremy Fitzhardinge
2007-07-24  8:31         ` Nick Piggin
2007-07-24  8:20       ` Nick Piggin
2007-07-24  9:21         ` Satyam Sharma
2007-07-24 10:25           ` Nick Piggin
2007-07-24 11:10             ` Satyam Sharma
2007-07-24 11:32               ` Nick Piggin
2007-07-24 11:45                 ` Satyam Sharma
2007-07-24 12:01                   ` Nick Piggin
2007-07-24 17:12                   ` Linus Torvalds
2007-07-24 19:01                     ` Satyam Sharma
2007-07-30 17:57 ` [PATCH 0/8] i386: bitops: Cleanup, sanitize, optimize Denis Vlasenko
2007-07-31  1:07   ` Satyam Sharma
