* [PATCH 0/3] Bitmap optimisations
From: Matthew Wilcox @ 2017-06-07 14:29 UTC
  To: linux-kernel
  Cc: Andrew Morton, Martin Schwidefsky, Rasmus Villemoes, Matthew Wilcox

From: Matthew Wilcox <mawilcox@microsoft.com>

These three bitmap patches use more efficient specialisations when the
compiler can figure out that it's safe to do so.

Matthew Wilcox (3):
  bitmap: Optimise bitmap_set and bitmap_clear of a single bit
  Turn bitmap_set and bitmap_clear into memset when possible
  bitmap: Use memcmp optimisation in more situations

 include/linux/bitmap.h | 33 +++++++++++++++++++++++++++------
 lib/bitmap.c           |  8 ++++----
 2 files changed, 31 insertions(+), 10 deletions(-)

-- 
2.11.0


* [PATCH 1/3] bitmap: Optimise bitmap_set and bitmap_clear of a single bit
From: Matthew Wilcox @ 2017-06-07 14:29 UTC
  To: linux-kernel
  Cc: Andrew Morton, Martin Schwidefsky, Rasmus Villemoes, Matthew Wilcox

From: Matthew Wilcox <mawilcox@microsoft.com>

We have eight users calling bitmap_clear for a single bit and seventeen
calling bitmap_set for a single bit.  Rather than fix all of them to call
__clear_bit or __set_bit, turn bitmap_clear and bitmap_set into inline
functions and make this special case efficient.
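
For example, a caller like this (a made-up one, not one of the
seventeen) now compiles down to a single instruction instead of a call
into lib/bitmap.c:

	DECLARE_BITMAP(map, 64);

	/* nbits is the constant 1, so this becomes __set_bit(5, map) */
	bitmap_set(map, 5, 1);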

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
---
 include/linux/bitmap.h | 23 ++++++++++++++++++++---
 lib/bitmap.c           |  8 ++++----
 2 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 3b77588a9360..4e0f0c8167af 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -112,9 +112,8 @@ extern int __bitmap_intersects(const unsigned long *bitmap1,
 extern int __bitmap_subset(const unsigned long *bitmap1,
 			const unsigned long *bitmap2, unsigned int nbits);
 extern int __bitmap_weight(const unsigned long *bitmap, unsigned int nbits);
-
-extern void bitmap_set(unsigned long *map, unsigned int start, int len);
-extern void bitmap_clear(unsigned long *map, unsigned int start, int len);
+extern void __bitmap_set(unsigned long *map, unsigned int start, int len);
+extern void __bitmap_clear(unsigned long *map, unsigned int start, int len);
 
 extern unsigned long bitmap_find_next_zero_area_off(unsigned long *map,
 						    unsigned long size,
@@ -315,6 +314,24 @@ static __always_inline int bitmap_weight(const unsigned long *src, unsigned int
 	return __bitmap_weight(src, nbits);
 }
 
+static __always_inline void bitmap_set(unsigned long *map, unsigned int start,
+		unsigned int nbits)
+{
+	if (__builtin_constant_p(nbits) && nbits == 1)
+		__set_bit(start, map);
+	else
+		__bitmap_set(map, start, nbits);
+}
+
+static __always_inline void bitmap_clear(unsigned long *map, unsigned int start,
+		unsigned int nbits)
+{
+	if (__builtin_constant_p(nbits) && nbits == 1)
+		__clear_bit(start, map);
+	else
+		__bitmap_clear(map, start, nbits);
+}
+
 static inline void bitmap_shift_right(unsigned long *dst, const unsigned long *src,
 				unsigned int shift, int nbits)
 {
diff --git a/lib/bitmap.c b/lib/bitmap.c
index 0b66f0e5eb6b..6e354a6d0dbb 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -251,7 +251,7 @@ int __bitmap_weight(const unsigned long *bitmap, unsigned int bits)
 }
 EXPORT_SYMBOL(__bitmap_weight);
 
-void bitmap_set(unsigned long *map, unsigned int start, int len)
+void __bitmap_set(unsigned long *map, unsigned int start, int len)
 {
 	unsigned long *p = map + BIT_WORD(start);
 	const unsigned int size = start + len;
@@ -270,9 +270,9 @@ void bitmap_set(unsigned long *map, unsigned int start, int len)
 		*p |= mask_to_set;
 	}
 }
-EXPORT_SYMBOL(bitmap_set);
+EXPORT_SYMBOL(__bitmap_set);
 
-void bitmap_clear(unsigned long *map, unsigned int start, int len)
+void __bitmap_clear(unsigned long *map, unsigned int start, int len)
 {
 	unsigned long *p = map + BIT_WORD(start);
 	const unsigned int size = start + len;
@@ -291,7 +291,7 @@ void bitmap_clear(unsigned long *map, unsigned int start, int len)
 		*p &= ~mask_to_clear;
 	}
 }
-EXPORT_SYMBOL(bitmap_clear);
+EXPORT_SYMBOL(__bitmap_clear);
 
 /**
  * bitmap_find_next_zero_area_off - find a contiguous aligned zero area
-- 
2.11.0


* [PATCH 2/3] Turn bitmap_set and bitmap_clear into memset when possible
From: Matthew Wilcox @ 2017-06-07 14:29 UTC
  To: linux-kernel
  Cc: Andrew Morton, Martin Schwidefsky, Rasmus Villemoes, Matthew Wilcox

From: Matthew Wilcox <mawilcox@microsoft.com>

Several callers have a constant 'start' and an 'nbits' that is a multiple
of 8, so we can turn those calls into memset().  We don't need the
entirety of 'start' and 'nbits' to be constant; we just need to know
whether they're divisible by 8.
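
As an illustration (again a made-up caller), both arguments here are
provably multiples of 8, so the call can become a single memset():

	DECLARE_BITMAP(map, 128);

	/* start == 0 and nbits == 128 are both multiples of 8, so this
	 * becomes memset(map, 0xff, 128 / 8) */
	bitmap_set(map, 0, 128);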

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
---
 include/linux/bitmap.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 4e0f0c8167af..0b3e4452b054 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -319,6 +319,9 @@ static __always_inline void bitmap_set(unsigned long *map, unsigned int start,
 {
 	if (__builtin_constant_p(nbits) && nbits == 1)
 		__set_bit(start, map);
+	else if (__builtin_constant_p(start & 7) && IS_ALIGNED(start, 8) &&
+		 __builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
+		memset(map + start / 8, 0xff, nbits / 8);
 	else
 		__bitmap_set(map, start, nbits);
 }
@@ -328,6 +331,9 @@ static __always_inline void bitmap_clear(unsigned long *map, unsigned int start,
 {
 	if (__builtin_constant_p(nbits) && nbits == 1)
 		__clear_bit(start, map);
+	else if (__builtin_constant_p(start & 7) && IS_ALIGNED(start, 8) &&
+		 __builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
+		memset(map + start / 8, 0, nbits / 8);
 	else
 		__bitmap_clear(map, start, nbits);
 }
-- 
2.11.0


* [PATCH 3/3] bitmap: Use memcmp optimisation in more situations
From: Matthew Wilcox @ 2017-06-07 14:29 UTC
  To: linux-kernel
  Cc: Andrew Morton, Martin Schwidefsky, Rasmus Villemoes, Matthew Wilcox

From: Matthew Wilcox <mawilcox@microsoft.com>

Commit 7dd968163f ("bitmap: bitmap_equal memcmp optimization") was
rather more restrictive than necessary; we can use memcmp() to implement
bitmap_equal() as long as the number of bits can be proved to be a
multiple of 8.  And architectures other than s390 may be able to make
good use of this optimisation.
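
For instance, in a made-up caller like this, nbits is a multiple of 8
(and too large for the single-word fast path), so the comparison
compiles to a memcmp() on any architecture:

	DECLARE_BITMAP(a, 128);
	DECLARE_BITMAP(b, 128);

	/* 128 bits are 16 bytes; this becomes !memcmp(a, b, 16) */
	if (bitmap_equal(a, b, 128))
		do_something();	/* do_something() is hypothetical */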

Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
---
 include/linux/bitmap.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 0b3e4452b054..26244e0098f0 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -266,10 +266,8 @@ static inline int bitmap_equal(const unsigned long *src1,
 {
 	if (small_const_nbits(nbits))
 		return !((*src1 ^ *src2) & BITMAP_LAST_WORD_MASK(nbits));
-#ifdef CONFIG_S390
-	if (__builtin_constant_p(nbits) && (nbits % BITS_PER_LONG) == 0)
+	if (__builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
 		return !memcmp(src1, src2, nbits / 8);
-#endif
 	return __bitmap_equal(src1, src2, nbits);
 }
 
-- 
2.11.0


* Re: [PATCH 2/3] Turn bitmap_set and bitmap_clear into memset when possible
From: Rasmus Villemoes @ 2017-06-07 21:16 UTC
  To: Matthew Wilcox
  Cc: linux-kernel, Andrew Morton, Martin Schwidefsky, Matthew Wilcox

On Wed, Jun 07 2017, Matthew Wilcox <willy@infradead.org> wrote:

> From: Matthew Wilcox <mawilcox@microsoft.com>
>
> Several callers have a constant 'start' and an 'nbits' that is a multiple
> of 8, so we can turn those calls into memset().  We don't need the
> entirety of 'start' and 'nbits' to be constant; we just need to know
> whether they're divisible by 8.
>
> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
> ---
>  include/linux/bitmap.h | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
> index 4e0f0c8167af..0b3e4452b054 100644
> --- a/include/linux/bitmap.h
> +++ b/include/linux/bitmap.h
> @@ -319,6 +319,9 @@ static __always_inline void bitmap_set(unsigned long *map, unsigned int start,
>  {
>  	if (__builtin_constant_p(nbits) && nbits == 1)
>  		__set_bit(start, map);
> +	else if (__builtin_constant_p(start & 7) && IS_ALIGNED(start, 8) &&
> +		 __builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
> +		memset(map + start / 8, 0xff, nbits / 8);
>  	else

Isn't the pointer arithmetic wrong here? I think you need to cast map to
(char*).
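
For concreteness, something like the following is presumably what's
wanted, counting the offset in bytes instead of longs:

	memset((char *)map + start / 8, 0xff, nbits / 8);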

> 
>  		__bitmap_set(map, start, nbits);
>  }
> @@ -328,6 +331,9 @@ static __always_inline void bitmap_clear(unsigned long *map, unsigned int start,
>  {
>  	if (__builtin_constant_p(nbits) && nbits == 1)
>  		__clear_bit(start, map);
> +	else if (__builtin_constant_p(start & 7) && IS_ALIGNED(start, 8) &&
> +		 __builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
> +		memset(map + start / 8, 0, nbits / 8);
>  	else

Ditto.

Do you have an example of how the generated code changes, both in the
case of actual constants and a case where gcc can see that start and
nbits are byte-aligned without knowing their actual values?


* Re: [PATCH 0/3] Bitmap optimisations
From: Rasmus Villemoes @ 2017-06-07 21:37 UTC
  To: Matthew Wilcox
  Cc: linux-kernel, Andrew Morton, Martin Schwidefsky, Matthew Wilcox

On Wed, Jun 07 2017, Matthew Wilcox <willy@infradead.org> wrote:

> From: Matthew Wilcox <mawilcox@microsoft.com>
>
> These three bitmap patches use more efficient specialisations when the
> compiler can figure out that it's safe to do so.
>

With the pointer arithmetic fixed,

Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>


* Re: [PATCH 3/3] bitmap: Use memcmp optimisation in more situations
From: Andy Shevchenko @ 2017-06-08  1:48 UTC
  To: Matthew Wilcox
  Cc: linux-kernel, Andrew Morton, Martin Schwidefsky,
	Rasmus Villemoes, Matthew Wilcox

On Wed, Jun 7, 2017 at 5:29 PM, Matthew Wilcox <willy@infradead.org> wrote:

> Commit 7dd968163f ("bitmap: bitmap_equal memcmp optimization") was
> rather more restrictive than necessary; we can use memcmp() to implement
> bitmap_equal() as long as the number of bits can be proved to be a
> multiple of 8.  And architectures other than s390 may be able to make
> good use of this optimisation.

> -       if (__builtin_constant_p(nbits) && (nbits % BITS_PER_LONG) == 0)
> +       if (__builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
>                 return !memcmp(src1, src2, nbits / 8);

I'm not sure this is a fully correct change.
What exactly does the '& 7' part do?
To me it looks like you could just drop it.

-- 
With Best Regards,
Andy Shevchenko


* Re: [PATCH 3/3] bitmap: Use memcmp optimisation in more situations
From: Matthew Wilcox @ 2017-06-08  2:55 UTC
  To: Andy Shevchenko
  Cc: linux-kernel, Andrew Morton, Martin Schwidefsky,
	Rasmus Villemoes, Matthew Wilcox

On Thu, Jun 08, 2017 at 04:48:04AM +0300, Andy Shevchenko wrote:
> > Commit 7dd968163f ("bitmap: bitmap_equal memcmp optimization") was
> > rather more restrictive than necessary; we can use memcmp() to implement
> > bitmap_equal() as long as the number of bits can be proved to be a
> > multiple of 8.  And architectures other than s390 may be able to make
> > good use of this optimisation.
> 
> > -       if (__builtin_constant_p(nbits) && (nbits % BITS_PER_LONG) == 0)
> > +       if (__builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
> >                 return !memcmp(src1, src2, nbits / 8);
> 
> I'm not sure this is a fully correct change.
> What exactly does the '& 7' part do?
> To me it looks like you could just drop it.

We only need to know if the bottom 3 bits are 0 to apply this optimisation.
For example, if we have a user which does this:

	nbits = 8;
	if (argle)
		nbits += 8;
	if (bitmap_equal(ptr1, ptr2, nbits))
		blah();

then we can use memcmp() because gcc can deduce that the bottom 3 bits
are never set (try it!  it works!).  We don't need nbits as a whole to
be const.
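
A self-contained sketch you can feed to gcc to see it (the fallback
helper is invented for the example):

	#include <stdbool.h>
	#include <string.h>

	bool slow_equal(const unsigned long *, const unsigned long *,
			unsigned int);

	bool eq(const unsigned long *p1, const unsigned long *p2, bool argle)
	{
		unsigned int nbits = 8;

		if (argle)
			nbits += 8;
		/*
		 * nbits is either 8 or 16, so (nbits & 7) folds to the
		 * constant 0 even though nbits itself is unknown, and
		 * the memcmp() branch is selected at compile time.
		 */
		if (__builtin_constant_p(nbits & 7) && (nbits & 7) == 0)
			return !memcmp(p1, p2, nbits / 8);
		return slow_equal(p1, p2, nbits);
	}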


* Re: [PATCH 3/3] bitmap: Use memcmp optimisation in more situations
From: Andy Shevchenko @ 2017-06-08 12:31 UTC
  To: Matthew Wilcox
  Cc: linux-kernel, Andrew Morton, Martin Schwidefsky,
	Rasmus Villemoes, Matthew Wilcox

On Thu, Jun 8, 2017 at 5:55 AM, Matthew Wilcox <willy@infradead.org> wrote:
> On Thu, Jun 08, 2017 at 04:48:04AM +0300, Andy Shevchenko wrote:
>> > Commit 7dd968163f ("bitmap: bitmap_equal memcmp optimization") was
>> > rather more restrictive than necessary; we can use memcmp() to implement
>> > bitmap_equal() as long as the number of bits can be proved to be a
>> > multiple of 8.  And architectures other than s390 may be able to make
>> > good use of this optimisation.
>>
>> > -       if (__builtin_constant_p(nbits) && (nbits % BITS_PER_LONG) == 0)
>> > +       if (__builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
>> >                 return !memcmp(src1, src2, nbits / 8);
>>
>> I'm not sure this is a fully correct change.
>> What exactly does the '& 7' part do?
>> To me it looks like you could just drop it.
>
> We only need to know if the bottom 3 bits are 0 to apply this optimisation.
> For example, if we have a user which does this:
>
>         nbits = 8;
>         if (argle)
>                 nbits += 8;
>         if (bitmap_equal(ptr1, ptr2, nbits))
>                 blah();
>
> then we can use memcmp() because gcc can deduce that the bottom 3 bits
> are never set (try it!  it works!).  We don't need nbits as a whole to
> be const.

What I'm talking about is that, in my opinion, the two expressions below
are equivalent:
__builtin_constant_p(nbits)
__builtin_constant_p(nbits & 7)

Thus, again, what does the '& 7' do there?

-- 
With Best Regards,
Andy Shevchenko


* Re: [PATCH 3/3] bitmap: Use memcmp optimisation in more situations
From: Rasmus Villemoes @ 2017-06-08 13:43 UTC
  To: Andy Shevchenko
  Cc: Matthew Wilcox, linux-kernel, Andrew Morton, Martin Schwidefsky,
	Matthew Wilcox

On 8 June 2017 at 14:31, Andy Shevchenko <andy.shevchenko@gmail.com> wrote:
> On Thu, Jun 8, 2017 at 5:55 AM, Matthew Wilcox <willy@infradead.org> wrote:
>> We only need to know if the bottom 3 bits are 0 to apply this optimisation.
>> For example, if we have a user which does this:
>>
>>         nbits = 8;
>>         if (argle)
>>                 nbits += 8;
>>         if (bitmap_equal(ptr1, ptr2, nbits))
>>                 blah();
>>
>> then we can use memcmp() because gcc can deduce that the bottom 3 bits
>> are never set (try it!  it works!).  We don't need nbits as a whole to
>> be const.
>
> What I'm talking about is that, in my opinion, the two expressions below
> are equivalent:
> __builtin_constant_p(nbits)
> __builtin_constant_p(nbits & 7)

They are not. Read Matthew's example again. Assuming that argle is
something non-constant (maybe an argument to the function), the value
of nbits at the time of the bitmap_equal call is _not_ a
compile-time-constant. However, if the compiler is smart (which at
least some versions of gcc are), the compiler may deduce that nbits is
either 8 or 16; there really are no other options. Hence it _is_
statically known that nbits is divisible by 8, so the expression
nbits&7 _is_ compile-time constant (0), so gcc can change the
bitmap_equal call to a memcmp call.

(It may then either pass a run-time value of nbits>>3 and emit a
single memcmp call, or it may decide to unroll the two options,
creating two memcmp calls with 1 and 2 as compile-time arguments;
these may or may not then in turn be "inlined" to code doing roughly
*(u8*)p1 == *(u8*)p2 and similarly for u16 casts).


* Re: [PATCH 3/3] bitmap: Use memcmp optimisation in more situations
From: Andy Shevchenko @ 2017-06-08 14:47 UTC
  To: Rasmus Villemoes
  Cc: Matthew Wilcox, linux-kernel, Andrew Morton, Martin Schwidefsky,
	Matthew Wilcox

On Thu, Jun 8, 2017 at 4:43 PM, Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
> On 8 June 2017 at 14:31, Andy Shevchenko <andy.shevchenko@gmail.com> wrote:
>> On Thu, Jun 8, 2017 at 5:55 AM, Matthew Wilcox <willy@infradead.org> wrote:
>>> We only need to know if the bottom 3 bits are 0 to apply this optimisation.
>>> For example, if we have a user which does this:
>>>
>>>         nbits = 8;
>>>         if (argle)
>>>                 nbits += 8;
>>>         if (bitmap_equal(ptr1, ptr2, nbits))
>>>                 blah();
>>>
>>> then we can use memcmp() because gcc can deduce that the bottom 3 bits
>>> are never set (try it!  it works!).  We don't need nbits as a whole to
>>> be const.
>>
>> What I'm talking about is that, in my opinion, the two expressions below
>> are equivalent:
>> __builtin_constant_p(nbits)
>> __builtin_constant_p(nbits & 7)
>
> They are not. Read Matthew's example again. Assuming that argle is
> something non-constant (maybe an argument to the function), the value
> of nbits at the time of the bitmap_equal call is _not_ a
> compile-time-constant. However, if the compiler is smart (which at
> least some versions of gcc are), the compiler may deduce that nbits is
> either 8 or 16; there really are no other options. Hence it _is_
> statically known that nbits is divisible by 8, so the expression
> nbits&7 _is_ compile-time constant (0), so gcc can change the
> bitmap_equal call to a memcmp call.

Yeah, thanks for the detailed explanation.
So, basically, what we require is that
1. the 3 LSBs _are_ a compile-time constant, *and*
2. they are equal to 0.

> (It may then either pass a run-time value of nbits>>3 and emit a
> single memcmp call, or it may decide to unroll the two options,
> creating two memcmp calls with 1 and 2 as compile-time arguments;
> these may or may not then in turn be "inlined" to code doing roughly
> *(u8*)p1 == *(u8*)p2 and similarly for u16 casts).

-- 
With Best Regards,
Andy Shevchenko


* Re: [PATCH 2/3] Turn bitmap_set and bitmap_clear into memset when possible
From: Matthew Wilcox @ 2017-06-27  7:02 UTC
  To: Rasmus Villemoes
  Cc: linux-kernel, Andrew Morton, Martin Schwidefsky, Matthew Wilcox

On Wed, Jun 07, 2017 at 11:16:37PM +0200, Rasmus Villemoes wrote:
> On Wed, Jun 07 2017, Matthew Wilcox <willy@infradead.org> wrote:
> > @@ -319,6 +319,9 @@ static __always_inline void bitmap_set(unsigned long *map, unsigned int start,
> >  {
> >  	if (__builtin_constant_p(nbits) && nbits == 1)
> >  		__set_bit(start, map);
> > +	else if (__builtin_constant_p(start & 7) && IS_ALIGNED(start, 8) &&
> > +		 __builtin_constant_p(nbits & 7) && IS_ALIGNED(nbits, 8))
> > +		memset(map + start / 8, 0xff, nbits / 8);
> >  	else
> 
> Isn't the pointer arithmetic wrong here? I think you need to cast map to
> (char*).

Good catch.  I've added a test case to test_bitmap.c and it catches
the problem.

> Do you have an example of how the generated code changes, both in the
> case of actual constants and a case where gcc can see that start and
> nbits are byte-aligned without knowing their actual values?

Here's an example of start == 0 and nbits being clearly a multiple of 8,
courtesy of btrfs's extent-io-tests.c.

Before:
  f1:   48 8b 5c 24 10          mov    0x10(%rsp),%rbx
  f6:   4c 8b 3c 24             mov    (%rsp),%r15
  fa:   31 f6                   xor    %esi,%esi
  fc:   44 8d 34 dd 00 00 00    lea    0x0(,%rbx,8),%r14d
 103:   00 
 104:   4c 8d 2c dd 00 00 00    lea    0x0(,%rbx,8),%r13
 10b:   00 
 10c:   4c 89 ff                mov    %r15,%rdi
 10f:   44 89 f2                mov    %r14d,%edx
 112:   e8 00 00 00 00          callq  117 <__test_eb_bitmaps+0x77>
                        113: R_X86_64_PC32      bitmap_set-0x4
 117:   31 d2                   xor    %edx,%edx
 119:   31 f6                   xor    %esi,%esi
 11b:   4c 89 e9                mov    %r13,%rcx
 11e:   4c 89 e7                mov    %r12,%rdi
 121:   e8 00 00 00 00          callq  126 <__test_eb_bitmaps+0x86>
                        122: R_X86_64_PC32      extent_buffer_bitmap_set-0x4

After:
  f1:   4c 8b 7c 24 10          mov    0x10(%rsp),%r15
  f6:   be ff 00 00 00          mov    $0xff,%esi
  fb:   48 89 df                mov    %rbx,%rdi
  fe:   4d 89 fe                mov    %r15,%r14
 101:   4e 8d 2c fd 00 00 00    lea    0x0(,%r15,8),%r13
 108:   00 
 109:   41 81 e6 ff ff ff 1f    and    $0x1fffffff,%r14d
 110:   4c 89 f2                mov    %r14,%rdx
 113:   e8 00 00 00 00          callq  118 <__test_eb_bitmaps+0x78>
                        114: R_X86_64_PC32      memset-0x4
 118:   48 8b 2c 24             mov    (%rsp),%rbp
 11c:   31 d2                   xor    %edx,%edx
 11e:   31 f6                   xor    %esi,%esi
 120:   4c 89 e9                mov    %r13,%rcx
 123:   48 89 ef                mov    %rbp,%rdi
 126:   e8 00 00 00 00          callq  12b <__test_eb_bitmaps+0x8b>
                        127: R_X86_64_PC32      extent_buffer_bitmap_set-0x4

It's actually slightly less efficient in the caller (although obviously
memset() is going to execute faster than bitmap_set()).  Partly because
x86 made some odd choices about the behaviour of an 8-bit move instruction
(it leaves the upper 24 bits intact, rather than zeroing them, so gcc
has to use a 32-bit move instruction to put 0xff into the second argument
to memset()), and partly because gcc is seeing us do:

	unsigned long len;
	len = len * 8 / 8;

so it's (very reasonably) transforming that into len &= 0x1fff'ffff'ffff'ffff;
That's partly a quirk of how this particular function is written.


* Re: [PATCH 2/3] Turn bitmap_set and bitmap_clear into memset when possible
From: Rasmus Villemoes @ 2017-06-27  7:58 UTC
  To: Matthew Wilcox
  Cc: linux-kernel, Andrew Morton, Martin Schwidefsky, Matthew Wilcox

> It's actually slightly less efficient in the caller (although obviously
> memset() is going to execute faster than bitmap_set()).  Partly because
> x86 made some odd choices about the behaviour of an 8-bit move instruction
> (it leaves the upper 24 bits intact, rather than zeroing them, so gcc
> has to use a 32-bit move instruction to put 0xff into the second argument
> to memset()),

Heh, I thought gcc knew and made full use of the semantics of memset,
so that only the low byte matters. I suppose there might be
architectures where passing -1 is slightly cheaper (at least in code
size) than 255... [quick checking] indeed, on x86_64, there's no
change in the generated code, but on 32 bit, gcc ends up doing

6a ff                   push   $0xffffffff

instead of

68 ff 00 00 00          push   $0xff
