* Re: [Qemu-devel] [PATCH] copy, dd: simplify and optimize NUL bytes detection
       [not found] <1445522453-14450-1-git-send-email-P@draigBrady.com>
@ 2015-10-22 14:37 ` Eric Blake
  2015-10-22 14:44   ` Paolo Bonzini
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Blake @ 2015-10-22 14:37 UTC (permalink / raw)
  To: Pádraig Brady, coreutils; +Cc: Rusty Russell, qemu-devel


[adding qemu]

On 10/22/2015 08:00 AM, Pádraig Brady wrote:
> * src/system.h (is_nul): Reimplement with a version
> that doesn't require a sentinel after the buffer,
> and which calls down to (the system optimized) memcmp.
> Performance analyzed at http://rusty.ozlabs.org/?p=560

>  /* Return whether the buffer consists entirely of NULs.
> -   Note the word after the buffer must be non NUL. */
> +   From CCAN by Rusty Russell <rusty@rustcorp.com.au>
> +   released under CC0 (Public domain).  */
>  
>  static inline bool _GL_ATTRIBUTE_PURE
>  is_nul (void const *buf, size_t bufsize)
>  {

> +  const unsigned char *p = buf;
> +  size_t len;
> +
> +  /* Check first 16 bytes manually.  */
> +  for (len = 0; len < 16; len++)
> +    {
> +      if (! bufsize)
> +        return true;
> +      if (*p)
> +        return false;
> +      p++;
> +      bufsize--;
> +    }
> +
> +  /* Now we know that's zero, memcmp with self.  */
> +  return memcmp (buf, p, bufsize) == 0;
>  }

Cool trick of using a suitably-aligned overlap-to-self check to then
trigger platform-specific speedups without having to rewrite them by
hand!  qemu is doing a similar check in util/cutils.c:buffer_is_zero()
that could probably benefit from the same idea.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org




* Re: [Qemu-devel] [PATCH] copy, dd: simplify and optimize NUL bytes detection
  2015-10-22 14:37 ` [Qemu-devel] [PATCH] copy, dd: simplify and optimize NUL bytes detection Eric Blake
@ 2015-10-22 14:44   ` Paolo Bonzini
  2015-10-22 15:17     ` Pádraig Brady
  0 siblings, 1 reply; 17+ messages in thread
From: Paolo Bonzini @ 2015-10-22 14:44 UTC (permalink / raw)
  To: Eric Blake, Pádraig Brady, coreutils; +Cc: Rusty Russell, qemu-devel



On 22/10/2015 16:37, Eric Blake wrote:
>> > +  /* Check first 16 bytes manually.  */
>> > +  for (len = 0; len < 16; len++)
>> > +    {
>> > +      if (! bufsize)
>> > +        return true;
>> > +      if (*p)
>> > +        return false;
>> > +      p++;
>> > +      bufsize--;
>> > +    }
>> > +
>> > +  /* Now we know that's zero, memcmp with self.  */
>> > +  return memcmp (buf, p, bufsize) == 0;
>> >  }
> Cool trick of using a suitably-aligned overlap-to-self check to then
> trigger platform-specific speedups without having to rewrite them by
> hand!  qemu is doing a similar check in util/cutils.c:buffer_is_zero()
> that could probably benefit from the same idea.

Nice trick indeed.  On the other hand, the first 16 bytes are enough to
rule out 99.99% (number out of thin air) of the non-zero blocks, so
that's where you want to optimize.  Checking them an unsigned long at a
time, or fetching a few unsigned longs and ORing them together would
probably be the best of both worlds, because you then only use the FPU
in the rare case of a zero buffer.
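
For instance, something along these lines (an untested sketch, not code
from the patch; it assumes bufsize >= 16 so the head read stays in
bounds, and the function name is made up):

#include <stdbool.h>
#include <string.h>

bool is_nul_orwords(const void *buf, size_t bufsize)
{
    const unsigned char *p = buf;
    unsigned long word, acc = 0;
    size_t i;

    /* OR the first 16 bytes together one word at a time; memcpy keeps
     * the loads well-defined even when buf is not word-aligned, and the
     * compiler turns it into plain loads.
     */
    for (i = 0; i < 16; i += sizeof(word)) {
        memcpy(&word, p + i, sizeof(word));
        acc |= word;
    }
    if (acc)
        return false;

    /* Only the rare all-zero head reaches the memcmp-with-self scan. */
    return memcmp(buf, p + 16, bufsize - 16) == 0;
}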

Paolo


* Re: [Qemu-devel] [PATCH] copy, dd: simplify and optimize NUL bytes detection
  2015-10-22 14:44   ` Paolo Bonzini
@ 2015-10-22 15:17     ` Pádraig Brady
  2015-10-22 15:31       ` Paolo Bonzini
  2015-10-22 15:47       ` Bernhard Voelker
  0 siblings, 2 replies; 17+ messages in thread
From: Pádraig Brady @ 2015-10-22 15:17 UTC (permalink / raw)
  To: Paolo Bonzini, Eric Blake, coreutils; +Cc: Rusty Russell, qemu-devel

On 22/10/15 15:44, Paolo Bonzini wrote:
> 
> 
> On 22/10/2015 16:37, Eric Blake wrote:
>>>> +  /* Check first 16 bytes manually.  */
>>>> +  for (len = 0; len < 16; len++)
>>>> +    {
>>>> +      if (! bufsize)
>>>> +        return true;
>>>> +      if (*p)
>>>> +        return false;
>>>> +      p++;
>>>> +      bufsize--;
>>>> +    }
>>>> +
>>>> +  /* Now we know that's zero, memcmp with self.  */
>>>> +  return memcmp (buf, p, bufsize) == 0;
>>>>  }
>> Cool trick of using a suitably-aligned overlap-to-self check to then
>> trigger platform-specific speedups without having to rewrite them by
>> hand!  qemu is doing a similar check in util/cutils.c:buffer_is_zero()
>> that could probably benefit from the same idea.
> 
> Nice trick indeed.  On the other hand, the first 16 bytes are enough to
> rule out 99.99% (number out of thin air) of the non-zero blocks, so
> that's where you want to optimize.  Checking them an unsigned long at a
> time, or fetching a few unsigned longs and ORing them together would
> probably be the best of both worlds, because you then only use the FPU
> in the rare case of a zero buffer.

Note the above does break early if non zero detected in first 16 bytes.

Also I suspect the extra conditions involved in using longs
for just the first 16 bytes would outweigh the benefits?
I.E. the first simple loop probably breaks early, and if not
has the added benefit of "priming the pumps" for the subsequent memcmp().

BTW Rusty has a benchmark framework for this as referenced from:
http://rusty.ozlabs.org/?p=560

cheers,
Pádraig.


* Re: [Qemu-devel] [PATCH] copy, dd: simplify and optimize NUL bytes detection
  2015-10-22 15:17     ` Pádraig Brady
@ 2015-10-22 15:31       ` Paolo Bonzini
  2015-10-22 16:02         ` Eric Blake
  2015-10-22 15:47       ` Bernhard Voelker
  1 sibling, 1 reply; 17+ messages in thread
From: Paolo Bonzini @ 2015-10-22 15:31 UTC (permalink / raw)
  To: Pádraig Brady, Eric Blake, coreutils; +Cc: Rusty Russell, qemu-devel



On 22/10/2015 17:17, Pádraig Brady wrote:
>> > Nice trick indeed.  On the other hand, the first 16 bytes are enough to
>> > rule out 99.99% (number out of thin air) of the non-zero blocks, so
>> > that's where you want to optimize.  Checking them an unsigned long at a
>> > time, or fetching a few unsigned longs and ORing them together would
>> > probably be the best of both worlds, because you then only use the FPU
>> > in the rare case of a zero buffer.
> Note the above does break early if non zero detected in first 16 bytes.

Yes, but it loops unnecessarily if the non-zero byte is the third or fourth.

> Also I suspect the extra conditions involved in using longs
> for just the first 16 bytes would outweigh the benefits?

Only if your machine cannot do unaligned loads.  If it can, you can
align the length instead of the buffer.  memcmp will take care of
aligning the buffer (with some luck it won't have to, e.g. if buf is
0x12340002 and length = 4094).  On x86 unaligned "unsigned long" loads
are basically free as long as they don't cross a cache line.

> BTW Rusty has a benchmark framework for this as referenced from:
> http://rusty.ozlabs.org/?p=560

I missed his benchmark framework so I wrote another one, here it is:
https://gist.githubusercontent.com/bonzini/9a95b0e02d1ceb60af9e/raw/7bc42ddccdb6c42fea3db58e0539d0443d0e6dc6/memeqzero.c

Paolo


* Re: [Qemu-devel] [PATCH] copy, dd: simplify and optimize NUL bytes detection
  2015-10-22 15:17     ` Pádraig Brady
  2015-10-22 15:31       ` Paolo Bonzini
@ 2015-10-22 15:47       ` Bernhard Voelker
  2015-10-22 15:52         ` Paolo Bonzini
  2015-10-22 15:55         ` Eric Blake
  1 sibling, 2 replies; 17+ messages in thread
From: Bernhard Voelker @ 2015-10-22 15:47 UTC (permalink / raw)
  To: Pádraig Brady, Paolo Bonzini, Eric Blake, coreutils
  Cc: Rusty Russell, qemu-devel

On 10/22/2015 05:17 PM, Pádraig Brady wrote:
> On 22/10/15 15:44, Paolo Bonzini wrote:
>>
>>
>> On 22/10/2015 16:37, Eric Blake wrote:
>>>>> +  /* Check first 16 bytes manually.  */
>>>>> +  for (len = 0; len < 16; len++)
>>>>> +    {
>>>>> +      if (! bufsize)
>>>>> +        return true;
>>>>> +      if (*p)
>>>>> +        return false;
>>>>> +      p++;
>>>>> +      bufsize--;
>>>>> +    }
>>>>> +
>>>>> +  /* Now we know that's zero, memcmp with self.  */
>>>>> +  return memcmp (buf, p, bufsize) == 0;
>>>>>   }
>>> Cool trick of using a suitably-aligned overlap-to-self check to then
>>> trigger platform-specific speedups without having to rewrite them by
>>> hand!  qemu is doing a similar check in util/cutils.c:buffer_is_zero()
>>> that could probably benefit from the same idea.
>>
>> Nice trick indeed.  On the other hand, the first 16 bytes are enough to
>> rule out 99.99% (number out of thin air) of the non-zero blocks, so
>> that's where you want to optimize.  Checking them an unsigned long at a
>> time, or fetching a few unsigned longs and ORing them together would
>> probably be the best of both worlds, because you then only use the FPU
>> in the rare case of a zero buffer.
>
> Note the above does break early if non zero detected in first 16 bytes.
>
> Also I suspect the extra conditions involved in using longs
> for just the first 16 bytes would outweigh the benefits?
> I.E. the first simple loop probably breaks early, and if not
> has the added benefit of "priming the pumps" for the subsequent memcmp().

what about spending some 16 bytes of memory and doing the memcmp on the whole
buffer?

   static unsigned char p[] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
   return 0 == memcmp (p, buf, bufsize);

Have a nice day,
Berny


* Re: [Qemu-devel] [PATCH] copy, dd: simplify and optimize NUL bytes detection
  2015-10-22 15:47       ` Bernhard Voelker
@ 2015-10-22 15:52         ` Paolo Bonzini
  2015-10-22 15:55         ` Eric Blake
  1 sibling, 0 replies; 17+ messages in thread
From: Paolo Bonzini @ 2015-10-22 15:52 UTC (permalink / raw)
  To: Bernhard Voelker, Pádraig Brady, Eric Blake, coreutils
  Cc: Rusty Russell, qemu-devel



On 22/10/2015 17:47, Bernhard Voelker wrote:
>> Note the above does break early if non zero detected in first 16 bytes.
>>
>> Also I suspect the extra conditions involved in using longs
>> for just the first 16 bytes would outweigh the benefits?
>> I.E. the first simple loop probably breaks early, and if not
>> has the added benefit of "priming the pumps" for the subsequent memcmp().
> 
> what about spending some 16 bytes of memory and doing the memcmp on the whole
> buffer?
> 
>   static unsigned char p[] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
>   return 0 == memcmp (p, buf, bufsize);

memcmp has high setup costs, because the compiler doesn't know it as
well as memcpy, so no. :(

Nerd sniping at its best!

Paolo


* Re: [Qemu-devel] [PATCH] copy, dd: simplify and optimize NUL bytes detection
  2015-10-22 15:47       ` Bernhard Voelker
  2015-10-22 15:52         ` Paolo Bonzini
@ 2015-10-22 15:55         ` Eric Blake
  2015-10-23 10:59           ` Bernhard Voelker
  1 sibling, 1 reply; 17+ messages in thread
From: Eric Blake @ 2015-10-22 15:55 UTC (permalink / raw)
  To: Bernhard Voelker, Pádraig Brady, Paolo Bonzini, coreutils
  Cc: Rusty Russell, qemu-devel


On 10/22/2015 09:47 AM, Bernhard Voelker wrote:

>> Also I suspect the extra conditions involved in using longs
>> for just the first 16 bytes would outweigh the benefits?
>> I.E. the first simple loop probably breaks early, and if not
>> has the added benefit of "priming the pumps" for the subsequent memcmp().
> 
> what about spending some 16 bytes of memory and doing the memcmp on the whole
> buffer?
> 
>   static unsigned char p[] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
>   return 0 == memcmp (p, buf, bufsize);

Won't work over the whole bufsize for anything larger than 16 unless you
do repeated memcmp()s.

Or are you suggesting that the first 16-byte head validation be done
against a static buffer via one memcmp(), followed by the other
overlap-self memcmp() for the rest of the buffer?  But I suspect that
for short lengths, it is more efficient to do an unrolled loop than to
make a function call (where the function call itself will probably just
do an unrolled loop on the short length).  You want the short case to be
fast, and the real speedup comes by delegating as much of the long case
as possible to the system memcmp() optimizations.
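
For concreteness, the two-memcmp variant I'm reading into your
suggestion would be roughly this (a sketch only, assuming bufsize >= 16;
the function name is made up):

#include <stdbool.h>
#include <string.h>

static bool
is_nul_two_memcmp (void const *buf, size_t bufsize)
{
  static unsigned char const zeros[16];   /* static, hence all NUL */

  /* Head against a known-zero block, then the overlap-self trick
     for the remainder.  */
  return memcmp (buf, zeros, 16) == 0
         && memcmp (buf, (unsigned char const *) buf + 16, bufsize - 16) == 0;
}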

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org




* Re: [Qemu-devel] [PATCH] copy, dd: simplify and optimize NUL bytes detection
  2015-10-22 15:31       ` Paolo Bonzini
@ 2015-10-22 16:02         ` Eric Blake
  2015-10-22 16:14           ` Paolo Bonzini
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Blake @ 2015-10-22 16:02 UTC (permalink / raw)
  To: Paolo Bonzini, Pádraig Brady, coreutils; +Cc: Rusty Russell, qemu-devel


On 10/22/2015 09:31 AM, Paolo Bonzini wrote:

> Only if your machine cannot do unaligned loads.  If it can, you can
> align the length instead of the buffer.  memcmp will take care of
> aligning the buffer (with some luck it won't have to, e.g. if buf is
> 0x12340002 and length = 4094).  On x86 unaligned "unsigned long" loads
> are basically free as long as they don't cross a cache line.
> 
>> BTW Rusty has a benchmark framework for this as referenced from:
>> http://rusty.ozlabs.org/?p=560
> 
> I missed his benchmark framework so I wrote another one, here it is:
> https://gist.githubusercontent.com/bonzini/9a95b0e02d1ceb60af9e/raw/7bc42ddccdb6c42fea3db58e0539d0443d0e6dc6/memeqzero.c

I see a bug in there:

static __attribute__((noinline)) bool memeqzero4_paolo(const void *data, size_t length)
{
    const unsigned char *p = data;
    unsigned long word;

    while (__builtin_expect(length & (sizeof(word) - 1), 0)) {
        if (*p)
            return false;
        p++;
        length--;
    }
    while (__builtin_expect(length & (16 - sizeof(word)), 0)) {
        memcpy(&word, p, sizeof(word));
        if (word)
            return false;
        p += sizeof(word);
        length -= sizeof(word);
    }

     /* Now we know that's zero, memcmp with self. */
     return length == 0 || memcmp(data, p, length) == 0;
}

If length is already aligned on entry, then you are calling memcmp(data,
data, length) which is trivially 0 for all input, rather than checking
for actual NUL bytes.  You MUST check at least one byte manually before
handing off to memcmp(), and having the distance between data and p be a
multiple of a cache-line (well, blindly picking 16 as Rusty did is a
close approximation) will probably let the libc memcmp() run a lot
faster than if memcmp() has to deal with unaligned pointers (where it
can optimize for one of the two reads to be aligned, but the other read
is unaligned - even if the two reads are close enough to be hitting the
same cache line, you are still suffering from some performance slowdowns).

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org




* Re: [Qemu-devel] [PATCH] copy, dd: simplify and optimize NUL bytes detection
  2015-10-22 16:02         ` Eric Blake
@ 2015-10-22 16:14           ` Paolo Bonzini
  2015-10-22 17:39             ` Radim Krčmář
  0 siblings, 1 reply; 17+ messages in thread
From: Paolo Bonzini @ 2015-10-22 16:14 UTC (permalink / raw)
  To: Eric Blake, Pádraig Brady, coreutils; +Cc: Rusty Russell, qemu-devel



On 22/10/2015 18:02, Eric Blake wrote:
> On 10/22/2015 09:31 AM, Paolo Bonzini wrote:
> 
>> Only if your machine cannot do unaligned loads.  If it can, you can
>> align the length instead of the buffer.  memcmp will take care of
>> aligning the buffer (with some luck it won't have to, e.g. if buf is
>> 0x12340002 and length = 4094).  On x86 unaligned "unsigned long" loads
>> are basically free as long as they don't cross a cache line.
>>
>>> BTW Rusty has a benchmark framework for this as referenced from:
>>> http://rusty.ozlabs.org/?p=560
>>
>> I missed his benchmark framework so I wrote another one, here it is:
>> https://gist.githubusercontent.com/bonzini/9a95b0e02d1ceb60af9e/raw/7bc42ddccdb6c42fea3db58e0539d0443d0e6dc6/memeqzero.c
> 
> I see a bug in there:

Of course.  You shouldn't have told me what the bug was, I deserved
to look for it myself. :)

Fixed version, same performance (7/24/91/5002 for memeqzero4_rusty,
7/10/59/4963 for mine, so 30-40 clock cycles saved if length >= 16):

bool memeqzero4_paolo(const void *data, size_t length)
{
    const unsigned char *p = data;
    unsigned long word;

    while (__builtin_expect(length & (sizeof(word) - 1), 0)) {
        if (*p)
            return false;
        p++;
        length--;
        if (!length)
            return true;
    }

    /* We must always read one byte or word, even if everything is aligned!
     * Otherwise, memcmp(data, data, length) is trivially true.
     */
    for (;;) {
        memcpy(&word, p, sizeof(word));
        if (word)
            return false;
        if (__builtin_expect(length & (16 - sizeof(word)), 0) == 0)
            break;
        p += sizeof(word);
        length -= sizeof(word);
        if (!length)
            return true;
    }

     /* Now we know that's zero, memcmp with self. */
     return memcmp(data, p, length) == 0;
}


* Re: [Qemu-devel] [PATCH] copy, dd: simplify and optimize NUL bytes detection
  2015-10-22 16:14           ` Paolo Bonzini
@ 2015-10-22 17:39             ` Radim Krčmář
  2015-10-22 19:47               ` Paolo Bonzini
  0 siblings, 1 reply; 17+ messages in thread
From: Radim Krčmář @ 2015-10-22 17:39 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Pádraig Brady, Rusty Russell, coreutils, qemu-devel

2015-10-22 18:14+0200, Paolo Bonzini:
> On 22/10/2015 18:02, Eric Blake wrote:
>> I see a bug in there:
> 
> Of course.  You shouldn't have told me what the bug was, I deserved
> to look for it myself. :)

It rather seems that you don't want spoilers, :)

I see two bugs now.

> bool memeqzero4_paolo(const void *data, size_t length)
> {
>     const unsigned char *p = data;
>     unsigned long word;
> 
>     while (__builtin_expect(length & (sizeof(word) - 1), 0)) {
>         if (*p)
>             return false;
>         p++;
>         length--;
>         if (!length)
>             return true;
>     }
> 
>     /* We must always read one byte or word, even if everything is aligned!
>      * Otherwise, memcmp(data, data, length) is trivially true.
>      */
>     for (;;) {
>         memcpy(&word, p, sizeof(word));
>         if (word)
>             return false;
>         if (__builtin_expect(length & (16 - sizeof(word)), 0) == 0)
>             break;
>         p += sizeof(word);
>         length -= sizeof(word);
>         if (!length)
>             return true;
>     }
> 
>      /* Now we know that's zero, memcmp with self. */
>      return memcmp(data, p, length) == 0;
> }


* Re: [Qemu-devel] [PATCH] copy, dd: simplify and optimize NUL bytes detection
  2015-10-22 17:39             ` Radim Krčmář
@ 2015-10-22 19:47               ` Paolo Bonzini
  2015-10-23 11:12                 ` Pádraig Brady
  2015-10-23 11:15                 ` Pádraig Brady
  0 siblings, 2 replies; 17+ messages in thread
From: Paolo Bonzini @ 2015-10-22 19:47 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: Pádraig Brady, Rusty Russell, coreutils, qemu-devel



On 22/10/2015 19:39, Radim Krčmář wrote:
> 2015-10-22 18:14+0200, Paolo Bonzini:
>> On 22/10/2015 18:02, Eric Blake wrote:
>>> I see a bug in there:
>>
>> Of course.  You shouldn't have told me what the bug was, I deserved
>> to look for it myself. :)
> 
> It rather seems that you don't want spoilers, :)
> 
> I see two bugs now.

Me too. :)  But Rusty surely has some testcases in case he wants to
adopt some of the ideas here. O:-)

Paolo

>> bool memeqzero4_paolo(const void *data, size_t length)
>> {
>>     const unsigned char *p = data;
>>     unsigned long word;
>>
>>     while (__builtin_expect(length & (sizeof(word) - 1), 0)) {
>>         if (*p)
>>             return false;
>>         p++;
>>         length--;
>>         if (!length)
>>             return true;
>>     }
>>
>>     /* We must always read one byte or word, even if everything is aligned!
>>      * Otherwise, memcmp(data, data, length) is trivially true.
>>      */
>>     for (;;) {
>>         memcpy(&word, p, sizeof(word));
>>         if (word)
>>             return false;
>>         if (__builtin_expect(length & (16 - sizeof(word)), 0) == 0)
>>             break;
>>         p += sizeof(word);
>>         length -= sizeof(word);
>>         if (!length)
>>             return true;
>>     }
>>
>>      /* Now we know that's zero, memcmp with self. */
>>      return memcmp(data, p, length) == 0;
>> }


* Re: [Qemu-devel] [PATCH] copy, dd: simplify and optimize NUL bytes detection
  2015-10-22 15:55         ` Eric Blake
@ 2015-10-23 10:59           ` Bernhard Voelker
  0 siblings, 0 replies; 17+ messages in thread
From: Bernhard Voelker @ 2015-10-23 10:59 UTC (permalink / raw)
  To: Eric Blake, Pádraig Brady, Paolo Bonzini, coreutils
  Cc: Rusty Russell, qemu-devel

On 10/22/2015 05:55 PM, Eric Blake wrote:
> On 10/22/2015 09:47 AM, Bernhard Voelker wrote:
>
>>> Also I suspect the extra conditions involved in using longs
>>> for just the first 16 bytes would outweigh the benefits?
>>> I.E. the first simple loop probably breaks early, and if not
>>> has the added benefit of "priming the pumps" for the subsequent memcmp().
>>
>> what about spending some 16 bytes of memory and doing the memcmp on the whole
>> buffer?
>>
>>    static unsigned char p[] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
>>    return 0 == memcmp (p, buf, bufsize);
>
> Won't work over the whole bufsize for anything larger than 16 unless you
> do repeated memcmp()s.
>
> Or are you suggesting that the first 16-byte head validation be done
> against a static buffer via one memcmp(), followed by the other
> overlap-self memcmp() for the rest of the buffer?  But I suspect that
> for short lengths, it is more efficient to do an unrolled loop than to
> make a function call (where the function call itself will probably just
> do an unrolled loop on the short length).  You want the short case to be
> fast, and the real speedup comes by delegating as much of the long case
> as possible to the system memcmp() optimizations.

Of course, you're completely right.  My example above was over-simplified
and therefore plain wrong, sorry.

Aiming at tools like dd(1), I played a bit with the idea of a pre-known-zeroed
buffer in front of the real payload data, i.e. having a buffer of 16 + 64k
where the first 16 bytes are all NULs, thus being able to immediately use
the overlap-self memcmp() with the payload starting at offset 16.
Tests showed that you are right with your other suspicion, too: the overhead
of calling memcmp() for small buffer sizes makes it less effective than Rusty's way.
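
Roughly what I tried looked like this (a sketch with made-up names,
not the exact test code):

#include <stdbool.h>
#include <string.h>

enum { HEAD = 16, BLKSIZE = 64 * 1024 };

/* The first HEAD bytes stay NUL; the payload is read into iobuf + HEAD,
   so one overlap-self memcmp covers the whole payload.  */
static unsigned char iobuf[HEAD + BLKSIZE];

static bool
payload_is_nul (size_t nread)
{
  return memcmp (iobuf, iobuf + HEAD, nread) == 0;
}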

Therefore +1 for Padraig's patch.

Have a nice day,
Berny


* Re: [Qemu-devel] [PATCH] copy, dd: simplify and optimize NUL bytes detection
  2015-10-22 19:47               ` Paolo Bonzini
@ 2015-10-23 11:12                 ` Pádraig Brady
  2015-10-23 11:14                   ` Paolo Bonzini
  2015-10-23 11:15                 ` Pádraig Brady
  1 sibling, 1 reply; 17+ messages in thread
From: Pádraig Brady @ 2015-10-23 11:12 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Rusty Russell, coreutils, qemu-devel

On 22/10/15 20:47, Paolo Bonzini wrote:
> 
> 
> On 22/10/2015 19:39, Radim Krčmář wrote:
>> 2015-10-22 18:14+0200, Paolo Bonzini:
>>> On 22/10/2015 18:02, Eric Blake wrote:
>>>> I see a bug in there:
>>>
>>> Of course.  You shouldn't have told me what the bug was, I deserved
>>> to look for it myself. :)
>>
>> It rather seems that you don't want spoilers, :)
>>
>> I see two bugs now.
> 
> Me too. :)  But Rusty surely has some testcases in case he wants to
> adopt some of the ideas here. O:-)

For completeness this should address the bugs I think?

bool memeqzero4_paolo(const void *data, size_t length)
{
    const unsigned char *p = data;
    unsigned long word;

    if (!length)
        return true;

    /* Check len bytes not aligned on a word.  */
    while (__builtin_expect(length & (sizeof(word) - 1), 0)) {
        if (*p)
            return false;
        p++;
        length--;
        if (!length)
            return true;
    }

    /* Check up to 16 bytes a word at a time.  */
    for (;;) {
        memcpy(&word, p, sizeof(word));
        if (word)
            return false;
        p += sizeof(word);
        length -= sizeof(word);
        if (!length)
            return true;
        if (__builtin_expect(length & 15, 0) == 0)
            break;
    }

     /* Now we know that's zero, memcmp with self. */
     return memcmp(data, p, length) == 0;
}

compiled with gcc 5.1.1 -march=native -O2 on an i3-2310M
we get these timings:

bytes  1	8	16	512	65536
---------------------------------------------
Rusty: 10	28	59	114	6510
Paolo: 9	9	12	75	6495

It's also smaller, especially at -O3:

$ nm -S a.out | grep memeqzero4
... 000000000000005b t memeqzero4_paolo
... 0000000000000063 t memeqzero4_rusty
$ gcc -march=native -O3 memeqzero.c
$ nm -S a.out | grep memeqzero4
... 000000000000005b t memeqzero4_paolo
... 0000000000000133 t memeqzero4_rusty

cheers,
Pádraig.


* Re: [Qemu-devel] [PATCH] copy, dd: simplify and optimize NUL bytes detection
  2015-10-23 11:12                 ` Pádraig Brady
@ 2015-10-23 11:14                   ` Paolo Bonzini
  0 siblings, 0 replies; 17+ messages in thread
From: Paolo Bonzini @ 2015-10-23 11:14 UTC (permalink / raw)
  To: Pádraig Brady, Radim Krčmář
  Cc: Rusty Russell, coreutils, qemu-devel



On 23/10/2015 13:12, Pádraig Brady wrote:
> On 22/10/15 20:47, Paolo Bonzini wrote:
>>
>>
>> On 22/10/2015 19:39, Radim Krčmář wrote:
>>> 2015-10-22 18:14+0200, Paolo Bonzini:
>>>> On 22/10/2015 18:02, Eric Blake wrote:
>>>>> I see a bug in there:
>>>>
>>>> Of course.  You shouldn't have told me what the bug was, I deserved
>>>> to look for it myself. :)
>>>
>>> It rather seems that you don't want spoilers, :)
>>>
>>> I see two bugs now.
>>
>> Me too. :)  But Rusty surely has some testcases in case he wants to
>> adopt some of the ideas here. O:-)
> 
> For completeness this should address the bugs I think?

Yes, thanks! :D

Paolo

> bool memeqzero4_paolo(const void *data, size_t length)
> {
>     const unsigned char *p = data;
>     unsigned long word;
> 
>     if (!length)
>         return true;
> 
>     /* Check len bytes not aligned on a word.  */
>     while (__builtin_expect(length & (sizeof(word) - 1), 0)) {
>         if (*p)
>             return false;
>         p++;
>         length--;
>         if (!length)
>             return true;
>     }
> 
>     /* Check up to 16 bytes a word at a time.  */
>     for (;;) {
>         memcpy(&word, p, sizeof(word));
>         if (word)
>             return false;
>         p += sizeof(word);
>         length -= sizeof(word);
>         if (!length)
>             return true;
>         if (__builtin_expect(length & 15, 0) == 0)
>             break;
>     }
> 
>      /* Now we know that's zero, memcmp with self. */
>      return memcmp(data, p, length) == 0;
> }
> 
> compiled with gcc 5.1.1 -march=native -O2 on an i3-2310M
> we get these timings:
> 
> bytes  1	8	16	512	65536
> ---------------------------------------------
> Rusty: 10	28	59	114	6510
> Paolo: 9	9	12	75	6495
> 
> It's also smaller, especially at -O3:
> 
> $ nm -S a.out | grep memeqzero4
> ... 000000000000005b t memeqzero4_paolo
> ... 0000000000000063 t memeqzero4_rusty
> $ gcc -march=native -O3 memeqzero.c
> $ nm -S a.out | grep memeqzero4
> ... 000000000000005b t memeqzero4_paolo
> ... 0000000000000133 t memeqzero4_rusty
> 
> cheers,
> Pádraig.
> 


* Re: [Qemu-devel] [PATCH] copy, dd: simplify and optimize NUL bytes detection
  2015-10-22 19:47               ` Paolo Bonzini
  2015-10-23 11:12                 ` Pádraig Brady
@ 2015-10-23 11:15                 ` Pádraig Brady
  2015-10-24  2:24                   ` Pádraig Brady
  1 sibling, 1 reply; 17+ messages in thread
From: Pádraig Brady @ 2015-10-23 11:15 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Rusty Russell, coreutils, qemu-devel

On 22/10/15 20:47, Paolo Bonzini wrote:
> 
> 
> On 22/10/2015 19:39, Radim Krčmář wrote:
>> 2015-10-22 18:14+0200, Paolo Bonzini:
>>> On 22/10/2015 18:02, Eric Blake wrote:
>>>> I see a bug in there:
>>>
>>> Of course.  You shouldn't have told me what the bug was, I deserved
>>> to look for it myself. :)
>>
>> It rather seems that you don't want spoilers, :)
>>
>> I see two bugs now.
> 
> Me too. :)  But Rusty surely has some testcases in case he wants to
> adopt some of the ideas here. O:-)

For completeness this should address the bugs I think?

bool memeqzero4_paolo(const void *data, size_t length)
{
    const unsigned char *p = data;
    unsigned long word;

    if (!length)
        return true;

    /* Check len bytes not aligned on a word.  */
    while (__builtin_expect(length & (sizeof(word) - 1), 0)) {
        if (*p)
            return false;
        p++;
        length--;
        if (!length)
            return true;
    }

    /* Check up to 16 bytes a word at a time.  */
    for (;;) {
        memcpy(&word, p, sizeof(word));
        if (word)
            return false;
        p += sizeof(word);
        length -= sizeof(word);
        if (!length)
            return true;
        if (__builtin_expect(length & 15, 0) == 0)
            break;
    }

     /* Now we know that's zero, memcmp with self. */
     return memcmp(data, p, length) == 0;
}

compiled with gcc 5.1.1 -march=native -O2 on an i3-2310M
we get these timings:

bytes  1	8	16	512	65536
---------------------------------------------
Rusty: 10	28	59	114	6510
Paolo: 9	9	12	75	6495

It's also smaller, especially at -O3:

$ nm -S a.out | grep memeqzero4
... 000000000000005b t memeqzero4_paolo
... 0000000000000063 t memeqzero4_rusty
$ gcc -march=native -O3 memeqzero.c
$ nm -S a.out | grep memeqzero4
... 000000000000005b t memeqzero4_paolo
... 0000000000000133 t memeqzero4_rusty

cheers,
Pádraig.


* Re: [Qemu-devel] [PATCH] copy, dd: simplify and optimize NUL bytes detection
  2015-10-23 11:15                 ` Pádraig Brady
@ 2015-10-24  2:24                   ` Pádraig Brady
  2015-10-25 12:00                     ` Pádraig Brady
  0 siblings, 1 reply; 17+ messages in thread
From: Pádraig Brady @ 2015-10-24  2:24 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Rusty Russell, coreutils, qemu-devel

On 23/10/15 12:15, Pádraig Brady wrote:
> On 22/10/15 20:47, Paolo Bonzini wrote:
>>
>>
>> On 22/10/2015 19:39, Radim Krčmář wrote:
>>> 2015-10-22 18:14+0200, Paolo Bonzini:
>>>> On 22/10/2015 18:02, Eric Blake wrote:
>>>>> I see a bug in there:
>>>>
>>>> Of course.  You shouldn't have told me what the bug was, I deserved
>>>> to look for it myself. :)
>>>
>>> It rather seems that you don't want spoilers, :)
>>>
>>> I see two bugs now.
>>
>> Me too. :)  But Rusty surely has some testcases in case he wants to
>> adopt some of the ideas here. O:-)
> 
> For completeness this should address the bugs I think?
> 
> bool memeqzero4_paolo(const void *data, size_t length)
> {
>     const unsigned char *p = data;
>     unsigned long word;

Note the original coreutils code accessed a word at a time,
but only on aligned buffers. The code below may generate
unaligned accesses for unaligned sizes or buffers.
You can avoid that by automatically falling back to
byte-at-a-time access using:

#if _STRING_ARCH_unaligned
  unsigned long word;
#else
  unsigned char word;
#endif

> 
>     if (!length)
>         return true;
> 
>     /* Check len bytes not aligned on a word.  */
>     while (__builtin_expect(length & (sizeof(word) - 1), 0)) {
>         if (*p)
>             return false;
>         p++;
>         length--;
>         if (!length)
>             return true;
>     }
> 
>     /* Check up to 16 bytes a word at a time.  */
>     for (;;) {
>         memcpy(&word, p, sizeof(word));
>         if (word)
>             return false;
>         p += sizeof(word);
>         length -= sizeof(word);
>         if (!length)
>             return true;
>         if (__builtin_expect(length & 15, 0) == 0)
>             break;
>     }
> 
>      /* Now we know that's zero, memcmp with self. */
>      return memcmp(data, p, length) == 0;
> }

cheers,
Pádraig.


* Re: [Qemu-devel] [PATCH] copy, dd: simplify and optimize NUL bytes detection
  2015-10-24  2:24                   ` Pádraig Brady
@ 2015-10-25 12:00                     ` Pádraig Brady
  0 siblings, 0 replies; 17+ messages in thread
From: Pádraig Brady @ 2015-10-25 12:00 UTC (permalink / raw)
  To: Paolo Bonzini, Radim Krčmář
  Cc: Rusty Russell, coreutils, qemu-devel

On 24/10/15 03:24, Pádraig Brady wrote:
> On 23/10/15 12:15, Pádraig Brady wrote:
>> On 22/10/15 20:47, Paolo Bonzini wrote:
>>>
>>>
>>> On 22/10/2015 19:39, Radim Krčmář wrote:
>>>> 2015-10-22 18:14+0200, Paolo Bonzini:
>>>>> On 22/10/2015 18:02, Eric Blake wrote:
>>>>>> I see a bug in there:
>>>>>
>>>>> Of course.  You shouldn't have told me what the bug was, I deserved
>>>>> to look for it myself. :)
>>>>
>>>> It rather seems that you don't want spoilers, :)
>>>>
>>>> I see two bugs now.
>>>
>>> Me too. :)  But Rusty surely has some testcases in case he wants to
>>> adopt some of the ideas here. O:-)
>>
>> For completeness this should address the bugs I think?
>>
>> bool memeqzero4_paolo(const void *data, size_t length)
>> {
>>     const unsigned char *p = data;
>>     unsigned long word;
> 
> Note the original coreutils code accessed a word at a time,
> but only on aligned buffers. The code below may generate
> unaligned accesses for unaligned sizes or buffers.
> You can avoid that by automatically falling back to
> byte-at-a-time access using:
> 
> #if _STRING_ARCH_unaligned
>   unsigned long word;
> #else
>   unsigned char word;
> #endif

Note _STRING_ARCH_unaligned being defined at the glibc level
doesn't guarantee the compiler won't do something "defined"
when vectorizing code, or perhaps removing sanity checks like:
  #define ALIGNED_POINTER(ptr, type) ((size_t) (ptr) % alignof (type) == 0)
by assuming that the lower bits of the pointer are zero
as that's the only defined operation?

One gets stronger guarantees from a compiler define, like
__ARM_FEATURE_UNALIGNED which gcc -munaligned-access defines for arm,
though we're still in undefined territory and -fsanitize=undefined
will warn about such accesses.  You'd have to explicitly avoid
such warnings with something like:
http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=5760532

Given coreutils only uses is_nul for relatively large buffers,
I'm going to avoid the above complexity, since the focus of this
coreutils patch was simplification, and only enable byte-at-a-time access.

cheers,
Pádraig.
