qemu-devel.nongnu.org archive mirror
* Prefetches in buffer_zero_*
@ 2021-07-22 10:02 Dr. David Alan Gilbert
  2021-07-22 18:53 ` Richard Henderson
  0 siblings, 1 reply; 7+ messages in thread
From: Dr. David Alan Gilbert @ 2021-07-22 10:02 UTC (permalink / raw)
  To: richard.henderson; +Cc: qemu-devel, jmario

Hi Richard,
  I think you were the last person to fiddle with the prefetching
in buffer_zero_avx2 and friends; Joe (cc'd) wondered if explicit
prefetching still made sense on modern CPUs, given that their hardware
generally figures out simple strided accesses better on its own.

  What was your thinking on this, and did you actually measure
any improvement?

Dave

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Prefetches in buffer_zero_*
  2021-07-22 10:02 Prefetches in buffer_zero_* Dr. David Alan Gilbert
@ 2021-07-22 18:53 ` Richard Henderson
  2021-07-22 19:13   ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Henderson @ 2021-07-22 18:53 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: qemu-devel, jmario

On 7/22/21 12:02 AM, Dr. David Alan Gilbert wrote:
> Hi Richard,
>    I think you were the last person to fiddle with the prefetching
> in buffer_zero_avx2 and friends; Joe (cc'd) wondered if explicit
> prefetching still made sense on modern CPUs, given that their hardware
> generally figures out simple strided accesses better on its own.
> 
>    What was your thinking on this, and did you actually measure
> any improvement?

Ah, well, that was 5 years ago so I have no particular memory of this.  It wouldn't 
surprise me if you can't measure any improvement on modern hardware.

Do you now measure an improvement with the prefetches gone?


r~



* Re: Prefetches in buffer_zero_*
  2021-07-22 18:53 ` Richard Henderson
@ 2021-07-22 19:13   ` Dr. David Alan Gilbert
  2021-07-24  1:43     ` Joe Mario
  0 siblings, 1 reply; 7+ messages in thread
From: Dr. David Alan Gilbert @ 2021-07-22 19:13 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, jmario

* Richard Henderson (richard.henderson@linaro.org) wrote:
> On 7/22/21 12:02 AM, Dr. David Alan Gilbert wrote:
> > Hi Richard,
> >    I think you were the last person to fiddle with the prefetching
> > in buffer_zero_avx2 and friends; Joe (cc'd) wondered if explicit
> > prefetching still made sense on modern CPUs, given that their hardware
> > generally figures out simple strided accesses better on its own.
> > 
> >    What was your thinking on this, and did you actually measure
> > any improvement?
> 
> Ah, well, that was 5 years ago so I have no particular memory of this.  It
> wouldn't surprise me if you can't measure any improvement on modern
> hardware.
> 
> Do you now measure an improvement with the prefetches gone?

I've not tried; it just came from Joe's suggestion that it was generally
a bad idea these days. I do remember that the behaviour of those functions
is quite tricky because their performance is VERY data dependent - many
VMs actually have pages that are quite dirty so you never iterate the
loop, but then you hit others with big zero pages and you spend your
entire life in the loop.

Dave
> 
> r~
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: Prefetches in buffer_zero_*
  2021-07-22 19:13   ` Dr. David Alan Gilbert
@ 2021-07-24  1:43     ` Joe Mario
  2021-07-26  8:47       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 7+ messages in thread
From: Joe Mario @ 2021-07-24  1:43 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Richard Henderson, qemu-devel



On Thu, Jul 22, 2021 at 3:14 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
wrote:

> * Richard Henderson (richard.henderson@linaro.org) wrote:
> > On 7/22/21 12:02 AM, Dr. David Alan Gilbert wrote:
> > > Hi Richard,
> > >    I think you were the last person to fiddle with the prefetching
> > > in buffer_zero_avx2 and friends; Joe (cc'd) wondered if explicit
> > > prefetching still made sense on modern CPUs, given that their hardware
> > > generally figures out simple strided accesses better on its own.
> > >
> > >    What was your thinking on this, and did you actually measure
> > > any improvement?
> >
> > Ah, well, that was 5 years ago so I have no particular memory of this.
> It
> > wouldn't surprise me if you can't measure any improvement on modern
> > hardware.
> >
> > Do you now measure an improvement with the prefetches gone?
>
> I've not tried; it just came from Joe's suggestion that it was generally
> a bad idea these days. I do remember that the behaviour of those functions
> is quite tricky because their performance is VERY data dependent - many
> VMs actually have pages that are quite dirty so you never iterate the
> loop, but then you hit others with big zero pages and you spend your
> entire life in the loop.
>
>
Dave, Richard:
My curiosity got the best of me.  So I created a small test program that
used the buffer_zero_avx2() routine from qemu's bufferiszero.c.

When I run it on an Intel Cascade Lake processor, the cost of calling
"__builtin_prefetch(p)" is in the noise range.  It's always "just
slightly" slower.  I doubt it could ever be measured in qemu.

Ironically, when I disabled the hardware prefetchers, the program slowed
down over 33%.  And the call to "__builtin_prefetch(p)" actually hurt
performance by over 3%.

My results are below, (only with the hardware prefetchers enabled).  The
program is attached.
Joe

# gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do
./a.out; done
TSC 356144 Kcycles.
TSC 356714 Kcycles.
TSC 356707 Kcycles.
TSC 356565 Kcycles.
TSC 356853 Kcycles.
# gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
TSC 355520 Kcycles.
TSC 355961 Kcycles.
TSC 355872 Kcycles.
TSC 355948 Kcycles.
TSC 355918 Kcycles.

Dave
> >
> > r~
> >
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
>


[-- Attachment #2: buffer_zero_avx.c --]
[-- Type: text/x-csrc, Size: 3166 bytes --]

/*
 * Simple program to test if a prefetch helps or hurts buffer_zero_avx2.
 *
 * Compile with either:
 *  gcc -mavx buffer_zero_avx.c -O 
 * or
 *  gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH 
 */

#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>
#include <string.h>
#include <stdlib.h>     /* for exit() */

#define likely(x)       __builtin_expect((x),1)
#define unlikely(x)     __builtin_expect((x),0)

static __inline__ u_int64_t start_clock();
static __inline__ u_int64_t stop_clock();
static int buffer_zero_avx2(const void *buf, size_t len);

/*
 * Allocate a large chunk of anon memory, touch/zero it,
 * and then time the call to buffer_zero_avx2().
 */
int main() 
{
   long i;
   size_t mmap_len = 2UL*1024*1024*1024;
   char *ptr = mmap(NULL, mmap_len,
       PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0L);

   if (ptr == MAP_FAILED) {
       perror(" mmap");
       exit(1);
   }

   // Touch the pages (they're already cleared)
   memset(ptr,0x0,mmap_len);

   u_int64_t start_rdtsc = start_clock();

   buffer_zero_avx2(ptr, mmap_len);

   u_int64_t stop_rdtsc = stop_clock();
   u_int64_t diff = stop_rdtsc - start_rdtsc;

   printf("TSC %ld Kcycles. \n", diff/1000);

}

static int 
buffer_zero_avx2(const void *buf, size_t len)
{
    /* Begin with an unaligned head of 32 bytes.  */
    __m256i t = _mm256_loadu_si256(buf);
    __m256i *p = (__m256i *)(((uintptr_t)buf + 5 * 32) & -32);
    __m256i *e = (__m256i *)(((uintptr_t)buf + len) & -32);

    if (likely(p <= e)) {
        /* Loop over 32-byte aligned blocks of 128.  */
        do {
#ifdef DO_PREFETCH
             __builtin_prefetch(p);
#endif
            if (unlikely(!_mm256_testz_si256(t, t))) {
                printf("In unlikely buffer_zero, p:%p \n", (void *)p);
                return 0;
            }
            t = p[-4] | p[-3] | p[-2] | p[-1];
            p += 4;
        } while (p <= e);
    } else {
        t |= _mm256_loadu_si256(buf + 32);
        if (len <= 128) {
            goto last2;
        }
    }

    /* Finish the last block of 128 unaligned.  */
    t |= _mm256_loadu_si256(buf + len - 4 * 32);
    t |= _mm256_loadu_si256(buf + len - 3 * 32);
last2:
    t |= _mm256_loadu_si256(buf + len - 2 * 32);
    t |= _mm256_loadu_si256(buf + len - 1 * 32);
  
    // printf("End of buffer_zero_avx2\n");
    return _mm256_testz_si256(t, t);
}

static __inline__ u_int64_t 
start_clock() {
    // See: Intel Doc #324264, "How to Benchmark Code Execution Times on Intel...",
    u_int32_t hi, lo;
    __asm__ __volatile__ (
        "CPUID\n\t"
        "RDTSC\n\t"
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t": "=r" (hi), "=r" (lo)::
        "%rax", "%rbx", "%rcx", "%rdx");
    return ( (u_int64_t)lo) | ( ((u_int64_t)hi) << 32);
}

static __inline__ u_int64_t 
stop_clock() {
    // See: Intel Doc #324264, "How to Benchmark Code Execution Times on Intel...",
    u_int32_t hi, lo;
    __asm__ __volatile__(
        "RDTSCP\n\t"
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        "CPUID\n\t": "=r" (hi), "=r" (lo)::
        "%rax", "%rbx", "%rcx", "%rdx");
    return ( (u_int64_t)lo) | ( ((u_int64_t)hi) << 32);
}




* Re: Prefetches in buffer_zero_*
  2021-07-24  1:43     ` Joe Mario
@ 2021-07-26  8:47       ` Dr. David Alan Gilbert
  2021-07-26 11:31         ` Philippe Mathieu-Daudé
  0 siblings, 1 reply; 7+ messages in thread
From: Dr. David Alan Gilbert @ 2021-07-26  8:47 UTC (permalink / raw)
  To: Joe Mario; +Cc: Richard Henderson, qemu-devel

* Joe Mario (jmario@redhat.com) wrote:
> On Thu, Jul 22, 2021 at 3:14 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
> wrote:
> 
> > * Richard Henderson (richard.henderson@linaro.org) wrote:
> > > On 7/22/21 12:02 AM, Dr. David Alan Gilbert wrote:
> > > > Hi Richard,
> > > >    I think you were the last person to fiddle with the prefetching
> > > > in buffer_zero_avx2 and friends; Joe (cc'd) wondered if explicit
> > > > prefetching still made sense on modern CPUs, given that their hardware
> > > > generally figures out simple strided accesses better on its own.
> > > >
> > > >    What was your thinking on this, and did you actually measure
> > > > any improvement?
> > >
> > > Ah, well, that was 5 years ago so I have no particular memory of this.
> > It
> > > wouldn't surprise me if you can't measure any improvement on modern
> > > hardware.
> > >
> > > Do you now measure an improvement with the prefetches gone?
> >
> > I've not tried; it just came from Joe's suggestion that it was generally
> > a bad idea these days. I do remember that the behaviour of those functions
> > is quite tricky because their performance is VERY data dependent - many
> > VMs actually have pages that are quite dirty so you never iterate the
> > loop, but then you hit others with big zero pages and you spend your
> > entire life in the loop.
> >
> >
> Dave, Richard:
> My curiosity got the best of me.  So I created a small test program that
> used the buffer_zero_avx2() routine from qemu's bufferiszero.c.

Thanks for testing,

> When I run it on an Intel Cascade Lake processor, the cost of calling
> "__builtin_prefetch(p)" is in the noise range.  It's always "just
> slightly" slower.  I doubt it could ever be measured in qemu.
> 
> Ironically, when I disabled the hardware prefetchers, the program slowed
> down over 33%.  And the call to "__builtin_prefetch(p)" actually hurt
> performance by over 3%.

Yeh that's a bit odd.

> My results are below, (only with the hardware prefetchers enabled).  The
> program is attached.
> Joe
> 
> # gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do
> ./a.out; done
> TSC 356144 Kcycles.
> TSC 356714 Kcycles.
> TSC 356707 Kcycles.
> TSC 356565 Kcycles.
> TSC 356853 Kcycles.
> # gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
> TSC 355520 Kcycles.
> TSC 355961 Kcycles.
> TSC 355872 Kcycles.
> TSC 355948 Kcycles.
> TSC 355918 Kcycles.

This basically agrees with the machines I've just tried your test on -
*except* AMD EPYC 7302P's - that really like the prefetch:

[root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do ./a.out; done
TSC 322162 Kcycles.
TSC 321861 Kcycles. 
TSC 322212 Kcycles. 
TSC 321957 Kcycles.
TSC 322085 Kcycles. 
 
[root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
TSC 377988 Kcycles. 
TSC 380125 Kcycles. 
TSC 379440 Kcycles.
TSC 379689 Kcycles. 
TSC 379571 Kcycles. 
 
The 1st gen EPYCs don't seem to see much difference with/without it.

Probably best to leave this code as is!

Dave


> Dave
> > >
> > > r~
> > >
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
> >

> /*
>  * Simple program to test if a prefetch helps or hurts buffer_zero_avx2.
>  *
>  * Compile with either:
>  *  gcc -mavx buffer_zero_avx.c -O 
>  * or
>  *  gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH 
>  */
> 
> #include <immintrin.h>
> #include <stdio.h>
> #include <stdint.h>
> #include <stddef.h>
> #include <sys/mman.h>
> #include <string.h>
> #include <stdlib.h>     /* for exit() */
> 
> #define likely(x)       __builtin_expect((x),1)
> #define unlikely(x)     __builtin_expect((x),0)
> 
> static __inline__ u_int64_t start_clock();
> static __inline__ u_int64_t stop_clock();
> static int buffer_zero_avx2(const void *buf, size_t len);
> 
> /*
>  * Allocate a large chunk of anon memory, touch/zero it,
>  * and then time the call to buffer_zero_avx2().
>  */
> int main() 
> {
>    long i;
>    size_t mmap_len = 2UL*1024*1024*1024;
>    char *ptr = mmap(NULL, mmap_len,
>        PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0L);
> 
>    if (ptr == MAP_FAILED) {
>        perror(" mmap");
>        exit(1);
>    }
> 
>    // Touch the pages (they're already cleared)
>    memset(ptr,0x0,mmap_len);
> 
>    u_int64_t start_rdtsc = start_clock();
> 
>    buffer_zero_avx2(ptr, mmap_len);
> 
>    u_int64_t stop_rdtsc = stop_clock();
>    u_int64_t diff = stop_rdtsc - start_rdtsc;
> 
>    printf("TSC %ld Kcycles. \n", diff/1000);
> 
> }
> 
> static int 
> buffer_zero_avx2(const void *buf, size_t len)
> {
>     /* Begin with an unaligned head of 32 bytes.  */
>     __m256i t = _mm256_loadu_si256(buf);
>     __m256i *p = (__m256i *)(((uintptr_t)buf + 5 * 32) & -32);
>     __m256i *e = (__m256i *)(((uintptr_t)buf + len) & -32);
> 
>     if (likely(p <= e)) {
>         /* Loop over 32-byte aligned blocks of 128.  */
>         do {
> #ifdef DO_PREFETCH
>              __builtin_prefetch(p);
> #endif
>             if (unlikely(!_mm256_testz_si256(t, t))) {
>                 printf("In unlikely buffer_zero, p:%p \n", (void *)p);
>                 return 0;
>             }
>             t = p[-4] | p[-3] | p[-2] | p[-1];
>             p += 4;
>         } while (p <= e);
>     } else {
>         t |= _mm256_loadu_si256(buf + 32);
>         if (len <= 128) {
>             goto last2;
>         }
>     }
> 
>     /* Finish the last block of 128 unaligned.  */
>     t |= _mm256_loadu_si256(buf + len - 4 * 32);
>     t |= _mm256_loadu_si256(buf + len - 3 * 32);
> last2:
>     t |= _mm256_loadu_si256(buf + len - 2 * 32);
>     t |= _mm256_loadu_si256(buf + len - 1 * 32);
>   
>     // printf("End of buffer_zero_avx2\n");
>     return _mm256_testz_si256(t, t);
> }
> 
> static __inline__ u_int64_t 
> start_clock() {
>     // See: Intel Doc #324264, "How to Benchmark Code Execution Times on Intel...",
>     u_int32_t hi, lo;
>     __asm__ __volatile__ (
>         "CPUID\n\t"
>         "RDTSC\n\t"
>         "mov %%edx, %0\n\t"
>         "mov %%eax, %1\n\t": "=r" (hi), "=r" (lo)::
>         "%rax", "%rbx", "%rcx", "%rdx");
>     return ( (u_int64_t)lo) | ( ((u_int64_t)hi) << 32);
> }
> 
> static __inline__ u_int64_t 
> stop_clock() {
>     // See: Intel Doc #324264, "How to Benchmark Code Execution Times on Intel...",
>     u_int32_t hi, lo;
>     __asm__ __volatile__(
>         "RDTSCP\n\t"
>         "mov %%edx, %0\n\t"
>         "mov %%eax, %1\n\t"
>         "CPUID\n\t": "=r" (hi), "=r" (lo)::
>         "%rax", "%rbx", "%rcx", "%rdx");
>     return ( (u_int64_t)lo) | ( ((u_int64_t)hi) << 32);
> }
> 
> 

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: Prefetches in buffer_zero_*
  2021-07-26  8:47       ` Dr. David Alan Gilbert
@ 2021-07-26 11:31         ` Philippe Mathieu-Daudé
  2021-07-26 12:07           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 7+ messages in thread
From: Philippe Mathieu-Daudé @ 2021-07-26 11:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Joe Mario
  Cc: Lukáš Doktor, Richard Henderson, qemu-devel

+Lukáš

On 7/26/21 10:47 AM, Dr. David Alan Gilbert wrote:
> * Joe Mario (jmario@redhat.com) wrote:
>> On Thu, Jul 22, 2021 at 3:14 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
>> wrote:
>>
>>> * Richard Henderson (richard.henderson@linaro.org) wrote:
>>>> On 7/22/21 12:02 AM, Dr. David Alan Gilbert wrote:
>>>>> Hi Richard,
>>>>>    I think you were the last person to fiddle with the prefetching
>>>>> in buffer_zero_avx2 and friends; Joe (cc'd) wondered if explicit
>>>>> prefetching still made sense on modern CPUs, given that their hardware
>>>>> generally figures out simple strided accesses better on its own.
>>>>>
>>>>>    What was your thinking on this, and did you actually measure
>>>>> any improvement?
>>>>
>>>> Ah, well, that was 5 years ago so I have no particular memory of this.
>>> It
>>>> wouldn't surprise me if you can't measure any improvement on modern
>>>> hardware.
>>>>
>>>> Do you now measure an improvement with the prefetches gone?
>>>
>>> I've not tried; it just came from Joe's suggestion that it was generally
>>> a bad idea these days. I do remember that the behaviour of those functions
>>> is quite tricky because their performance is VERY data dependent - many
>>> VMs actually have pages that are quite dirty so you never iterate the
>>> loop, but then you hit others with big zero pages and you spend your
>>> entire life in the loop.
>>>
>>>
>> Dave, Richard:
>> My curiosity got the best of me.  So I created a small test program that
>> used the buffer_zero_avx2() routine from qemu's bufferiszero.c.
> 
> Thanks for testing,
> 
>> When I run it on an Intel Cascade Lake processor, the cost of calling
>> "__builtin_prefetch(p)" is in the noise range.  It's always "just
>> slightly" slower.  I doubt it could ever be measured in qemu.
>>
>> Ironically, when I disabled the hardware prefetchers, the program slowed
>> down over 33%.  And the call to "__builtin_prefetch(p)" actually hurt
>> performance by over 3%.
> 
> Yeh that's a bit odd.
> 
>> My results are below, (only with the hardware prefetchers enabled).  The
>> program is attached.
>> Joe
>>
>> # gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do
>> ./a.out; done
>> TSC 356144 Kcycles.
>> TSC 356714 Kcycles.
>> TSC 356707 Kcycles.
>> TSC 356565 Kcycles.
>> TSC 356853 Kcycles.
>> # gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
>> TSC 355520 Kcycles.
>> TSC 355961 Kcycles.
>> TSC 355872 Kcycles.
>> TSC 355948 Kcycles.
>> TSC 355918 Kcycles.
> 
> This basically agrees with the machines I've just tried your test on -
> *except* AMD EPYC 7302P's - that really like the prefetch:
> 
> [root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do ./a.out; done
> TSC 322162 Kcycles.
> TSC 321861 Kcycles. 
> TSC 322212 Kcycles. 
> TSC 321957 Kcycles.
> TSC 322085 Kcycles. 
>  
> [root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
> TSC 377988 Kcycles. 
> TSC 380125 Kcycles. 
> TSC 379440 Kcycles.
> TSC 379689 Kcycles. 
> TSC 379571 Kcycles. 
>  
> The 1st gen EPYCs don't seem to see much difference with/without it.
> 
> Probably best to leave this code as is!

Regardless of whether the code changes or not, it would be
nice to have this test committed to the repository so we can run
performance regression testing from time to time.

>> /*
>>  * Simple program to test if a prefetch helps or hurts buffer_zero_avx2.
>>  *
>>  * Compile with either:
>>  *  gcc -mavx buffer_zero_avx.c -O 
>>  * or
>>  *  gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH 
>>  */
>>
[...]




* Re: Prefetches in buffer_zero_*
  2021-07-26 11:31         ` Philippe Mathieu-Daudé
@ 2021-07-26 12:07           ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 7+ messages in thread
From: Dr. David Alan Gilbert @ 2021-07-26 12:07 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: Lukáš Doktor, Richard Henderson, qemu-devel, Joe Mario

* Philippe Mathieu-Daudé (philmd@redhat.com) wrote:
> +Lukáš
> 
> On 7/26/21 10:47 AM, Dr. David Alan Gilbert wrote:
> > * Joe Mario (jmario@redhat.com) wrote:
> >> On Thu, Jul 22, 2021 at 3:14 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
> >> wrote:
> >>
> >>> * Richard Henderson (richard.henderson@linaro.org) wrote:
> >>>> On 7/22/21 12:02 AM, Dr. David Alan Gilbert wrote:
> >>>>> Hi Richard,
> >>>>>    I think you were the last person to fiddle with the prefetching
> >>>>> in buffer_zero_avx2 and friends; Joe (cc'd) wondered if explicit
> >>>>> prefetching still made sense on modern CPUs, given that their hardware
> >>>>> generally figures out simple strided accesses better on its own.
> >>>>>
> >>>>>    What was your thinking on this, and did you actually measure
> >>>>> any improvement?
> >>>>
> >>>> Ah, well, that was 5 years ago so I have no particular memory of this.
> >>> It
> >>>> wouldn't surprise me if you can't measure any improvement on modern
> >>>> hardware.
> >>>>
> >>>> Do you now measure an improvement with the prefetches gone?
> >>>
> >>> I've not tried; it just came from Joe's suggestion that it was generally
> >>> a bad idea these days. I do remember that the behaviour of those functions
> >>> is quite tricky because their performance is VERY data dependent - many
> >>> VMs actually have pages that are quite dirty so you never iterate the
> >>> loop, but then you hit others with big zero pages and you spend your
> >>> entire life in the loop.
> >>>
> >>>
> >> Dave, Richard:
> >> My curiosity got the best of me.  So I created a small test program that
> >> used the buffer_zero_avx2() routine from qemu's bufferiszero.c.
> > 
> > Thanks for testing,
> > 
> >> When I run it on an Intel Cascade Lake processor, the cost of calling
> >> "__builtin_prefetch(p)" is in the noise range.  It's always "just
> >> slightly" slower.  I doubt it could ever be measured in qemu.
> >>
> >> Ironically, when I disabled the hardware prefetchers, the program slowed
> >> down over 33%.  And the call to "__builtin_prefetch(p)" actually hurt
> >> performance by over 3%.
> > 
> > Yeh that's a bit odd.
> > 
> >> My results are below, (only with the hardware prefetchers enabled).  The
> >> program is attached.
> >> Joe
> >>
> >> # gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do
> >> ./a.out; done
> >> TSC 356144 Kcycles.
> >> TSC 356714 Kcycles.
> >> TSC 356707 Kcycles.
> >> TSC 356565 Kcycles.
> >> TSC 356853 Kcycles.
> >> # gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
> >> TSC 355520 Kcycles.
> >> TSC 355961 Kcycles.
> >> TSC 355872 Kcycles.
> >> TSC 355948 Kcycles.
> >> TSC 355918 Kcycles.
> > 
> > This basically agrees with the machines I've just tried your test on -
> > *except* AMD EPYC 7302P's - that really like the prefetch:
> > 
> > [root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do ./a.out; done
> > TSC 322162 Kcycles.
> > TSC 321861 Kcycles. 
> > TSC 322212 Kcycles. 
> > TSC 321957 Kcycles.
> > TSC 322085 Kcycles. 
> >  
> > [root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
> > TSC 377988 Kcycles. 
> > TSC 380125 Kcycles. 
> > TSC 379440 Kcycles.
> > TSC 379689 Kcycles. 
> > TSC 379571 Kcycles. 
> >  
> > The 1st gen EPYCs don't seem to see much difference with/without it.
> > 
> > Probably best to leave this code as is!
> 
> Regardless of whether the code changes or not, it would be
> nice to have this test committed to the repository so we can run
> performance regression testing from time to time.

It could be, although this is a slightly odd microtest for that; it's a bit
specific (the avx2 variant, and only really testing the all-zero case).


Dave

> >> /*
> >>  * Simple program to test if a prefetch helps or hurts buffer_zero_avx2.
> >>  *
> >>  * Compile with either:
> >>  *  gcc -mavx buffer_zero_avx.c -O 
> >>  * or
> >>  *  gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH 
> >>  */
> >>
> [...]
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




end of thread, other threads:[~2021-07-26 12:08 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-22 10:02 Prefetches in buffer_zero_* Dr. David Alan Gilbert
2021-07-22 18:53 ` Richard Henderson
2021-07-22 19:13   ` Dr. David Alan Gilbert
2021-07-24  1:43     ` Joe Mario
2021-07-26  8:47       ` Dr. David Alan Gilbert
2021-07-26 11:31         ` Philippe Mathieu-Daudé
2021-07-26 12:07           ` Dr. David Alan Gilbert
