* Optimizing mmap_queue on AVX/AVX2 CPUs
@ 2017-08-26  1:46 Rebecca Cran
  2017-08-29 15:33 ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Rebecca Cran @ 2017-08-26  1:46 UTC (permalink / raw)
  To: fio

I'm not sure how far we want to get into optimizing fio for specific CPUs?

I've done some testing and found that when running the mmap ioengine 
against an NVDIMM-N on a modern Intel CPU I can gain a few hundred MB/s 
by optimizing the memory copy using avx/avx2 versus the system's memcpy 
implementation.
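
For illustration, the kind of copy loop I mean looks roughly like this
(a minimal sketch, not the actual patch; it assumes 32-byte-aligned
buffers and a length that is a multiple of 32):

#include <immintrin.h>
#include <stddef.h>

/* Illustrative AVX copy: moves 32 bytes per iteration through a ymm
 * register. Assumes dst/src are 32-byte aligned and len is a multiple
 * of 32; compile with -mavx (or -mavx2). */
static void copy_avx(void *dst, const void *src, size_t len)
{
        char *d = dst;
        const char *s = src;
        size_t i;

        for (i = 0; i < len; i += 32) {
                __m256i v = _mm256_load_si256((const __m256i *)(s + i));
                _mm256_store_si256((__m256i *)(d + i), v);
        }
}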


Should I proceed with submitting a patch, or do we want to avoid getting 
into these sorts of optimizations?


-- 
Rebecca


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Optimizing mmap_queue on AVX/AVX2 CPUs
  2017-08-26  1:46 Optimizing mmap_queue on AVX/AVX2 CPUs Rebecca Cran
@ 2017-08-29 15:33 ` Jens Axboe
  2017-08-30 20:57   ` Elliott, Robert (Persistent Memory)
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2017-08-29 15:33 UTC (permalink / raw)
  To: Rebecca Cran, fio

On 08/25/2017 07:46 PM, Rebecca Cran wrote:
> I'm not sure how far we want to get into optimizing fio for specific CPUs?
> 
> I've done some testing and found that when running the mmap ioengine 
> against an NVDIMM-N on a modern Intel CPU I can gain a few hundred MB/s 
> by optimizing the memory copy using avx/avx2 versus the system's memcpy 
> implementation.
> 
> 
> Should I proceed with submitting a patch, or do we want to avoid getting 
> into these sorts of optimizations?

If we can do it cleanly, that's fine. See for instance how we detect
presence of crc32c hw assist at init time.
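
Roughly along these lines (an illustrative sketch, not the actual fio
code):

#include <cpuid.h>
#include <stdbool.h>

/* Sketch of init-time feature detection on x86: the SSE4.2 bit in ECX
 * of CPUID leaf 1 indicates hardware crc32c support. */
static bool cpu_has_crc32c(void)
{
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
                return false;

        return (ecx & bit_SSE4_2) != 0;
}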

For memcpy(), the libc functions should really be doing this, however.

That said, let's see a patch; it's easier to discuss concrete patches
than just ideas.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Optimizing mmap_queue on AVX/AVX2 CPUs
  2017-08-29 15:33 ` Jens Axboe
@ 2017-08-30 20:57   ` Elliott, Robert (Persistent Memory)
  2017-09-06 18:31     ` Rebecca Cran
  0 siblings, 1 reply; 12+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2017-08-30 20:57 UTC (permalink / raw)
  To: Jens Axboe, Rebecca Cran, fio



> -----Original Message-----
> From: fio-owner@vger.kernel.org [mailto:fio-owner@vger.kernel.org] On Behalf
> Of Jens Axboe
> Sent: Tuesday, August 29, 2017 10:33 AM
> To: Rebecca Cran <rebecca@bluestop.org>; fio@vger.kernel.org
> Subject: Re: Optimizing mmap_queue on AVX/AVX2 CPUs
> 
> On 08/25/2017 07:46 PM, Rebecca Cran wrote:
> > I'm not sure how far we want to get into optimizing fio for specific CPUs?
> >
> > I've done some testing and found that when running the mmap ioengine
> > against an NVDIMM-N on a modern Intel CPU I can gain a few hundred MB/s
> > by optimizing the memory copy using avx/avx2 versus the system's memcpy
> > implementation.
> >
> >
> > Should I proceed with submitting a patch, or do we want to avoid getting
> > into these sorts of optimizations?
> 
> If we can do it cleanly, that's fine. See for instance how we detect
> presence of crc32c hw assist at init time.
> 
> For memcpy(), the libc functions should really be doing this, however.

Unfortunately, the glibc memcpy() implementation changes fairly often;
some versions use rep movsb, others have attempted to use xmm, ymm,
and zmm registers.  So, having more control in fio would help simulate
methods (both good and bad) that are used by different applications
and library versions.
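
The rep movsb flavour, for instance, boils down to something like this
(a sketch using GNU C inline assembly on x86-64, not code from glibc
or fio):

#include <stddef.h>

/* One of the strategies glibc has used: let the CPU's "enhanced
 * rep movsb" microcode do the whole copy. */
static void copy_rep_movsb(void *dst, const void *src, size_t len)
{
        asm volatile("rep movsb"
                     : "+D" (dst), "+S" (src), "+c" (len)
                     :
                     : "memory");
}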

There's even a new patch set to use the Intel QuickData DMA engines 
for transfers rather than the CPU (a "blkmq" pmem driver).  It'd be
interesting if fio could use that hardware too (with direct access by
fio, not resorting to kernel read()/write() calls).


---
Robert Elliott, HPE Persistent Memory



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Optimizing mmap_queue on AVX/AVX2 CPUs
  2017-08-30 20:57   ` Elliott, Robert (Persistent Memory)
@ 2017-09-06 18:31     ` Rebecca Cran
  2017-09-06 20:20       ` Sitsofe Wheeler
  0 siblings, 1 reply; 12+ messages in thread
From: Rebecca Cran @ 2017-09-06 18:31 UTC (permalink / raw)
  To: Elliott, Robert (Persistent Memory), Jens Axboe, fio

On 8/30/2017 2:57 PM, Elliott, Robert (Persistent Memory) wrote:
> There's even a new patch set to use the Intel QuickData DMA engines
> for transfers rather than the CPU (a "blkmq" pmem driver).  It'd be
> interesting if fio could use that hardware too (with direct access by
> fio, not resorting to kernel read()/write() calls).

I built the example performance tester program from Intel that compares 
memcpy with QuickData for various buffer and block sizes, and the best 
result was QuickData being the same speed as memcpy; otherwise, 
QuickData was between a tenth and half the speed.
Given that, I'm planning to focus on just adding SSE (not sure about 
this one yet, since all x86_64 systems support it, so memcpy should be 
using it already), AVX, AVX-512 and A64 Advanced SIMD (for ARM64) to FIO.

-- 
Rebecca



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Optimizing mmap_queue on AVX/AVX2 CPUs
  2017-09-06 18:31     ` Rebecca Cran
@ 2017-09-06 20:20       ` Sitsofe Wheeler
  2017-09-06 20:54         ` Rebecca Cran
  0 siblings, 1 reply; 12+ messages in thread
From: Sitsofe Wheeler @ 2017-09-06 20:20 UTC (permalink / raw)
  To: Rebecca Cran; +Cc: Elliott, Robert (Persistent Memory), Jens Axboe, fio

On 6 September 2017 at 19:31, Rebecca Cran <rebecca@bluestop.org> wrote:
> On 8/30/2017 2:57 PM, Elliott, Robert (Persistent Memory) wrote:
>>
>> There's even a new patch set to use the Intel QuickData DMA engines
>> for transfers rather than the CPU (a "blkmq" pmem driver).  It'd be
>> interesting if fio could use that hardware too (with direct access by
>> fio, not resorting to kernel read()/write() calls).
>
>
> I built the example performance tester program from Intel that compares
> memcpy with QuickData for various buffer and block sizes, and the best
> result was QuickData being the same speed as memcpy; otherwise, QuickData
> was between a tenth and half the speed.
> Given that, I'm planning to focus on just adding SSE (not sure about this
> one yet, since all x86_64 systems support it, so memcpy should be using it
> already), AVX, AVX-512 and A64 Advanced SIMD (for ARM64) to FIO.

Does that mean your assembly copy is better than memcpy on generic
data going memory-memory or is it just in relation to copying to
block devices?

-- 
Sitsofe | http://sucs.org/~sits/


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Optimizing mmap_queue on AVX/AVX2 CPUs
  2017-09-06 20:20       ` Sitsofe Wheeler
@ 2017-09-06 20:54         ` Rebecca Cran
  2017-09-07  6:00           ` Sitsofe Wheeler
  0 siblings, 1 reply; 12+ messages in thread
From: Rebecca Cran @ 2017-09-06 20:54 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: Elliott, Robert (Persistent Memory), Jens Axboe, fio


> On Sep 6, 2017, at 2:20 PM, Sitsofe Wheeler <sitsofe@gmail.com> wrote:

> Does that mean your assembly copy is better than memcpy on generic
> data going memory-memory or is it just in relation to copying to
> block devices?

I'm testing memory-based filesystems (mounted with DAX) using the mmap ioengine - either against an NVDIMM-N DDR4 module or on FreeBSD against an md device.

Both my code using assembly intrinsics and standard loops optimized with -ftree-vectorize are faster than generic memcpy.
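
The -ftree-vectorize case is nothing more than a plain loop, roughly
(a minimal sketch):

#include <stddef.h>

/* Plain byte-copy loop; with -O2 -ftree-vectorize (or -O3) the
 * compiler is free to turn this into SSE/AVX loads and stores.
 * The restrict qualifiers tell it the buffers don't overlap. */
static void copy_loop(unsigned char *restrict dst,
                      const unsigned char *restrict src, size_t len)
{
        size_t i;

        for (i = 0; i < len; i++)
                dst[i] = src[i];
}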

-- 
Rebecca


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Optimizing mmap_queue on AVX/AVX2 CPUs
  2017-09-06 20:54         ` Rebecca Cran
@ 2017-09-07  6:00           ` Sitsofe Wheeler
  2017-09-07  6:06             ` Rebecca Cran
  0 siblings, 1 reply; 12+ messages in thread
From: Sitsofe Wheeler @ 2017-09-07  6:00 UTC (permalink / raw)
  To: Rebecca Cran; +Cc: Elliott, Robert (Persistent Memory), Jens Axboe, fio

On 6 September 2017 at 21:54, Rebecca Cran <rebecca@bluestop.org> wrote:
>
>> On Sep 6, 2017, at 2:20 PM, Sitsofe Wheeler <sitsofe@gmail.com> wrote:
>
>> Does that mean your assembly copy is better than memcpy on generic
>> data going memory-memory or is it just in relation to copying to
>> block devices?
>
> I'm testing memory-based filesystems (mounted with DAX) using the mmap ioengine - either against an NVDIMM-N DDR4 module or on FreeBSD against an md device.
>
> Both my code using assembly intrinsics and standard loops optimized with -ftree-vectorize are faster than generic memcpy.

When this gets added will it be possible for fio to have a "memcpy
benchmark" mode where you're able to compare implementations when
using a fixed block size (in a similar way to --crctest) or does this
not make sense because you actually have to be copying to a device to
see the difference?

-- 
Sitsofe | http://sucs.org/~sits/


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Optimizing mmap_queue on AVX/AVX2 CPUs
  2017-09-07  6:00           ` Sitsofe Wheeler
@ 2017-09-07  6:06             ` Rebecca Cran
  2017-09-07  6:28               ` Sitsofe Wheeler
  0 siblings, 1 reply; 12+ messages in thread
From: Rebecca Cran @ 2017-09-07  6:06 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: Elliott, Robert (Persistent Memory), Jens Axboe, fio


> On Sep 7, 2017, at 12:00 AM, Sitsofe Wheeler <sitsofe@gmail.com> wrote:
> 
> When this gets added will it be possible for fio to have a "memcpy
> benchmark" mode where you're able to compare implementations when
> using a fixed block size (in a similar way to --crctest) or does this
> not make sense because you actually have to be copying to a device to
> see the difference?

That does make sense: to see the difference you just need to copy data between areas of memory.
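
As a rough sketch of what such a mode could time (not an existing fio
option - just a standalone illustration with a fixed block size):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Time repeated memcpy() calls between two heap buffers and report
 * throughput, similar in spirit to what --crctest does for checksums. */
int main(void)
{
        const size_t bs = 64 * 1024;    /* fixed block size */
        const int iters = 100000;
        unsigned char *src = malloc(bs), *dst = malloc(bs);
        struct timespec t0, t1;
        double sec;
        int i;

        memset(src, 0xaa, bs);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < iters; i++)
                memcpy(dst, src, bs);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        /* read dst so the copies can't be optimized away entirely */
        printf("%.1f MB/s (dst[0]=%u)\n",
               (double)bs * iters / sec / 1e6, dst[0]);
        free(src);
        free(dst);
        return 0;
}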

-- 
Rebecca 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Optimizing mmap_queue on AVX/AVX2 CPUs
  2017-09-07  6:06             ` Rebecca Cran
@ 2017-09-07  6:28               ` Sitsofe Wheeler
  2017-09-07  6:52                 ` Rebecca Cran
  0 siblings, 1 reply; 12+ messages in thread
From: Sitsofe Wheeler @ 2017-09-07  6:28 UTC (permalink / raw)
  To: Rebecca Cran; +Cc: Elliott, Robert (Persistent Memory), Jens Axboe, fio

On 7 September 2017 at 07:06, Rebecca Cran <rebecca@bluestop.org> wrote:
>
>> On Sep 7, 2017, at 12:00 AM, Sitsofe Wheeler <sitsofe@gmail.com> wrote:
>>
>> When this gets added will it be possible for fio to have a "memcpy
>> benchmark" mode where you're able to compare implementations when
>> using a fixed block size (in a similar way to --crctest) or does this
>> not make sense because you actually have to be copying to a device to
>> see the difference?
>
> That does make sense: to see the difference you just need to copy data between areas of memory.

I can't help but be reminded of Linus' comment over on
https://bugzilla.redhat.com/show_bug.cgi?id=638477#c46 . At any rate I
notice that Agner Fog has an optimised memcpy too over on
http://agner.org/optimize/#asmlib (an older version appears to be
here: https://github.com/lukego/asmlib/blob/master/memcpy64.asm ).

-- 
Sitsofe | http://sucs.org/~sits/


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Optimizing mmap_queue on AVX/AVX2 CPUs
  2017-09-07  6:28               ` Sitsofe Wheeler
@ 2017-09-07  6:52                 ` Rebecca Cran
  2017-09-07 20:24                   ` Sitsofe Wheeler
  0 siblings, 1 reply; 12+ messages in thread
From: Rebecca Cran @ 2017-09-07  6:52 UTC (permalink / raw)
  To: Sitsofe Wheeler; +Cc: Elliott, Robert (Persistent Memory), Jens Axboe, fio


> On Sep 7, 2017, at 12:28 AM, Sitsofe Wheeler <sitsofe@gmail.com> wrote:
> 
>> On 7 September 2017 at 07:06, Rebecca Cran <rebecca@bluestop.org> wrote:
>> That does make sense: to see the difference you just need to copy data between areas of memory.
> 
> I can't help but be reminded of Linus' comment over on
> https://bugzilla.redhat.com/show_bug.cgi?id=638477#c46 .

Hmm, are you suggesting by that it's not something we should try and optimize within fio?

I can totally understand that, and I'd be willing to put off any further work on this until/if we run into issues testing the performance of future NVDIMM-P (i.e. Storage Class Memory) devices.

-- 
Rebecca 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Optimizing mmap_queue on AVX/AVX2 CPUs
  2017-09-07  6:52                 ` Rebecca Cran
@ 2017-09-07 20:24                   ` Sitsofe Wheeler
  2017-09-11 23:03                     ` Sitsofe Wheeler
  0 siblings, 1 reply; 12+ messages in thread
From: Sitsofe Wheeler @ 2017-09-07 20:24 UTC (permalink / raw)
  To: Rebecca Cran; +Cc: Elliott, Robert (Persistent Memory), Jens Axboe, fio

On 7 September 2017 at 07:52, Rebecca Cran <rebecca@bluestop.org> wrote:
>
>> On Sep 7, 2017, at 12:28 AM, Sitsofe Wheeler <sitsofe@gmail.com> wrote:
>>
>>> On 7 September 2017 at 07:06, Rebecca Cran <rebecca@bluestop.org> wrote:
>>> That does make sense: to see the difference you just need to copy data between areas of memory.
>>
>> I can't help but be reminded of Linus' comment over on
>> https://bugzilla.redhat.com/show_bug.cgi?id=638477#c46 .
>
> Hmm, are you suggesting by that it's not something we should try and optimize within fio?

No, the opposite - that a non-libc memcpy may outperform the libc one
(even if it looks simpler in some cases)!

> I can totally understand that, and I'd be willing to put off any further work on this until/if we run into issues testing the performance of future NVDIMM-P (i.e. Storage Class Memory) devices.

It's not my intent to put you off - all your ideas sound good!

-- 
Sitsofe | http://sucs.org/~sits/


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Optimizing mmap_queue on AVX/AVX2 CPUs
  2017-09-07 20:24                   ` Sitsofe Wheeler
@ 2017-09-11 23:03                     ` Sitsofe Wheeler
  0 siblings, 0 replies; 12+ messages in thread
From: Sitsofe Wheeler @ 2017-09-11 23:03 UTC (permalink / raw)
  To: Rebecca Cran; +Cc: Elliott, Robert (Persistent Memory), Jens Axboe, fio

On 7 September 2017 at 21:24, Sitsofe Wheeler <sitsofe@gmail.com> wrote:
> On 7 September 2017 at 07:52, Rebecca Cran <rebecca@bluestop.org> wrote:
>>
>>> On Sep 7, 2017, at 12:28 AM, Sitsofe Wheeler <sitsofe@gmail.com> wrote:
>>>
>>>> On 7 September 2017 at 07:06, Rebecca Cran <rebecca@bluestop.org> wrote:
>>>> That does make sense: to see the difference you just need to copy data between areas of memory.
>>>
>>> I can't help but be reminded of Linus' comment over on
>>> https://bugzilla.redhat.com/show_bug.cgi?id=638477#c46 .
>>
>> Hmm, are you suggesting by that it's not something we should try and optimize within fio?
>
> No, the opposite - that a non-libc memcpy may outperform the libc one
> (even if it looks simpler in some cases)!
>
>> I can totally understand that, and I'd be willing to put off any further work on this until/if we run into issues testing the performance of future NVDIMM-P (i.e. Storage Class Memory) devices.
>
> It's not my intent to put you off - all your ideas sound good!

A faster memcpy looks like something of a holy grail and there are all
sorts of replacements floating around the net. The most interesting
thing I've come across so far is that sometimes memmove is faster than
memcpy. It seems very system-dependent, but here's what the Eigen project
did: https://bitbucket.org/eigen/eigen/pull-requests/292/adds-a-fast-memcpy-function-to-eigen/diff
.

-- 
Sitsofe | http://sucs.org/~sits/


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2017-09-11 23:03 UTC | newest]

Thread overview: 12+ messages
2017-08-26  1:46 Optimizing mmap_queue on AVX/AVX2 CPUs Rebecca Cran
2017-08-29 15:33 ` Jens Axboe
2017-08-30 20:57   ` Elliott, Robert (Persistent Memory)
2017-09-06 18:31     ` Rebecca Cran
2017-09-06 20:20       ` Sitsofe Wheeler
2017-09-06 20:54         ` Rebecca Cran
2017-09-07  6:00           ` Sitsofe Wheeler
2017-09-07  6:06             ` Rebecca Cran
2017-09-07  6:28               ` Sitsofe Wheeler
2017-09-07  6:52                 ` Rebecca Cran
2017-09-07 20:24                   ` Sitsofe Wheeler
2017-09-11 23:03                     ` Sitsofe Wheeler
