ARM NEON optimisations for gf-complete/jerasure/ceph-erasure

* ARM NEON optimisations for gf-complete/jerasure/ceph-erasure
@ 2014-09-04 14:42 Janne Grunau
  2014-09-04 15:21 ` Loic Dachary
  2014-09-04 15:57 ` Loic Dachary
  0 siblings, 2 replies; 7+ messages in thread
From: Janne Grunau @ 2014-09-04 14:42 UTC
  To: ceph-devel

Hi,

I've started writing ARM/AArch64 NEON optimizations for gf-complete.  
http://git.jannau.net/gf-complete.git/log/?h=neon has proof of concept 
AArch64 NEON optimisations for w8.

The methods implemented so far are carry-less/polynomial multiplication 
and the split table. The polynomial multiplication is reasonably fast 
for region multiplications (~2000MB/s on an Apple A7 at 1.3GHz) since 
NEON has an 8-bit to 16-bit SIMD polynomial multiplication.
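
For readers unfamiliar with the technique, here is a scalar C sketch of 
what one lane of such a polynomial multiply computes, followed by the 
reduction back to 8 bits. It is an illustration only, not code from the 
neon branch. It assumes the w8 field polynomial is 0x11d; the names 
clmul8 and gf8_mul are hypothetical. On NEON the 8x8 -> 16 bit step would 
map to the vmull_p8 intrinsic, applied to eight lanes at once.

```c
#include <stdint.h>

/* Carry-less 8x8 -> 16 bit multiply: partial products are combined with
 * XOR instead of addition. Scalar model of one SIMD lane. */
static uint16_t clmul8(uint8_t a, uint8_t b)
{
    uint16_t r = 0;
    for (int i = 0; i < 8; i++)
        if (b & (1u << i))
            r ^= (uint16_t)((uint16_t)a << i);
    return r;
}

/* GF(2^8) multiply, assuming the polynomial x^8+x^4+x^3+x^2+1 (0x11d).
 * Since x^8 == 0x1d (mod 0x11d), the high byte of the product is folded
 * back by multiplying it with 0x1d. For this polynomial two folds are
 * enough: the first leaves at most degree 10, the second at most 6. */
static uint8_t gf8_mul(uint8_t a, uint8_t b)
{
    const uint8_t poly_lo = 0x1d; /* assumed polynomial without the x^8 term */
    uint16_t p = clmul8(a, b);

    p = (uint16_t)((p & 0xff) ^ clmul8((uint8_t)(p >> 8), poly_lo));
    p = (uint16_t)((p & 0xff) ^ clmul8((uint8_t)(p >> 8), poly_lo));
    return (uint8_t)p;
}
```

Under that assumption, gf8_mul(0x80, 2) yields 0x1d and gf8_mul(3, 7) 
yields 9. The reduction itself is two more polynomial multiplies, which 
is why the method is cheap wherever such an instruction exists.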

The split table method is still faster though, at 5700MB/s on the same 
CPU. I'm actually surprised by that, since it is faster (per cycle) than 
the Core i7-3770 result from gf-complete's manual (page 14). That 
suggests that the SSE3 code might not be optimal.
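
As a rough illustration of the split table idea for w8 (again a scalar 
sketch, not the branch's code): multiplication by a constant c is linear 
over XOR, so c*x equals c*(low nibble of x) XOR c*(high nibble of x << 4). 
Two 16-entry tables per constant are enough, and a SIMD byte table 
lookup (for example vqtbl1q_u8 on AArch64) can process 16 bytes per 
lookup. The names below are hypothetical and the polynomial 0x11d is an 
assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference GF(2^8) multiply (shift-and-add), assuming polynomial 0x11d. */
static uint8_t gf8_mul_ref(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        b >>= 1;
        /* multiply a by x and reduce if bit 7 was set */
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
    }
    return r;
}

/* Products of one constant with every low and every high nibble value. */
typedef struct {
    uint8_t lo[16]; /* c * i        */
    uint8_t hi[16]; /* c * (i << 4) */
} gf8_split_tables;

static void gf8_split_init(gf8_split_tables *t, uint8_t c)
{
    for (int i = 0; i < 16; i++) {
        t->lo[i] = gf8_mul_ref(c, (uint8_t)i);
        t->hi[i] = gf8_mul_ref(c, (uint8_t)(i << 4));
    }
}

/* dst[i] = c * src[i]; each byte costs two table lookups and one XOR. */
static void gf8_split_mul_region(const gf8_split_tables *t,
                                 const uint8_t *src, uint8_t *dst,
                                 size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = (uint8_t)(t->lo[src[i] & 0x0f] ^ t->hi[src[i] >> 4]);
}
```

The tables are built once per constant, so the setup cost is amortised 
over the whole region.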

I'm currently working on integrating NEON into the build system and will 
then extend the existing code to work on ARMv7-A too. Those two are 
straightforward. There are a couple of other issues I would like to 
discuss before I start to work on them.

The #if/#ifdefs in the source start to make it hard to read when more 
than one optimization is added. Separating arch-specific 
implementations from each other and from the generic implementation 
works reasonably well for the multimedia-related projects I have 
experience with (libav/FFmpeg, x264). There would be arch-specific init 
functions which set the appropriate function pointers. The NEON 
optimisations would then live in w8_arm.c, which would only be compiled 
for ARM. If someone has another idea for how to avoid the #ifdefs, I'm 
open to that too.
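
A minimal sketch of that layout, with everything collapsed into one 
listing for brevity. All names (gf8_ops, gf8_init_arm, CPU_FLAG_NEON and 
so on) are hypothetical and not gf-complete's API, and the "NEON" 
routine is a stand-in that calls the generic code so the sketch builds 
on any machine.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*gf8_region_fn)(const uint8_t *src, uint8_t *dst,
                              size_t len, uint8_t c);

typedef struct {
    const char   *impl;       /* name of the selected implementation */
    gf8_region_fn mul_region; /* dst[i] = c * src[i] */
} gf8_ops;

#define CPU_FLAG_NEON 0x1u /* hypothetical CPU feature bit */

/* ---- generic file: portable C only ---- */
static uint8_t gf8_mul_c(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0)); /* assumes 0x11d */
    }
    return r;
}

static void mul_region_c(const uint8_t *src, uint8_t *dst,
                         size_t len, uint8_t c)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = gf8_mul_c(src[i], c);
}

static void gf8_init_generic(gf8_ops *ops)
{
    ops->impl = "generic";
    ops->mul_region = mul_region_c;
}

/* ---- w8_arm.c: would only be compiled for ARM ---- */
static void mul_region_neon(const uint8_t *src, uint8_t *dst,
                            size_t len, uint8_t c)
{
    mul_region_c(src, dst, len, c); /* stand-in for real NEON code */
}

static void gf8_init_arm(gf8_ops *ops, unsigned cpu_flags)
{
    if (cpu_flags & CPU_FLAG_NEON) {
        ops->impl = "neon";
        ops->mul_region = mul_region_neon;
    }
}

/* ---- common init: the only place that knows about architectures ---- */
static void gf8_init(gf8_ops *ops, unsigned cpu_flags)
{
    gf8_init_generic(ops);        /* always start from the portable code */
    gf8_init_arm(ops, cpu_flags); /* override where the CPU allows it */
}
```

Callers only ever go through ops->mul_region, so the arch-specific 
preprocessor checks shrink to the single call site in the common init 
(or to a stub function on other architectures).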

I'm currently using the SSE/NOSSE region option for NEON, which is bogus. 
I'm wondering whether I should just rename it to SIMD/NOSIMD (not 
strictly accurate, since the carry-less operations for w64 and w128 use 
the SIMD instruction set but operate on single data). That would need 
to keep backward compatibility with SSE/NOSSE. The other option would 
be to add NEON/NONEON flags.
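
One way the rename could stay backward compatible is to keep the old 
spellings as aliases for the new flags. The sketch below is only an 
illustration: the enum values and parse_region_option are hypothetical 
and do not reflect gf-complete's actual option parser.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical region flags after the rename. */
enum {
    REGION_SIMD   = 1 << 0,
    REGION_NOSIMD = 1 << 1
};

/* Map an option string to a flag; returns -1 for unknown options.
 * The legacy SSE/NOSSE names resolve to the same flags as the new ones. */
static int parse_region_option(const char *s)
{
    static const struct {
        const char *name;
        int         flag;
    } opts[] = {
        { "SIMD",   REGION_SIMD   },
        { "NOSIMD", REGION_NOSIMD },
        { "SSE",    REGION_SIMD   }, /* legacy alias */
        { "NOSSE",  REGION_NOSIMD }, /* legacy alias */
    };

    for (size_t i = 0; i < sizeof(opts) / sizeof(opts[0]); i++)
        if (strcmp(s, opts[i].name) == 0)
            return opts[i].flag;
    return -1;
}
```

Existing command lines and scripts that pass SSE or NOSSE would keep 
working, while new code only needs to know about SIMD and NOSIMD.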

I'm sure I'll find other issues to discuss when I start integrating the 
NEON optimisations into jerasure and ceph.

thanks

Janne


* Re: ARM NEON optimisations for gf-complete/jerasure/ceph-erasure
  2014-09-04 14:42 ARM NEON optimisations for gf-complete/jerasure/ceph-erasure Janne Grunau
@ 2014-09-04 15:21 ` Loic Dachary
       [not found]   ` <CA+AFVBg1oKix1U=qYdLBQ+j4MPYek-npz5UkkFh+dtR_UADUxw@mail.gmail.com>
  2014-09-04 15:57 ` Loic Dachary
  1 sibling, 1 reply; 7+ messages in thread
From: Loic Dachary @ 2014-09-04 15:21 UTC
  To: Janne Grunau; +Cc: ceph-devel, Ethan Miller, Kevin Greenan

Hi Janne,

This is great news :-) Added Ethan & Kevin to the discussion.

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre



* Re: ARM NEON optimisations for gf-complete/jerasure/ceph-erasure
  2014-09-04 14:42 ARM NEON optimisations for gf-complete/jerasure/ceph-erasure Janne Grunau
  2014-09-04 15:21 ` Loic Dachary
@ 2014-09-04 15:57 ` Loic Dachary
  2014-09-05  0:27   ` Ethan L. Miller
  1 sibling, 1 reply; 7+ messages in thread
From: Loic Dachary @ 2014-09-04 15:57 UTC
  To: Janne Grunau, ceph-devel

Hi Janne,

On 04/09/2014 16:42, Janne Grunau wrote:
> The #if/#ifdefs in the source start to make it hard to read when more 
> than one optimization is added. Separating arch-specific 
> implementations from each other and from the generic implementation 
> works reasonably well for the multimedia-related projects I have 
> experience with (libav/FFmpeg, x264). There would be arch-specific init 
> functions which set the appropriate function pointers. The NEON 
> optimisations would then live in w8_arm.c, which would only be compiled 
> for ARM. If someone has another idea for how to avoid the #ifdefs, I'm 
> open to that too.

Would it be possible to make use of ifunc ( https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attributes.html#index-g_t_0040code_007bifunc_007d-attribute-2529 ) to choose the function depending on CPU features?

http://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/i386-and-x86-64-Options.html#i386-and-x86-64-Options

http://www.spinics.net/lists/ceph-devel/msg18452.html

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre



* Re: ARM NEON optimisations for gf-complete/jerasure/ceph-erasure
  2014-09-04 15:57 ` Loic Dachary
@ 2014-09-05  0:27   ` Ethan L. Miller
  2014-09-05  7:51     ` Janne Grunau
  0 siblings, 1 reply; 7+ messages in thread
From: Ethan L. Miller @ 2014-09-05  0:27 UTC
  To: Loic Dachary; +Cc: Janne Grunau, Ceph Development, Kevin Greenan, Ethan Miller

Yes, it's possible to use CPU flags to allow the use of advanced
instruction sets automatically.  The difficulty is that, if those
instructions aren't available, it's not clear which of the "basic"
approaches to use, since performance can vary based on a lot of
factors.  Even with advanced instructions, there are often multiple
reasonable approaches to take, as Janne's email makes clear, so it's
impossible to say "this algorithm is always best".

We can certainly set up a default approach if we want, though, that
can be overridden by compile-time flags.

Incidentally, I'm starting to work on coding a version of gf-complete
(and associated erasure coding functions) in C++ using templates,
which will hopefully allow us to better separate out different
implementations.  We could still have run-time dispatch for the
desired routines, but templates should allow for more compact code and
better isolation of architecture-specific code.  The big drawback is
that C++ code isn't typically used in the kernel....

ethan

-- 
( Ethan L. Miller               Email: elm@cs.ucsc.edu            )
( Professor, Computer Science   Web: http://www.cs.ucsc.edu/~elm/ )
( University of California      Phone: +1 831 459-1222            )
( Santa Cruz, CA 95064 USA      Fax:   +1 831 459-1041            )
( PGP keyprint: 76C7 D699 1FF6 A1A4 B7A1 9629 2EBF 1273 A6ED 6A09 )

* Re: ARM NEON optimisations for gf-complete/jerasure/ceph-erasure
  2014-09-05  0:27   ` Ethan L. Miller
@ 2014-09-05  7:51     ` Janne Grunau
  0 siblings, 0 replies; 7+ messages in thread
From: Janne Grunau @ 2014-09-05  7:51 UTC
  To: elm; +Cc: Loic Dachary, Ceph Development, Kevin Greenan, Ethan Miller

Hi,

On 2014-09-04 17:27:16 -0700, Ethan L. Miller wrote:
> Yes, it's possible to use CPU flags to allow the use of advanced
> instruction sets automatically.

Runtime detection of supported instruction sets is possible with the 
current function pointer approach too.
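
For illustration, runtime detection on Linux could look roughly like the 
sketch below, whose result an init function would consult before 
installing NEON function pointers. It assumes a Linux/glibc target where 
getauxval() and the kernel's HWCAP_NEON (32-bit ARM) and HWCAP_ASIMD 
(AArch64) bits are available; cpu_has_neon is a hypothetical name, and 
other operating systems would need their own mechanism.

```c
#if defined(__linux__) && (defined(__arm__) || defined(__aarch64__))
#include <sys/auxv.h>  /* getauxval(), AT_HWCAP */
#include <asm/hwcap.h> /* HWCAP_NEON / HWCAP_ASIMD */
#endif

/* Returns 1 if the running CPU reports NEON / Advanced SIMD, else 0.
 * On anything other than Linux on ARM this sketch simply reports 0. */
static int cpu_has_neon(void)
{
#if defined(__linux__) && defined(__aarch64__)
    return (getauxval(AT_HWCAP) & HWCAP_ASIMD) != 0;
#elif defined(__linux__) && defined(__arm__)
    return (getauxval(AT_HWCAP) & HWCAP_NEON) != 0;
#else
    return 0;
#endif
}
```

The preprocessor checks are confined to this one helper; the code that 
picks function pointers only sees a plain integer.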

>  The difficulty is that, if those
> instructions aren't available, it's not clear which of the "basic"
> approaches to use, since performance can vary based on a lot of
> factors.  Even with advanced instructions, there are often multiple
> reasonable approaches to take, as Janne's email makes clear, so it's
> impossible to say "this algorithm is always best".

I agree that the current approach fits the model of implementations with 
different cpu/memory use better. Using ifunc would be mostly orthogonal 
to the issue of badly structured code.

> We can certainly set up a default approach if we want, though, that
> can be overridden by compile-time flags.

I don't think this would be an improvement.

> Incidentally, I'm starting to work on coding a version of gf-complete
> (and associated erasure coding functions) in C++ using templates,
> which will hopefully allow us to better separate out different
> implementations.  We could still have run-time dispatch for the
> desired routines, but templates should allow for more compact code and
> better isolation of architecture-specific code.  The big drawback is
> that C++ code isn't typically used in the kernel....

One possible simplification for the carry-less multiplication would be 
to rely on inlining and optimisation of compile-time constants.

Implement one function which does a variable number of polynomial 
reductions. The current functions would then just be thin wrappers which 
call the general function with a compile-time constant for the number of 
reductions. Forced inlining and dead code removal will optimize the 
branches away. The same method could be used to avoid duplicating the 
inner loop for the optional xor with the destination.
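
A scalar sketch of this pattern follows. It is an illustration, not the 
gf-complete code: the function names are hypothetical, the polynomial 
0x11d (low part 0x1d) is an assumption, and always_inline is a GCC/Clang 
attribute, so other compilers fall back to a plain inline hint.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__)
#define FORCE_INLINE static inline __attribute__((always_inline))
#else
#define FORCE_INLINE static inline /* only a hint on other compilers */
#endif

/* Carry-less 8x8 -> 16 bit multiply (scalar model of one SIMD lane). */
FORCE_INLINE uint16_t clmul8(uint8_t a, uint8_t b)
{
    uint16_t r = 0;
    for (int i = 0; i < 8; i++)
        if (b & (1u << i))
            r ^= (uint16_t)((uint16_t)a << i);
    return r;
}

/* General routine: 'reductions' and 'xor_dst' are ordinary parameters,
 * but every caller passes constants, so after inlining the reduction
 * loop can be unrolled and the xor branch removed as dead code. */
FORCE_INLINE void clm_region(const uint8_t *src, uint8_t *dst, size_t len,
                             uint8_t c, uint8_t poly_lo,
                             int reductions, int xor_dst)
{
    for (size_t i = 0; i < len; i++) {
        uint16_t p = clmul8(src[i], c);
        for (int r = 0; r < reductions; r++)
            p = (uint16_t)((p & 0xff) ^ clmul8((uint8_t)(p >> 8), poly_lo));
        if (xor_dst)
            dst[i] ^= (uint8_t)p;
        else
            dst[i] = (uint8_t)p;
    }
}

/* Thin wrappers: one per (reduction count, xor) combination. */
static void clm_region_2(const uint8_t *src, uint8_t *dst,
                         size_t len, uint8_t c)
{
    clm_region(src, dst, len, c, 0x1d, 2, 0);
}

static void clm_region_2_xor(const uint8_t *src, uint8_t *dst,
                             size_t len, uint8_t c)
{
    clm_region(src, dst, len, c, 0x1d, 2, 1);
}
```

Variants for three or four reductions would be two more one-line 
wrappers rather than copies of the loop, and the wrappers are what the 
function pointers would refer to.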

Janne


* Re: ARM NEON optimisations for gf-complete/jerasure/ceph-erasure
       [not found]   ` <CA+AFVBg1oKix1U=qYdLBQ+j4MPYek-npz5UkkFh+dtR_UADUxw@mail.gmail.com>
@ 2014-09-18 10:11     ` Janne Grunau
  2014-10-10 14:01     ` Janne Grunau
  1 sibling, 0 replies; 7+ messages in thread
From: Janne Grunau @ 2014-09-18 10:11 UTC
  To: Kevin Greenan; +Cc: Loic Dachary, Ceph Development, Ethan Miller

Hi Kevin,

On 2014-09-16 11:25:12 -0700, Kevin Greenan wrote:
> 
> I feel that separating the arch-specific implementations out and have a
> default 'generic' implementation would be a huge improvement.  Note that
> gf-complete was in active development for some time before including the
> SIMD code.  In hindsight, we should have done this separation back in 2012,
> but had some time pressure due to a paper deadline and limited time
> available to the contributors.
> 
> Also, I agree w.r.t. the preprocessor stuff.  Going with SIMD/NOSIMD is
> fine by me.

I'll rename that and start implementing the NEON-optimized functions in 
their own files.
 
> Also, there should be very little "SIMD" work with jerasure, as gf-complete
> is the Galois field backend, so I would not worry too much about that.

I noticed. I have already hooked my NEON code into ceph locally without 
touching jerasure.

> That covers "clean-up" work.  We can discuss the best way to choose the
> underlying implementation (looks like we have a bunch of options) as this
> work is completed.
> 
> With this in mind, what work were you planning to do?  I can try to free up
> cycles to help, but that may not happen for a few weeks.

Primarily NEON optimisations for gf-complete/ceph. Shouldn't take more 
than a few days though.

> One last thing...  If you do have code you want to push upstream, please
> submit a pull request(s) to our main bitbucket repo.
> 
> Make sense?

yes, thanks.

Janne


* Re: ARM NEON optimisations for gf-complete/jerasure/ceph-erasure
       [not found]   ` <CA+AFVBg1oKix1U=qYdLBQ+j4MPYek-npz5UkkFh+dtR_UADUxw@mail.gmail.com>
  2014-09-18 10:11     ` Janne Grunau
@ 2014-10-10 14:01     ` Janne Grunau
  1 sibling, 0 replies; 7+ messages in thread
From: Janne Grunau @ 2014-10-10 14:01 UTC
  To: Kevin Greenan; +Cc: Loic Dachary, Ceph Development, Ethan Miller

Hi Kevin,

On 2014-09-16 11:25:12 -0700, Kevin Greenan wrote:
> 
> I feel that separating the arch-specific implementations out and have a
> default 'generic' implementation would be a huge improvement.  Note that
> gf-complete was in active development for some time before including the
> SIMD code.  In hindsight, we should have done this separation back in 2012,
> but had some time pressure due to a paper deadline and limited time
> available to the contributors.
> 
> Also, I agree w.r.t. the preprocessor stuff.  Going with SIMD/NOSIMD is
> fine by me.

I created a pull request with my NEON optimisations, the SSE -> SIMD 
rename and some minor fixes.

The NEON methods all reside in their own files. I didn't come up with a 
good solution for the init / scratch_size functions, so I added 
ARM-specific defines there.

> Also, there should be very little "SIMD" work with jerasure, as gf-complete
> is the Galois field backend, so I would not worry too much about that.

Yes, there was no SIMD work in jerasure.

Please have a look at 
https://bitbucket.org/jimplank/gf-complete/pull-request/25/arm-neon-optimisations 
I'll be available to address review comments and suggestions.

regards

Janne

