* ARM NEON optimisations for gf-complete/jerasure/ceph-erasure
@ 2014-09-04 14:42 Janne Grunau
2014-09-04 15:21 ` Loic Dachary
2014-09-04 15:57 ` Loic Dachary
0 siblings, 2 replies; 7+ messages in thread
From: Janne Grunau @ 2014-09-04 14:42 UTC (permalink / raw)
To: ceph-devel
Hi,
I've started writing ARM/AArch64 NEON optimisations for gf-complete.
http://git.jannau.net/gf-complete.git/log/?h=neon has proof-of-concept
AArch64 NEON optimisations for w8.
The methods implemented so far are the carry-less/polynomial multiplication
and the split table. The polynomial multiplication is reasonably fast
for region multiplications (~2000MB/s on an Apple A7 at 1.3GHz) since
NEON has an 8-bit to 16-bit SIMD polynomial multiplication.
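For readers unfamiliar with the method, here is a minimal scalar sketch of a carry-less multiplication followed by a polynomial reduction. It is not the NEON code from the branch above. `clmul8` models what one lane of the NEON `vmull_p8` intrinsic computes, and the field polynomial 0x11d and all function names are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed field polynomial for w=8: x^8 + x^4 + x^3 + x^2 + 1. */
#define GF8_POLY 0x11du

/* Carry-less (polynomial) multiply of two 8-bit values into a 16-bit
 * product: partial products are combined with XOR instead of addition. */
static uint16_t clmul8(uint8_t a, uint8_t b)
{
    uint16_t r = 0;
    for (int i = 0; i < 8; i++)
        if (b & (1u << i))
            r ^= (uint16_t)((uint16_t)a << i);
    return r;
}

/* Reduce a product of degree <= 14 modulo GF8_POLY, one bit at a time. */
static uint8_t gf8_reduce(uint16_t p)
{
    for (int bit = 14; bit >= 8; bit--)
        if (p & (1u << bit))
            p ^= (uint16_t)(GF8_POLY << (bit - 8));
    return (uint8_t)p;
}

/* Multiply two field elements. */
uint8_t gf8_mul(uint8_t a, uint8_t b)
{
    return gf8_reduce(clmul8(a, b));
}

/* Region multiplication: dst[i] = c * src[i] for n bytes. */
void gf8_mul_region(uint8_t *dst, const uint8_t *src, size_t n, uint8_t c)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = gf8_mul(src[i], c);
}
```

A SIMD version would compute eight such products per instruction and perform the reduction with further polynomial multiplications rather than a bit loop.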
The split table method is still faster though, at 5700MB/s on the same CPU.
I'm actually surprised by that since it is faster (per cycle) than the
Core i7-3770 figures in gf-complete's manual (page 14). That suggests that
the SSE3 code might not be optimal.
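As an illustration of the split table idea, here is a scalar sketch under stated assumptions: the field polynomial 0x11d, the struct, and the function names are hypothetical and not taken from gf-complete. Each source byte is split into two 4-bit halves, each half indexes a 16-entry table of precomputed products, and the two results are XORed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference shift-and-add multiply in GF(2^8), assumed polynomial 0x11d. */
static uint8_t gf8_mul_slow(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
        b >>= 1;
    }
    return r;
}

/* Two 16-entry tables for one constant c. */
struct split_tables {
    uint8_t lo[16]; /* c * i        for i = 0..15 */
    uint8_t hi[16]; /* c * (i << 4) for i = 0..15 */
};

void split_init(struct split_tables *t, uint8_t c)
{
    assert(t != NULL);
    for (int i = 0; i < 16; i++) {
        t->lo[i] = gf8_mul_slow(c, (uint8_t)i);
        t->hi[i] = gf8_mul_slow(c, (uint8_t)(i << 4));
    }
}

/* c * x = c * (x & 0x0f) ^ c * (x & 0xf0), since multiplication
 * distributes over XOR. */
void split_mul_region(uint8_t *dst, const uint8_t *src, size_t n,
                      const struct split_tables *t)
{
    assert(dst != NULL && src != NULL && t != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = t->lo[src[i] & 0x0f] ^ t->hi[src[i] >> 4];
}
```

The method maps well to SIMD because a 16-entry byte table fits in one 128-bit register, so a byte table-lookup instruction (such as `pshufb` on x86 or `tbl` on AArch64) can perform 16 lookups at once.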
I'm currently working on integrating NEON into the build system and then
will extend the existing code to work on ARMv7-a too. Those two are
straightforward. There are a couple of other issues I would like to
discuss before I start to work on them.
The #if/#ifdefs in the source are starting to make it hard to
read when more than one optimisation is added. Separating arch-specific
implementations from each other and from the generic implementation
works reasonably well for the multimedia-related projects I have
experience with (libav/FFmpeg, x264). There would be arch-specific init
functions which set the appropriate function pointers. The NEON
optimisations would then live in w8_arm.c, which would only be compiled
for ARM. If someone has another idea for how to avoid the #ifdefs, I'm open
to that too.
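A minimal sketch of that layout, assuming hypothetical names (`struct gf_w8_ops`, `gf_w8_init`, `gf_w8_init_arm`); none of them are gf-complete's actual API. The generic init fills in portable defaults, and an arch-specific init, which would live in its own file, overrides individual pointers. The NEON function here is only a stand-in that calls the generic code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*mul_region_fn)(uint8_t *dst, const uint8_t *src,
                              size_t n, uint8_t c);

/* Hypothetical per-field dispatch table. */
struct gf_w8_ops {
    mul_region_fn multiply_region;
    const char *name;
};

/* Portable shift-and-add multiply, assumed polynomial 0x11d. */
static uint8_t gf8_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
        b >>= 1;
    }
    return r;
}

static void mul_region_generic(uint8_t *dst, const uint8_t *src,
                               size_t n, uint8_t c)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = gf8_mul(src[i], c);
}

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
/* In the proposed layout this part would live in w8_arm.c. */
static void mul_region_neon(uint8_t *dst, const uint8_t *src,
                            size_t n, uint8_t c)
{
    mul_region_generic(dst, src, n, c); /* stand-in for real NEON code */
}

static void gf_w8_init_arm(struct gf_w8_ops *ops)
{
    ops->multiply_region = mul_region_neon;
    ops->name = "neon";
}
#endif

/* Generic init: set defaults, then let the arch init override them. */
void gf_w8_init(struct gf_w8_ops *ops)
{
    assert(ops != NULL);
    ops->multiply_region = mul_region_generic;
    ops->name = "generic";
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    gf_w8_init_arm(ops);
#endif
}
```

With this shape, the only preprocessor conditional left in shared code is the call to the arch init; callers always go through `ops->multiply_region`.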
I'm currently using the SSE/NOSSE region option, which is bogus for NEON
code. I'm wondering whether I should just rename that to SIMD/NOSIMD (not
strictly accurate, since the carry-less operations for w64 and w128 only use
the SIMD instruction set but operate on single data). That would need to keep
backward compatibility for SSE/NOSSE. The other option would be to add
NEON/NONEON flags.
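One way the backward compatibility could look, as a sketch only: the enum and the parser below are hypothetical and do not reflect gf-complete's real option handling. The old spellings are accepted as aliases of the new ones, so existing command lines keep working.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical region-option values. */
enum region_option {
    REGION_INVALID = -1,
    REGION_SIMD    = 1,
    REGION_NOSIMD  = 2
};

/* Accept the new names and keep the old SSE/NOSSE names as aliases. */
enum region_option parse_region_option(const char *s)
{
    assert(s != NULL);
    if (strcmp(s, "SIMD") == 0 || strcmp(s, "SSE") == 0)
        return REGION_SIMD;
    if (strcmp(s, "NOSIMD") == 0 || strcmp(s, "NOSSE") == 0)
        return REGION_NOSIMD;
    return REGION_INVALID;
}
```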
I'm sure I'll find other issues to discuss when I start integrating the
NEON optimisations into jerasure and ceph.
thanks
Janne
* Re: ARM NEON optimisations for gf-complete/jerasure/ceph-erasure
From: Loic Dachary @ 2014-09-04 15:21 UTC (permalink / raw)
To: Janne Grunau; +Cc: ceph-devel, Ethan Miller, Kevin Greenan
Hi Janne,
This is great news :-) Added Ethan & Kevin to the discussion.
Cheers
--
Loïc Dachary, Artisan Logiciel Libre
* Re: ARM NEON optimisations for gf-complete/jerasure/ceph-erasure
From: Loic Dachary @ 2014-09-04 15:57 UTC (permalink / raw)
To: Janne Grunau, ceph-devel
Hi Janne,
On 04/09/2014 16:42, Janne Grunau wrote:
> The #if/#ifdefs in the source are starting to make it hard to
> read when more than one optimisation is added. Separating arch-specific
> implementations from each other and from the generic implementation
> works reasonably well for the multimedia-related projects I have
> experience with (libav/FFmpeg, x264). There would be arch-specific init
> functions which set the appropriate function pointers. The NEON
> optimisations would then live in w8_arm.c, which would only be compiled
> for ARM. If someone has another idea for how to avoid the #ifdefs, I'm open
> to that too.
Would it be possible to make use of ifunc ( https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attributes.html#index-g_t_0040code_007bifunc_007d-attribute-2529 ) to choose the function depending on CPU features?
http://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/i386-and-x86-64-Options.html#i386-and-x86-64-Options
http://www.spinics.net/lists/ceph-devel/msg18452.html
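For reference, a small sketch of what an ifunc-based selection could look like. It assumes GCC or Clang on an ELF target with a C library that supports indirect functions (for example glibc on Linux); all function names are hypothetical, and the "SIMD" variant is only a stand-in that computes the same result as the generic one. The resolver runs once, when the symbol is first bound, and returns the implementation to use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void mul_region_t(uint8_t *dst, const uint8_t *src,
                          size_t n, uint8_t c);

/* Portable shift-and-add multiply, assumed polynomial 0x11d. */
static uint8_t gf8_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
        b >>= 1;
    }
    return r;
}

static void mul_region_generic(uint8_t *dst, const uint8_t *src,
                               size_t n, uint8_t c)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = gf8_mul(src[i], c);
}

/* Stand-in for an optimised variant; a real one would use SIMD. */
static void mul_region_simd(uint8_t *dst, const uint8_t *src,
                            size_t n, uint8_t c)
{
    mul_region_generic(dst, src, n, c);
}

/* CPU check; only implemented for x86 in this sketch. */
static int cpu_has_simd(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init(); /* must be called explicitly inside a resolver */
    return __builtin_cpu_supports("ssse3");
#else
    return 0;
#endif
}

/* Resolver: picks the implementation when the symbol is resolved. */
static mul_region_t *resolve_mul_region(void)
{
    return cpu_has_simd() ? mul_region_simd : mul_region_generic;
}

void gf8_mul_region(uint8_t *dst, const uint8_t *src, size_t n, uint8_t c)
    __attribute__((ifunc("resolve_mul_region")));
```

Callers simply call `gf8_mul_region`; there is one implementation per symbol, so choosing between several algorithms for the same operation would still need a separate mechanism.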
Cheers
--
Loïc Dachary, Artisan Logiciel Libre
* Re: ARM NEON optimisations for gf-complete/jerasure/ceph-erasure
From: Ethan L. Miller @ 2014-09-05 0:27 UTC (permalink / raw)
To: Loic Dachary; +Cc: Janne Grunau, Ceph Development, Kevin Greenan, Ethan Miller
Yes, it's possible to use CPU flags to allow the use of advanced
instruction sets automatically. The difficulty is that, if those
instructions aren't available, it's not clear which of the "basic"
approaches to use, since performance can vary based on a lot of
factors. Even with advanced instructions, there are often multiple
reasonable approaches to take, as Janne's email makes clear, so it's
impossible to say "this algorithm is always best".
We can certainly set up a default approach if we want, though, that
can be overridden by compile-time flags.
Incidentally, I'm starting to work on coding a version of gf-complete
(and associated erasure coding functions) in C++ using templates,
which will hopefully allow us to better separate out different
implementations. We could still have run-time dispatch for the
desired routines, but templates should allow for more compact code and
better isolation of architecture-specific code. The big drawback is
that C++ code isn't typically used in the kernel....
ethan
--
( Ethan L. Miller Email: elm@cs.ucsc.edu )
( Professor, Computer Science Web: http://www.cs.ucsc.edu/~elm/ )
( University of California Phone: +1 831 459-1222 )
( Santa Cruz, CA 95064 USA Fax: +1 831 459-1041 )
( PGP keyprint: 76C7 D699 1FF6 A1A4 B7A1 9629 2EBF 1273 A6ED 6A09 )
* Re: ARM NEON optimisations for gf-complete/jerasure/ceph-erasure
From: Janne Grunau @ 2014-09-05 7:51 UTC (permalink / raw)
To: elm; +Cc: Loic Dachary, Ceph Development, Kevin Greenan, Ethan Miller
Hi,
On 2014-09-04 17:27:16 -0700, Ethan L. Miller wrote:
> Yes, it's possible to use CPU flags to allow the use of advanced
> instruction sets automatically.
Runtime detection of supported instruction sets is also possible with the
current function pointer approach.
> The difficulty is that, if those
> instructions aren't available, it's not clear which of the "basic"
> approaches to use, since performance can vary based on a lot of
> factors. Even with advanced instructions, there are often multiple
> reasonable approaches to take, as Janne's email makes clear, so it's
> impossible to say "this algorithm is always best".
I agree that the current approach fits the model of implementations with
different cpu/memory use better. Using ifunc would be mostly orthogonal
to the issue of badly structured code.
> We can certainly set up a default approach if we want, though, that
> can be overridden by compile-time flags.
I don't think this would be an improvement.
> Incidentally, I'm starting to work on coding a version of gf-complete
> (and associated erasure coding functions) in C++ using templates,
> which will hopefully allow us to better separate out different
> implementations. We could still have run-time dispatch for the
> desired routines, but templates should allow for more compact code and
> better isolation of architecture-specific code. The big drawback is
> that C++ code isn't typically used in the kernel....
One possible simplification for the carry-less multiplication would be
to rely on inlining and on the optimisation of compile-time constants:
implement one function which does a variable number of polynomial
reductions. The current functions would then just be thin wrappers which
call the general function with a compile-time constant for the number of
reductions. Forced inlining and dead-code removal will optimise the branches
away. The same method could be used to avoid duplicating the
inner loop for the optional xor with the destination.
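A scalar sketch of that idea, with assumptions stated: the polynomial 0x11d, the function names, and the GCC/Clang `always_inline` attribute are illustrative choices, and the scalar `clmul` stands in for a hardware carry-less multiply. Each reduction step multiplies the bits above position 7 by the polynomial and XORs the result back in; for 0x11d, two steps are enough.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GF8_POLY 0x11du /* assumed field polynomial */
#define FORCE_INLINE static inline __attribute__((always_inline))

/* Scalar carry-less multiply (stand-in for a hardware instruction). */
FORCE_INLINE uint32_t clmul(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 16; i++)
        if (b & (1u << i))
            r ^= a << i;
    return r;
}

/* General function: 'reductions' is meant to be a compile-time constant
 * at every call site, so the loop can be fully unrolled after inlining. */
FORCE_INLINE uint8_t clm_multiply(uint8_t a, uint8_t b, int reductions)
{
    uint32_t r = clmul(a, b);
    for (int i = 0; i < reductions; i++)
        r ^= clmul(r >> 8, GF8_POLY); /* fold the high bits back down */
    return (uint8_t)r;
}

/* General region loop: 'xor_dst' is also a constant at each call site,
 * so the untaken branch can be removed as dead code. */
FORCE_INLINE void clm_region(uint8_t *dst, const uint8_t *src, size_t n,
                             uint8_t c, int reductions, int xor_dst)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++) {
        uint8_t p = clm_multiply(src[i], c, reductions);
        dst[i] = xor_dst ? (uint8_t)(dst[i] ^ p) : p;
    }
}

/* Thin wrappers: one per combination that was a separate function. */
uint8_t clm_multiply_2(uint8_t a, uint8_t b) { return clm_multiply(a, b, 2); }
uint8_t clm_multiply_3(uint8_t a, uint8_t b) { return clm_multiply(a, b, 3); }

void clm_region_2(uint8_t *dst, const uint8_t *src, size_t n, uint8_t c)
{
    clm_region(dst, src, n, c, 2, 0);
}

void clm_region_2_xor(uint8_t *dst, const uint8_t *src, size_t n, uint8_t c)
{
    clm_region(dst, src, n, c, 2, 1);
}
```

Whether the compiler actually removes the branches depends on the optimisation level; the attribute only forces the inlining that makes the constants visible.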
Janne
* Re: ARM NEON optimisations for gf-complete/jerasure/ceph-erasure
From: Janne Grunau @ 2014-09-18 10:11 UTC (permalink / raw)
To: Kevin Greenan; +Cc: Loic Dachary, Ceph Development, Ethan Miller
Hi Kevin,
On 2014-09-16 11:25:12 -0700, Kevin Greenan wrote:
>
> I feel that separating the arch-specific implementations out and have a
> default 'generic' implementation would be a huge improvement. Note that
> gf-complete was in active development for some time before including the
> SIMD code. In hindsight, we should have done this separation back in 2012,
> but had some time pressure due to a paper deadline and limited time
> available to the contributors.
>
> Also, I agree w.r.t. the preprocessor stuff. Going with SIMD/NOSIMD is
> fine by me.
I'll rename that and start implementing the NEON-optimised functions in their
own files.
> Also, there should be very little "SIMD" work with jerasure, as gf-complete
> is the Galois field backend, so I would not worry too much about that.
I noticed. I have already hooked my NEON code into ceph locally without
touching jerasure.
> That covers "clean-up" work. We can discuss the best way to choose the
> underlying implementation (looks like we have a bunch of options) as this
> work is completed.
>
> With this in mind, what work were you planning to do? I can try to free up
> cycles to help, but that may not happen for a few weeks.
Primarily NEON optimisations for gf-complete/ceph. Shouldn't take more
than a few days though.
> One last thing... If you do have code you want to push upstream, please
> submit a pull request(s) to our main bitbucket repo.
>
> Make sense?
yes, thanks.
Janne
* Re: ARM NEON optimisations for gf-complete/jerasure/ceph-erasure
From: Janne Grunau @ 2014-10-10 14:01 UTC (permalink / raw)
To: Kevin Greenan; +Cc: Loic Dachary, Ceph Development, Ethan Miller
Hi Kevin,
On 2014-09-16 11:25:12 -0700, Kevin Greenan wrote:
>
> I feel that separating the arch-specific implementations out and have a
> default 'generic' implementation would be a huge improvement. Note that
> gf-complete was in active development for some time before including the
> SIMD code. In hindsight, we should have done this separation back in 2012,
> but had some time pressure due to a paper deadline and limited time
> available to the contributors.
>
> Also, I agree w.r.t. the preprocessor stuff. Going with SIMD/NOSIMD is
> fine by me.
I created a pull request with my NEON optimisations, the SSE -> SIMD
rename and some minor fixes.
The NEON methods all reside in their own files. I didn't come up with
a good solution for the init / scratch_size functions, so I added
ARM-specific defines there.
> Also, there should be very little "SIMD" work with jerasure, as gf-complete
> is the Galois field backend, so I would not worry too much about that.
Yes, there was no SIMD work in jerasure.
Please have a look at
https://bitbucket.org/jimplank/gf-complete/pull-request/25/arm-neon-optimisations
I'll be available to address review comments and suggestions.
regards
Janne