* jerasure/gf-complete segmentation violation
@ 2014-04-02 17:35 Loic Dachary
  2014-04-02 17:51 ` Loic Dachary
       [not found] ` <CA+AFVBg00yTzu-VGxSURDxv_UWOmZJEF+077txButeoSkoQuBg@mail.gmail.com>
  0 siblings, 2 replies; 8+ messages in thread
From: Loic Dachary @ 2014-04-02 17:35 UTC (permalink / raw)
  To: Kevin Greenan; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 2062 bytes --]

Hi Kevin,

In the context of http://tracker.ceph.com/issues/7914 we're trying to figure out why jerasure dumps core. We don't know how to reproduce it yet: dozens of identical test suite runs over the past few days produced no such crash, which is to be expected for a rare bug because the test suite introduces random errors / failures on purpose.

The full stack trace is at http://tracker.ceph.com/issues/7914#note-24 but the relevant part is here:

#0  0x00007f4756779b7b in raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#1  0x0000000000981b4e in reraise_fatal (signum=11) at global/signal_handler.cc:59
#2  handle_fatal_signal (signum=11) at global/signal_handler.cc:105
#3  <signal handler called>
#4  0x0000000000000000 in ?? ()
#5  0x00007f47385ae6b1 in jerasure_matrix_dotprod (k=2, w=8, matrix_row=0x31513a8, src_ids=0x0, dest_id=<optimized out>, data_ptrs=0x7f4741ec7a00, coding_ptrs=0x7f4741ec7a10, 
    size=2048) at erasure-code/jerasure/jerasure/src/jerasure.c:607
#6  0x00007f47385ae7d6 in jerasure_matrix_encode (k=2, m=1, w=8, matrix=<optimized out>, data_ptrs=0x7f4741ec7a00, coding_ptrs=0x7f4741ec7a10, size=2048)
    at erasure-code/jerasure/jerasure/src/jerasure.c:310
...

Note that this jerasure/gf-complete combination has been compiled with the SSE4.1, SSE4.2, PCLMUL, SSSE3, SSE3, SSE2 and SSE flags enabled. These are jerasure v2 and gf-complete v1, only slightly modified, as found in https://github.com/ceph/jerasure/tree/v2-ceph and https://github.com/ceph/gf-complete/tree/v1-ceph (all commits there have a pending pull request under https://bitbucket.org/jimplank/gf-complete and https://bitbucket.org/jimplank/jerasure, nothing you haven't seen before).

#5 is https://github.com/ceph/jerasure/blob/v2-ceph/src/jerasure.c#L607

and from there it dives into gf-complete, which most probably destroyed part of the stack while corrupting memory. I'll be chasing this tomorrow. If you have a brilliant idea about why that happens, I'll take it ;-)

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre



* Re: jerasure/gf-complete segmentation violation
  2014-04-02 17:35 jerasure/gf-complete segmentation violation Loic Dachary
@ 2014-04-02 17:51 ` Loic Dachary
  2014-04-02 22:57   ` Loic Dachary
       [not found] ` <CA+AFVBg00yTzu-VGxSURDxv_UWOmZJEF+077txButeoSkoQuBg@mail.gmail.com>
  1 sibling, 1 reply; 8+ messages in thread
From: Loic Dachary @ 2014-04-02 17:51 UTC (permalink / raw)
  To: Kevin Greenan; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 3072 bytes --]

Given the parameters to jerasure_matrix_dotprod the code path should be:

   https://github.com/ceph/jerasure/blob/v2-ceph/src/galois.c#L338 (because nbytes == 2048)
   https://github.com/ceph/jerasure/blob/v2-ceph/src/galois.c#L332 
   https://github.com/ceph/gf-complete/blob/v1-ceph/src/gf_w32.c#L569 (because INTEL_SSE4_PCLMUL was set at compile time and the CPUID detected at runtime has the required features, as selected in https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodePluginSelectJerasure.cc#L49 )

What should happen after that? h->prim_poly will select something, but what exactly... Could it be that the missing stack frames mean https://github.com/ceph/jerasure/blob/v2-ceph/src/galois.c#L332 references a NULL or invalid gfp_array[32]? Or could it be that the src/dest pointers point to invalid memory?
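
To illustrate the first hypothesis: a frame at 0x0000000000000000, like frame #4 in the trace, is what gdb typically shows after a call through a NULL function pointer. The sketch below is only a minimal model of that situation, not gf-complete code. The names field_ops, xor_region, field_ops_init and checked_region_op are hypothetical, and the kernel is a plain XOR rather than real Galois-field arithmetic.

```c
#include <stddef.h>

/* Hypothetical stand-in for a gf_t-like structure: a table of function
 * pointers that an init routine is expected to fill in. */
typedef struct {
    void (*region_op)(const unsigned char *src, unsigned char *dest,
                      size_t nbytes);
} field_ops;

/* Placeholder kernel (assumption): XOR src into dest. */
static void xor_region(const unsigned char *src, unsigned char *dest,
                       size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        dest[i] ^= src[i];
}

/* Fills in the table; until this runs, region_op may still be NULL. */
void field_ops_init(field_ops *ops)
{
    ops->region_op = xor_region;
}

/* Returns 0 on success and -1 if the table was never initialised.
 * Without this check, calling ops->region_op on an empty table would
 * jump to address 0 and raise SIGSEGV. */
int checked_region_op(const field_ops *ops, const unsigned char *src,
                      unsigned char *dest, size_t nbytes)
{
    if (ops == NULL || ops->region_op == NULL)
        return -1;
    ops->region_op(src, dest, nbytes);
    return 0;
}
```

A guard like this would not fix the underlying problem, but it would turn a jump to address 0 into an error that can be logged with some context.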

Bugs that can't be reproduced are the best ;-)
   

-- 
Loïc Dachary, Artisan Logiciel Libre



* Re: jerasure/gf-complete segmentation violation
       [not found] ` <CA+AFVBg00yTzu-VGxSURDxv_UWOmZJEF+077txButeoSkoQuBg@mail.gmail.com>
@ 2014-04-02 17:56   ` Loic Dachary
  2014-04-02 18:01     ` Sage Weil
  0 siblings, 1 reply; 8+ messages in thread
From: Loic Dachary @ 2014-04-02 17:56 UTC (permalink / raw)
  To: Kevin Greenan; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 3124 bytes --]



On 02/04/2014 19:44, Kevin Greenan wrote:
> Hey Loic,
> 
> Are you ensuring that Jerasure (actually gf-complete) is getting memory buffers aligned on 16-byte boundaries?  Without looking too deep, that is the first thing I would check.
> 

Yes

https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L32
https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L242
https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L65
https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L108

I'll re-read this logic tomorrow just to be sure.

Cheers

> I can have a deeper look later today or tomorrow.
> 
> -kevin
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



* Re: jerasure/gf-complete segmentation violation
  2014-04-02 17:56   ` Loic Dachary
@ 2014-04-02 18:01     ` Sage Weil
       [not found]       ` <CA+AFVBgVXsTLJuGh-FrJMx3ee11Ztf=g+B9gnHybg9EXwunfnw@mail.gmail.com>
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2014-04-02 18:01 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Kevin Greenan, Ceph Development

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3575 bytes --]

On Wed, 2 Apr 2014, Loic Dachary wrote:
> 
> 
> On 02/04/2014 19:44, Kevin Greenan wrote:
> > Hey Loic,
> > 
> > Are you ensuring that Jerasure (actually gf-complete) is getting memory buffers aligned on 16-byte boundaries?  Without looking too deep, that is the first thing I would check.
> > 
> 
> Yes
> 
> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L32
> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L242
> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L65
> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L108
> 

In this case they are 2K aligned:

(gdb) p data_ptrs[0]
$1 = 0x3e46000 "I'm the", ' ' <repeats 16 times>, "3th object!"
(gdb) p data_ptrs[1]
$2 = 0x3e46800 'z' <repeats 200 times>...
(gdb) p coding_ptrs[0]
$3 = 0x338e000 "I'm the", ' ' <repeats 16 times>, "3th object!"

sage
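
For reference, the alignment of the three addresses printed above can be checked with a mask test. This is a minimal sketch that assumes the alignment is a power of two; addr_is_aligned and is_aligned are hypothetical helpers, not part of jerasure or gf-complete.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if addr is a multiple of alignment, 0 otherwise.
 * alignment must be a power of two (e.g. 16 or 2048). */
int addr_is_aligned(uintptr_t addr, size_t alignment)
{
    return (addr & (uintptr_t)(alignment - 1)) == 0;
}

/* Same test for a pointer, e.g. data_ptrs[i] or coding_ptrs[i]. */
int is_aligned(const void *ptr, size_t alignment)
{
    return addr_is_aligned((uintptr_t)ptr, alignment);
}
```

With the values from gdb, 0x3e46000, 0x3e46800 and 0x338e000 all pass the test for 2048, and therefore also for the 16 bytes mentioned earlier in the thread.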



* Re: jerasure/gf-complete segmentation violation
  2014-04-02 17:51 ` Loic Dachary
@ 2014-04-02 22:57   ` Loic Dachary
  0 siblings, 0 replies; 8+ messages in thread
From: Loic Dachary @ 2014-04-02 22:57 UTC (permalink / raw)
  To: Kevin Greenan; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 3306 bytes --]

Here is the stack trace of a successful run, borrowed from the unit tests, to confirm the code path: http://tracker.ceph.com/issues/7914#note-27


-- 
Loïc Dachary, Artisan Logiciel Libre



* Re: jerasure/gf-complete segmentation violation
       [not found]       ` <CA+AFVBgVXsTLJuGh-FrJMx3ee11Ztf=g+B9gnHybg9EXwunfnw@mail.gmail.com>
@ 2014-04-06 10:12         ` Loic Dachary
       [not found]           ` <D590780E-5F28-4ADA-B9F5-E2E14C9C0D27@gmail.com>
  0 siblings, 1 reply; 8+ messages in thread
From: Loic Dachary @ 2014-04-06 10:12 UTC (permalink / raw)
  To: Kevin Greenan; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 4745 bytes --]

Hi,

This time it is an illegal instruction (http://tracker.ceph.com/issues/7914#note-31). Since the workload is slightly different, I'm trying to run it 30 times to see if that triggers the problem.

Cheers

On 02/04/2014 20:15, Kevin Greenan wrote:
> OK, it looks like this happens when the GF backend is first initialized (unless, like Loic pointed out, something is corrupted).
> 
> Is this consistently happening for carry-free multiply and w=32 (i.e. gf_w32_cfm_init)?
> 
> Can you send me a core + binary, so I can dig in gdb?
> 
> -kevin
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



* Re: jerasure/gf-complete segmentation violation
       [not found]               ` <CA+AFVBjomjD_oReuEcrkpR-y5CSLw7cCjOEa3T+XHHGieT+=Hg@mail.gmail.com>
@ 2014-04-07 18:29                 ` Loic Dachary
  2014-04-07 18:56                   ` Loic Dachary
  0 siblings, 1 reply; 8+ messages in thread
From: Loic Dachary @ 2014-04-07 18:29 UTC (permalink / raw)
  To: Kevin Greenan; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 7359 bytes --]

[re-adding the list for the record]

On 07/04/2014 19:53, Kevin Greenan wrote:
> Hey Loic,
> 
> BTW, you can get an illegal instruction fault if you are calling an intrinsic that is not supported on a particular platform.  Is the code being compiled on a platform that is different than the machines in your test harness?
> 

The plugin is compiled with three different sets of flags:

https://github.com/ceph/ceph/blob/firefly/src/erasure-code/jerasure/Makefile.am#L50

At runtime, the appropriate binary is loaded depending on the CPU features:

https://github.com/ceph/ceph/blob/firefly/src/erasure-code/jerasure/ErasureCodePluginSelectJerasure.cc#L42

The logs confirm that jerasure_sse4 is used in this particular case. All tests were run on machines whose CPU features were checked by

https://github.com/ceph/ceph/blob/firefly/src/arch/intel.c#L10

Do you see something missing?
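
To make the runtime selection concrete, here is a rough sketch of how a variant could be chosen from CPUID leaf 1. It is not the code in ErasureCodePluginSelectJerasure.cc or intel.c: select_variant and select_variant_for_this_cpu are hypothetical names, the variant names "sse3" and "generic" and the fallback order are assumptions (only jerasure_sse4 is named above), and <cpuid.h> is assumed to come from GCC or Clang. The bit positions are the ones documented for CPUID leaf 1.

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* CPUID leaf 1 feature bits. */
#define ECX_SSE3   (1u << 0)
#define ECX_PCLMUL (1u << 1)
#define ECX_SSSE3  (1u << 9)
#define ECX_SSE41  (1u << 19)
#define ECX_SSE42  (1u << 20)
#define EDX_SSE2   (1u << 26)

/* Picks a variant name from raw ECX/EDX values. The names and the
 * fallback order are assumptions made for this sketch. */
const char *select_variant(uint32_t ecx, uint32_t edx)
{
    const uint32_t sse3_bits = ECX_SSE3 | ECX_SSSE3;
    const uint32_t sse4_bits = sse3_bits | ECX_SSE41 | ECX_SSE42 | ECX_PCLMUL;

    if (!(edx & EDX_SSE2))
        return "generic";
    if ((ecx & sse4_bits) == sse4_bits)
        return "sse4";      /* every instruction set the build relies on */
    if ((ecx & sse3_bits) == sse3_bits)
        return "sse3";
    return "generic";
}

/* Queries the running CPU; falls back to "generic" on non-x86 builds. */
const char *select_variant_for_this_cpu(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return select_variant(ecx, edx);
#endif
    return "generic";
}
```

The point of requiring every bit is that a binary built with PCLMUL enabled must never be loaded on a CPU that lacks it, since that is one way to get an illegal instruction fault.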

Cheers

> -kevin
> 
> 
> On Sun, Apr 6, 2014 at 12:06 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
> 
> 
> 
>     On 06/04/2014 18:28, Kevin Greenan wrote:
>     > Hey Loic,
>     >
>     > Did this stuff start happening after a specific commit (or commits)?  I see this bug was opened 6 days ago and some changes to your fork as of 7 days ago...
>     >
>     > Or is this the first time you have run these tests with the new Jerasure backend?
> 
>     It's the first time we run tests with gf-complete / jerasure optimized (i.e. all flags from https://github.com/ceph/ceph/blob/master/m4/ax_intel.m4 are set because the compiler knows how and it's targeting x86_64). Before that and during three or four weeks we ran jerasure / gf-complete without any optimization. Before that we ran the previous jerasure version without gf-complete.
> 
>     Cheers
> 
>     >
>     > Thanks,
>     > -kevin
>     >
>     >

-- 
Loïc Dachary, Artisan Logiciel Libre



* Re: jerasure/gf-complete segmentation violation
  2014-04-07 18:29                 ` Loic Dachary
@ 2014-04-07 18:56                   ` Loic Dachary
  0 siblings, 0 replies; 8+ messages in thread
From: Loic Dachary @ 2014-04-07 18:56 UTC (permalink / raw)
  To: Kevin Greenan; +Cc: Ceph Development

[-- Attachment #1: Type: text/plain, Size: 7892 bytes --]

Hi Kevin,

In galois.c, gfp_array is a global variable. If galois_w16_region_xor is called from two different threads, there is a race condition:

   http://tracker.ceph.com/issues/7914#note-39

If you agree that it's a plausible explanation for the crashes, I'll start working on improving jerasure thread safety.
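
One possible direction, sketched under assumptions rather than taken from jerasure: make the lazy initialisation of each global field run exactly once with pthread_once, and publish the pointer only after the structure is fully built. The struct field below is a hypothetical stand-in for gf_t, field_w8 stands in for a single gfp_array slot, and all function names are made up for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for gf_t. */
typedef struct {
    int w;
    int ready;
} field;

static field *field_w8;                 /* plays the role of one gfp_array slot */
static pthread_once_t field_w8_once = PTHREAD_ONCE_INIT;
static int init_calls;                  /* only modified inside init_field_w8 */

static void init_field_w8(void)
{
    field *f = malloc(sizeof *f);
    assert(f != NULL);
    f->w = 8;
    f->ready = 1;
    field_w8 = f;                       /* published only once fully set up */
    init_calls++;
}

/* Safe to call from any number of threads: pthread_once runs the
 * initialiser a single time and makes its effects visible to callers. */
field *get_field_w8(void)
{
    int rc = pthread_once(&field_w8_once, init_field_w8);
    assert(rc == 0);
    (void)rc;
    return field_w8;
}

static void *worker(void *arg)
{
    (void)arg;
    return get_field_w8();
}

#define MAX_WORKERS 16

/* Starts up to MAX_WORKERS threads that all request the field and
 * returns how many of them saw a fully initialised one. */
int run_workers(int nthreads)
{
    pthread_t tids[MAX_WORKERS];
    int started = 0, ok = 0;

    if (nthreads > MAX_WORKERS)
        nthreads = MAX_WORKERS;
    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[started], NULL, worker, NULL) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        void *res = NULL;
        pthread_join(tids[i], &res);
        field *f = res;
        if (f != NULL && f->ready && f->w == 8)
            ok++;
    }
    return ok;
}

int field_init_count(void)
{
    return init_calls;
}
```

An alternative with the same effect would be to initialise all required fields from a single thread before any encoding starts, which avoids touching the hot path at all.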

Cheers

On 07/04/2014 20:29, Loic Dachary wrote:
> [re-adding the list for the record]
> 
> On 07/04/2014 19:53, Kevin Greenan wrote:> Hey Loic,
>>
>> BTW, you can get an illegal instruction fault if you are calling an intrinsic that is not supported on a particular platform.  Is the code being compiled on a platform that is different than the machines in your test harness?
>>
> 
> The plugin is compiled with three kinds of flags:
> 
> https://github.com/ceph/ceph/blob/firefly/src/erasure-code/jerasure/Makefile.am#L50
> 
> at runtime the appropriate binary is loaded depending on the CPU features
> 
> https://github.com/ceph/ceph/blob/firefly/src/erasure-code/jerasure/ErasureCodePluginSelectJerasure.cc#L42
> 
> and the logs confirm that jerasure_sse4 is used in this particular case. All tests were run on machines tested to have the required CPU features in
> 
> https://github.com/ceph/ceph/blob/firefly/src/arch/intel.c#L10
> 
> Do you see something missing ?
> 
> Cheers
> 
>> -kevin
>>
>>
>> On Sun, Apr 6, 2014 at 12:06 PM, Loic Dachary <loic@dachary.org> wrote:
>>
>>
>>
>>     On 06/04/2014 18:28, Kevin Greenan wrote:
>>     > Hey Loic,
>>     >
>>     > Did this stuff start happening after a specific commit (or commits)?  I see this bug was opened 6 days ago and some changes to your fork as of 7 days ago...
>>     >
>>     > Or is this the first time you have run these tests with the new Jerasure backend?
>>
>>     It's the first time we run tests with gf-complete / jerasure optimized (i.e. all flags from https://github.com/ceph/ceph/blob/master/m4/ax_intel.m4 are set because the compiler knows how and it's targeting x86_64). Before that and during three or four weeks we ran jerasure / gf-complete without any optimization. Before that we ran the previous jerasure version without gf-complete.
>>
>>     Cheers
>>
>>     >
>>     > Thanks,
>>     > -kevin
>>     >
>>     >
>>     > On Apr 6, 2014, at 3:12 AM, Loic Dachary wrote:
>>     >
>>     >> Hi,
>>     >>
>>     >> An illegal instruction this time http://tracker.ceph.com/issues/7914#note-31 . Since the workload is slightly different, I'm trying to run it 30 times and see if that triggers the problem.
>>     >>
>>     >> Cheers
>>     >>
>>     >> On 02/04/2014 20:15, Kevin Greenan wrote:
>>     >>> OK, it looks like this happens when the GF backend is first initialized (unless, like Loic pointed out, something is corrupted).
>>     >>>
>>     >>> Is this consistently happening for carry-free multiply and w=32 (i.e. gf_w32_cfm_init)?
>>     >>>
>>     >>> Can you send me a core + binary, so I can dig in gdb?
>>     >>>
>>     >>> -kevin
>>     >>>
>>     >>>
>>     >>> On Wed, Apr 2, 2014 at 11:01 AM, Sage Weil <sage@inktank.com> wrote:
>>     >>>
>>     >>>    On Wed, 2 Apr 2014, Loic Dachary wrote:
>>     >>>>
>>     >>>>
>>     >>>> On 02/04/2014 19:44, Kevin Greenan wrote:
>>     >>>>> Hey Loic,
>>     >>>>>
>>     >>>>> Are you ensuring that Jerasure (actually gf-complete) is getting memory buffers aligned on 16-byte boundaries?  Without looking too deep, that is the first thing I would check.
>>     >>>>>
>>     >>>>
>>     >>>> Yes
>>     >>>>
>>     >>>> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L32
>>     >>>> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L242
>>     >>>> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L65
>>     >>>> https://github.com/ceph/ceph/blob/master/src/erasure-code/jerasure/ErasureCodeJerasure.cc#L108
>>     >>>>
>>     >>>
>>     >>>    In this case they are 2K aligned:
>>     >>>
>>     >>>    (gdb) p data_ptrs[0]
>>     >>>    $1 = 0x3e46000 "I'm the", ' ' <repeats 16 times>, "3th object!"
>>     >>>    (gdb) p data_ptrs[1]
>>     >>>    $2 = 0x3e46800 'z' <repeats 200 times>...
>>     >>>    (gdb) p coding_ptrs[0]
>>     >>>    $3 = 0x338e000 "I'm the", ' ' <repeats 16 times>, "3th object!"
>>     >>>
>>     >>>    sage
>>     >>>
>>     >>>> I'll re-read this logic tomorrow just to be sure.
>>     >>>>
>>     >>>> Cheers
>>     >>>>
>>     >>>>> I can have a deeper look later today or tomorrow.
>>     >>>>>
>>     >>>>> -kevin
>>     >>>>>
>>     >>>>>
>>     >>>>> On Wed, Apr 2, 2014 at 10:35 AM, Loic Dachary <loic@dachary.org> wrote:
>>     >>>>>
>>     >>>>>    Hi Kevin,
>>     >>>>>
>>     >>>>>    In the context of http://tracker.ceph.com/issues/7914 we're trying to figure out why jerasure dumps core. We don't know how to reproduce it yet (ran dozens of identical tests suites with no such crash in the past few days, which is to be expected for rare bugs because the test suite introduces random errors / failures on purpose).
>>     >>>>>
>>     >>>>>    The full stack trace is at http://tracker.ceph.com/issues/7914#note-24 but the relevant part is here:
>>     >>>>>
>>     >>>>>    #0  0x00007f4756779b7b in raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
>>     >>>>>    #1  0x0000000000981b4e in reraise_fatal (signum=11) at global/signal_handler.cc:59
>>     >>>>>    #2  handle_fatal_signal (signum=11) at global/signal_handler.cc:105
>>     >>>>>    #3  <signal handler called>
>>     >>>>>    #4  0x0000000000000000 in ?? ()
>>     >>>>>    #5  0x00007f47385ae6b1 in jerasure_matrix_dotprod (k=2, w=8, matrix_row=0x31513a8, src_ids=0x0, dest_id=<optimized out>, data_ptrs=0x7f4741ec7a00, coding_ptrs=0x7f4741ec7a10,
>>     >>>>>        size=2048) at erasure-code/jerasure/jerasure/src/jerasure.c:607
>>     >>>>>    #6  0x00007f47385ae7d6 in jerasure_matrix_encode (k=2, m=1, w=8, matrix=<optimized out>, data_ptrs=0x7f4741ec7a00, coding_ptrs=0x7f4741ec7a10, size=2048)
>>     >>>>>        at erasure-code/jerasure/jerasure/src/jerasure.c:310
>>     >>>>>    ...
>>     >>>>>
>>     >>>>>    Note that this jerasure/gf-complete combination has been compiled with SSE4.1, SSE4.2, PCLMUL, SSSE3, SSE3, SSE2, SSE flags activated. These are jerasure v2 and gf-complete v1, only slightly modified as found in https://github.com/ceph/jerasure/tree/v2-ceph and https://github.com/ceph/gf-complete/tree/v1-ceph (all commits there have a pending pull request under https://bitbucket.org/jimplank/gf-complete https://bitbucket.org/jimplank/jerasure, nothing you've not seen before).
>>     >>>>>
>>     >>>>>    #5 is https://github.com/ceph/jerasure/blob/v2-ceph/src/jerasure.c#L607
>>     >>>>>
>>     >>>>>    and then it dives into gf-complete and most probably destroyed part of the stack when corrupting memory. I'll be chasing this tomorrow. If you have a brilliant idea on why that happens, I'll take it ;-)
>>     >>>>>
>>     >>>>>    Cheers
>>     >>>>>
>>     >>>>>    --
>>     >>>>>    Loïc Dachary, Artisan Logiciel Libre
>>     >>>>>
>>     >>>>>
>>     >>>>
>>     >>>> --
>>     >>>> Loïc Dachary, Artisan Logiciel Libre
>>     >>>>
>>     >>>>
>>     >>>
>>     >>>
>>     >>
>>     >> --
>>     >> Loïc Dachary, Artisan Logiciel Libre
>>     >>
>>     >
>>
>>     --
>>     Loïc Dachary, Artisan Logiciel Libre
>>
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]


end of thread (last message: 2014-04-07 18:56 UTC)

Thread overview: 8+ messages
2014-04-02 17:35 jerasure/gf-complete segmentation violation Loic Dachary
2014-04-02 17:51 ` Loic Dachary
2014-04-02 22:57   ` Loic Dachary
     [not found] ` <CA+AFVBg00yTzu-VGxSURDxv_UWOmZJEF+077txButeoSkoQuBg@mail.gmail.com>
2014-04-02 17:56   ` Loic Dachary
2014-04-02 18:01     ` Sage Weil
     [not found]       ` <CA+AFVBgVXsTLJuGh-FrJMx3ee11Ztf=g+B9gnHybg9EXwunfnw@mail.gmail.com>
2014-04-06 10:12         ` Loic Dachary
     [not found]           ` <D590780E-5F28-4ADA-B9F5-E2E14C9C0D27@gmail.com>
     [not found]             ` <5341A5C3.8090802@dachary.org>
     [not found]               ` <CA+AFVBjomjD_oReuEcrkpR-y5CSLw7cCjOEa3T+XHHGieT+=Hg@mail.gmail.com>
2014-04-07 18:29                 ` Loic Dachary
2014-04-07 18:56                   ` Loic Dachary
