* radosgw crash within libfcgi
@ 2015-06-24 17:09 GuangYang
  2015-06-24 18:40 ` Yehuda Sadeh-Weinraub
  0 siblings, 1 reply; 8+ messages in thread
From: GuangYang @ 2015-06-24 17:09 UTC (permalink / raw)
  To: ceph-devel, ceph-users, yehuda

Hello Cephers,
Recently we have had several radosgw daemon crashes, each with the following kernel log:

Jun 23 14:17:38 xxx kernel: radosgw[68180]: segfault at f0 ip 00007ffa069996f2 sp 00007ff55c432710 error 6 in libfcgi.so.0.0.0[7ffa06995000+a000] in libfcgi.so.0.0.0[7ffa06995000+a000]

Looking at the assembly, it seems to be crashing at this point - http://github.com/sknown/fcgi/blob/master/libfcgi/fcgiapp.c#L2035, which confuses me. I tried to find any other reference holding the FCGX_Request that might release the handle, without any luck.

There are also other observations:
 1> Several radosgw daemons across different hosts crashed around the same time.
 2> Apache's error log has some fcgi errors complaining about ##idle timeout## during that time.

Has anyone experienced a similar issue?

Thanks,
Guang
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: radosgw crash within libfcgi
  2015-06-24 17:09 radosgw crash within libfcgi GuangYang
@ 2015-06-24 18:40 ` Yehuda Sadeh-Weinraub
  2015-06-24 20:53   ` GuangYang
  0 siblings, 1 reply; 8+ messages in thread
From: Yehuda Sadeh-Weinraub @ 2015-06-24 18:40 UTC (permalink / raw)
  To: GuangYang; +Cc: ceph-devel, ceph-users




In the past we've had issues with libfcgi that were related to the number of open fds on the process (> 1024). The issue was a buggy libfcgi that was using select() instead of poll(), so this might be the issue you're noticing.

Yehuda
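
For context on the select() limitation mentioned above: on Linux with glibc, an fd_set is a fixed bitmap of FD_SETSIZE (1024) bits, so a descriptor numbered 1024 or higher has no slot in it, whereas poll() takes an array of descriptors and has no such ceiling. A minimal Python sketch of that arithmetic follows. The helper names are hypothetical, and the layout is simplified to one bit per descriptor in a flat byte array (glibc actually stores the bits in an array of longs).

```python
# Simplified model of a glibc fd_set on Linux: a fixed bitmap of FD_SETSIZE bits.
FD_SETSIZE = 1024                # glibc default on Linux
FD_SET_BYTES = FD_SETSIZE // 8   # 128 bytes of storage


def fd_set_slot(fd):
    """Return (byte index, bit index) that setting this fd would touch."""
    return fd // 8, fd % 8


def fits_in_fd_set(fd):
    """True if the descriptor has a slot inside the fixed-size bitmap."""
    return 0 <= fd < FD_SETSIZE


for fd in (3, 1023, 1024, 1500):
    byte_index, bit = fd_set_slot(fd)
    status = "ok" if fits_in_fd_set(fd) else "outside the bitmap"
    print(f"fd {fd}: byte {byte_index}, bit {bit} -> {status}")
```

In this model, fd 1024 would land in byte 128, one past the end of the 128-byte bitmap, which is why a process with many open descriptors can corrupt memory when a library still uses select().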

* RE: radosgw crash within libfcgi
  2015-06-24 18:40 ` Yehuda Sadeh-Weinraub
@ 2015-06-24 20:53   ` GuangYang
  2015-06-24 21:04     ` Yehuda Sadeh-Weinraub
  0 siblings, 1 reply; 8+ messages in thread
From: GuangYang @ 2015-06-24 20:53 UTC (permalink / raw)
  To: Yehuda Sadeh-Weinraub; +Cc: ceph-devel, ceph-users

Thanks Yehuda for the response.

We already patched libfcgi to use poll instead of select to overcome the limitation.

Thanks,
Guang



* Re: radosgw crash within libfcgi
  2015-06-24 20:53   ` GuangYang
@ 2015-06-24 21:04     ` Yehuda Sadeh-Weinraub
  2015-06-24 21:12       ` GuangYang
  0 siblings, 1 reply; 8+ messages in thread
From: Yehuda Sadeh-Weinraub @ 2015-06-24 21:04 UTC (permalink / raw)
  To: GuangYang; +Cc: ceph-devel, ceph-users



> >> Jun 23 14:17:38 xxx kernel: radosgw[68180]: segfault at f0 ip
> >> 00007ffa069996f2 sp 00007ff55c432710 error 6 in

Error 6 is SIGABRT, right? With an invalid pointer I'd expect to get a segfault. Is the pointer actually invalid?

Yehuda


> >> libfcgi.so.0.0.0[7ffa06995000+a000] in libfcgi.so.0.0.0[7ffa06995000+a000]
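
For reference, the "error 6" field in this kind of kernel segfault line is the x86 page-fault error code, a bit mask, rather than a signal number (SIGABRT also happens to be numbered 6 on Linux, which makes the two easy to confuse). A minimal sketch that decodes the three low bits, assuming the standard x86 layout; the function name is hypothetical:

```python
def decode_page_fault_error(code):
    """Decode the three low bits of an x86 page-fault error code."""
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "cause": "protection violation" if code & 0x1 else "page not present",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if code & 0x2 else "read",
        # bit 2: 0 = kernel mode, 1 = user mode
        "mode": "user" if code & 0x4 else "kernel",
    }


print(decode_page_fault_error(6))
# {'cause': 'page not present', 'access': 'write', 'mode': 'user'}
```

Read this way, error 6 would be a user-mode write to an unmapped address, which is consistent with a store to the very low address f0 reported in the log.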

* RE: radosgw crash within libfcgi
  2015-06-24 21:04     ` Yehuda Sadeh-Weinraub
@ 2015-06-24 21:12       ` GuangYang
  2015-06-24 21:21         ` Yehuda Sadeh-Weinraub
  0 siblings, 1 reply; 8+ messages in thread
From: GuangYang @ 2015-06-24 21:12 UTC (permalink / raw)
  To: Yehuda Sadeh-Weinraub; +Cc: ceph-devel, ceph-users

>>>> Jun 23 14:17:38 xxx kernel: radosgw[68180]: segfault at f0 ip
>>>> 00007ffa069996f2 sp 00007ff55c432710 error 6 in
>
> Error 6 is SIGABRT, right? With an invalid pointer I'd expect to get a segfault. Is the pointer actually invalid?
Using (ip - {load_address_of_the_shared_library}) to find the instruction that caused this crash, objdump shows the crash happened at instruction 46f2 (see below), which assigns -1 to FCGX_Request::ipcFd, but I don't quite understand how or why it could crash there.

0000000000004690 <FCGX_Free>:
    4690:       48 89 5c 24 f0          mov    %rbx,-0x10(%rsp)
    4695:       48 89 6c 24 f8          mov    %rbp,-0x8(%rsp)
    469a:       48 83 ec 18             sub    $0x18,%rsp
    469e:       48 85 ff                test   %rdi,%rdi
    46a1:       48 89 fb                mov    %rdi,%rbx
    46a4:       89 f5                   mov    %esi,%ebp
    46a6:       74 28                   je     46d0 <FCGX_Free+0x40>
    46a8:       48 8d 7f 08             lea    0x8(%rdi),%rdi
    46ac:       e8 67 e3 ff ff          callq  2a18 <FCGX_FreeStream@plt>
    46b1:       48 8d 7b 10             lea    0x10(%rbx),%rdi
    46b5:       e8 5e e3 ff ff          callq  2a18 <FCGX_FreeStream@plt>
    46ba:       48 8d 7b 18             lea    0x18(%rbx),%rdi
    46be:       e8 55 e3 ff ff          callq  2a18 <FCGX_FreeStream@plt>
    46c3:       48 8d 7b 28             lea    0x28(%rbx),%rdi
    46c7:       e8 d4 f4 ff ff          callq  3ba0 <FCGX_PutS+0x40>
    46cc:       85 ed                   test   %ebp,%ebp
    46ce:       75 10                   jne    46e0 <FCGX_Free+0x50>
    46d0:       48 8b 5c 24 08          mov    0x8(%rsp),%rbx
    46d5:       48 8b 6c 24 10          mov    0x10(%rsp),%rbp
    46da:       48 83 c4 18             add    $0x18,%rsp
    46de:       c3                      retq   
    46df:       90                      nop
    46e0:       31 f6                   xor    %esi,%esi
    46e2:       83 7b 4c 00             cmpl   $0x0,0x4c(%rbx)
    46e6:       8b 7b 30                mov    0x30(%rbx),%edi
    46e9:       40 0f 94 c6             sete   %sil
    46ed:       e8 86 e6 ff ff          callq  2d78 <OS_IpcClose@plt>
    46f2:       c7 43 30 ff ff ff ff    movl   $0xffffffff,0x30(%rbx)
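
The offset calculation described above (ip minus the library's load address) can be sketched in Python as follows. The log text is taken from the first message; the regular expression and the function name are illustrative assumptions, not part of any existing tool.

```python
import re

LOG = ("radosgw[68180]: segfault at f0 ip 00007ffa069996f2 "
       "sp 00007ff55c432710 error 6 in libfcgi.so.0.0.0[7ffa06995000+a000]")

# Fields of the kernel segfault line; all numbers are printed in hex.
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def library_offset(line):
    """Return (library name, offset of the faulting ip inside its mapping)."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    size = int(m.group("size"), 16)
    offset = ip - base
    if not 0 <= offset < size:
        raise ValueError("ip is outside the reported mapping")
    return m.group("lib"), offset


lib, offset = library_offset(LOG)
print(lib, hex(offset))  # libfcgi.so.0.0.0 0x46f2
```

The resulting offset is the address to look up in the objdump output of the same library build.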

* Re: radosgw crash within libfcgi
  2015-06-24 21:12       ` GuangYang
@ 2015-06-24 21:21         ` Yehuda Sadeh-Weinraub
  2015-06-24 21:40           ` Yehuda Sadeh-Weinraub
       [not found]           ` <935875968.19447031.1435180864559.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 2 replies; 8+ messages in thread
From: Yehuda Sadeh-Weinraub @ 2015-06-24 21:21 UTC (permalink / raw)
  To: GuangYang; +Cc: ceph-devel, ceph-users



> Using (ip - {load_address_of_the_shared_library}) to find the instruction that
> caused this crash, objdump shows the crash happened at instruction 46f2
> (see below), which assigns -1 to FCGX_Request::ipcFd, but I don't quite
> understand how or why it could crash there.
> 
> 0000000000004690 <FCGX_Free>:
>     4690:       48 89 5c 24 f0          mov    %rbx,-0x10(%rsp)
>     4695:       48 89 6c 24 f8          mov    %rbp,-0x8(%rsp)
>     469a:       48 83 ec 18             sub    $0x18,%rsp
>     469e:       48 85 ff                test   %rdi,%rdi
>     46a1:       48 89 fb                mov    %rdi,%rbx
>     46a4:       89 f5                   mov    %esi,%ebp
>     46a6:       74 28                   je     46d0 <FCGX_Free+0x40>
>     46a8:       48 8d 7f 08             lea    0x8(%rdi),%rdi
>     46ac:       e8 67 e3 ff ff          callq  2a18 <FCGX_FreeStream@plt>
>     46b1:       48 8d 7b 10             lea    0x10(%rbx),%rdi
>     46b5:       e8 5e e3 ff ff          callq  2a18 <FCGX_FreeStream@plt>
>     46ba:       48 8d 7b 18             lea    0x18(%rbx),%rdi
>     46be:       e8 55 e3 ff ff          callq  2a18 <FCGX_FreeStream@plt>
>     46c3:       48 8d 7b 28             lea    0x28(%rbx),%rdi
>     46c7:       e8 d4 f4 ff ff          callq  3ba0 <FCGX_PutS+0x40>
>     46cc:       85 ed                   test   %ebp,%ebp
>     46ce:       75 10                   jne    46e0 <FCGX_Free+0x50>
>     46d0:       48 8b 5c 24 08          mov    0x8(%rsp),%rbx
>     46d5:       48 8b 6c 24 10          mov    0x10(%rsp),%rbp
>     46da:       48 83 c4 18             add    $0x18,%rsp
>     46de:       c3                      retq
>     46df:       90                      nop
>     46e0:       31 f6                   xor    %esi,%esi
>     46e2:       83 7b 4c 00             cmpl   $0x0,0x4c(%rbx)
>     46e6:       8b 7b 30                mov    0x30(%rbx),%edi
>     46e9:       40 0f 94 c6             sete   %sil
>     46ed:       e8 86 e6 ff ff          callq  2d78 <OS_IpcClose@plt>
>     46f2:       c7 43 30 ff ff ff ff    movl   $0xffffffff,0x30(%rbx)

Do you have the output of info registers?

I'm not too familiar with this specific message, but it could be that OS_IpcClose() aborts (not highly unlikely) and the log only shows the return address in the current function (though that shouldn't be reported as ip).

What's rbx? Is the memory at %rbx + 0x30 valid?

Also, did you by any chance upgrade the binaries while the code was running? Is the code running over NFS?

Yehuda
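
On the rbx question, the kernel line alone may already give a hint, assuming the faulting instruction really is the "movl $0xffffffff,0x30(%rbx)" at 46f2 and that "segfault at f0" is the address it tried to write. The effective address would then be %rbx + 0x30, so %rbx would have been 0xc0. A short sketch of that arithmetic (the names are illustrative):

```python
FAULT_ADDRESS = 0xF0   # "segfault at f0" from the kernel log
DISPLACEMENT = 0x30    # from "movl $0xffffffff,0x30(%rbx)"


def implied_base_register(fault_address, displacement):
    """Base register value implied by a fault on displacement(%reg)."""
    return fault_address - displacement


rbx = implied_base_register(FAULT_ADDRESS, DISPLACEMENT)
print(hex(rbx))  # 0xc0
```

Since the same register is dereferenced at 46e2 and 46e6 just before the call, apparently without faulting, a value of 0xc0 afterwards would suggest that %rbx changed across the OS_IpcClose() call. This is only a hypothesis that a register dump could confirm or rule out.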


* Re: radosgw crash within libfcgi
  2015-06-24 21:21         ` Yehuda Sadeh-Weinraub
@ 2015-06-24 21:40           ` Yehuda Sadeh-Weinraub
       [not found]           ` <935875968.19447031.1435180864559.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 0 replies; 8+ messages in thread
From: Yehuda Sadeh-Weinraub @ 2015-06-24 21:40 UTC (permalink / raw)
  To: GuangYang; +Cc: ceph-devel, ceph-users

Also, looking at the code, I see an extra call to FCGX_Finish_r():

diff --git a/src/rgw/rgw_main.cc b/src/rgw/rgw_main.cc
index 9a8aa5f..0aa7ded 100644
--- a/src/rgw/rgw_main.cc
+++ b/src/rgw/rgw_main.cc
@@ -669,8 +669,6 @@ void RGWFCGXProcess::handle_request(RGWRequest *r)
     dout(20) << "process_request() returned " << ret << dendl;
   }
 
-  FCGX_Finish_r(fcgx);
-
   delete req;
 }
 

Maybe this is a problem with the specific libfcgi version that you're using?


* Re: radosgw crash within libfcgi
       [not found]           ` <935875968.19447031.1435180864559.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-06-26 17:38             ` GuangYang
  0 siblings, 0 replies; 8+ messages in thread
From: GuangYang @ 2015-06-26 17:38 UTC (permalink / raw)
  To: Yehuda Sadeh-Weinraub
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users-idqoXFIVOFJgJs9I8MT0rw



Sadly we don't have a core dump from when the crash happened, so we are not able to dump the registers.

The latest status - we reduced the rgw thread count from 600 to 300 and haven't seen the same crash since, but it is still hard to tell whether that is related, and if so, how.

Thanks,
Guang

> Date: Wed, 24 Jun 2015 17:21:04 -0400
> From: yehuda@redhat.com
> To: yguang11@outlook.com
> CC: ceph-devel@vger.kernel.org; ceph-users@lists.ceph.com
> Subject: Re: radosgw crash within libfcgi
> 
> 
> 
> ----- Original Message -----
>> From: "GuangYang" 
>> To: "Yehuda Sadeh-Weinraub" 
>> Cc: ceph-devel@vger.kernel.org, ceph-users@lists.ceph.com
>> Sent: Wednesday, June 24, 2015 2:12:23 PM
>> Subject: RE: radosgw crash within libfcgi
>> 
>> ----------------------------------------
>>> Date: Wed, 24 Jun 2015 17:04:05 -0400
>>> From: yehuda@redhat.com
>>> To: yguang11@outlook.com
>>> CC: ceph-devel@vger.kernel.org; ceph-users@lists.ceph.com
>>> Subject: Re: radosgw crash within libfcgi
>>>
>>>
>>>
>>> ----- Original Message -----
>>>> From: "GuangYang" 
>>>> To: "Yehuda Sadeh-Weinraub" 
>>>> Cc: ceph-devel@vger.kernel.org, ceph-users@lists.ceph.com
>>>> Sent: Wednesday, June 24, 2015 1:53:20 PM
>>>> Subject: RE: radosgw crash within libfcgi
>>>>
>>>> Thanks Yehuda for the response.
>>>>
>>>> We already patched libfcgi to use poll instead of select to overcome the
>>>> limitation.
>>>>
>>>> Thanks,
>>>> Guang
>>>>
>>>>
>>>> ----------------------------------------
>>>>> Date: Wed, 24 Jun 2015 14:40:25 -0400
>>>>> From: yehuda@redhat.com
>>>>> To: yguang11@outlook.com
>>>>> CC: ceph-devel@vger.kernel.org; ceph-users@lists.ceph.com
>>>>> Subject: Re: radosgw crash within libfcgi
>>>>>
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "GuangYang" 
>>>>>> To: ceph-devel@vger.kernel.org, ceph-users@lists.ceph.com,
>>>>>> yehuda@redhat.com
>>>>>> Sent: Wednesday, June 24, 2015 10:09:58 AM
>>>>>> Subject: radosgw crash within libfcgi
>>>>>>
>>>>>> Hello Cephers,
>>>>>> Recently we have several radosgw daemon crashes with the same following
>>>>>> kernel log:
>>>>>>
>>>>>> Jun 23 14:17:38 xxx kernel: radosgw[68180]: segfault at f0 ip
>>>>>> 00007ffa069996f2 sp 00007ff55c432710 error 6 in
>>>
>>> error 6 is sigabrt, right? With invalid pointer I'd expect to get segfault.
>>> Is the pointer actually invalid?
>> With (ip - {address_load_the_sharded_library}) to get the instruction which
>> caused this crash, the objdump shows the crash happened at instruction 46f2
>> (see below), which was to assign '-1' to the CGX_Request::ipcFd to -1, but I
>> don't quite understand how/why it could crash there.
>> 
>> 0000000000004690 :
>>     4690:       48 89 5c 24 f0          mov    %rbx,-0x10(%rsp)
>>     4695:       48 89 6c 24 f8          mov    %rbp,-0x8(%rsp)
>>     469a:       48 83 ec 18             sub    $0x18,%rsp
>>     469e:       48 85 ff                test   %rdi,%rdi
>>     46a1:       48 89 fb                mov    %rdi,%rbx
>>     46a4:       89 f5                   mov    %esi,%ebp
>>     46a6:       74 28                   je     46d0 
>>     46a8:       48 8d 7f 08             lea    0x8(%rdi),%rdi
>>     46ac:       e8 67 e3 ff ff          callq  2a18 
>>     46b1:       48 8d 7b 10             lea    0x10(%rbx),%rdi
>>     46b5:       e8 5e e3 ff ff          callq  2a18 
>>     46ba:       48 8d 7b 18             lea    0x18(%rbx),%rdi
>>     46be:       e8 55 e3 ff ff          callq  2a18 
>>     46c3:       48 8d 7b 28             lea    0x28(%rbx),%rdi
>>     46c7:       e8 d4 f4 ff ff          callq  3ba0 
>>     46cc:       85 ed                   test   %ebp,%ebp
>>     46ce:       75 10                   jne    46e0 
>>     46d0:       48 8b 5c 24 08          mov    0x8(%rsp),%rbx
>>     46d5:       48 8b 6c 24 10          mov    0x10(%rsp),%rbp
>>     46da:       48 83 c4 18             add    $0x18,%rsp
>>     46de:       c3                      retq
>>     46df:       90                      nop
>>     46e0:       31 f6                   xor    %esi,%esi
>>     46e2:       83 7b 4c 00             cmpl   $0x0,0x4c(%rbx)
>>     46e6:       8b 7b 30                mov    0x30(%rbx),%edi
>>     46e9:       40 0f 94 c6             sete   %sil
>>     46ed:       e8 86 e6 ff ff          callq  2d78 
>>     46f2:       c7 43 30 ff ff ff ff    movl   $0xffffffff,0x30(%rbx)
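
The offset arithmetic described above can be sketched in a few lines. This is a minimal illustration in Python, not something posted in the thread: the regular expression, the function name decode_segfault, and the returned field names are assumptions made for the example. It parses the kernel line quoted earlier, subtracts the mapping base from ip, and decodes the x86 page-fault error code bits (bit 0: page present, bit 1: write access, bit 2: user mode).

```python
import re

# Kernel log line as quoted in this thread (wrapped lines joined).
LOG = ("radosgw[68180]: segfault at f0 ip 00007ffa069996f2 "
       "sp 00007ff55c432710 error 6 in libfcgi.so.0.0.0[7ffa06995000+a000]")

# Hypothetical pattern for the x86-64 "segfault at ..." message format.
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Extract the fault address, library offset and error-code bits."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a recognised segfault line")
    err = int(m["err"], 16)
    return {
        "lib": m["lib"],
        "fault_addr": int(m["addr"], 16),
        # Offset to look up in `objdump -d` output of the shared library.
        "offset": int(m["ip"], 16) - int(m["base"], 16),
        "page_present": bool(err & 1),  # 0 means the page was not mapped
        "write": bool(err & 2),         # 1 means the access was a write
        "user_mode": bool(err & 4),     # 1 means the fault came from user space
    }


info = decode_segfault(LOG)
print(hex(info["offset"]), info)
```

For the line in this thread the offset comes out as 0x46f2, matching the movl in the listing, and error 6 decodes as a user-mode write to a page that is not mapped. If that movl is indeed the faulting instruction, a fault address of 0xf0 with displacement 0x30 would imply that %rbx held 0xc0, which may be worth checking against the output of info registers.
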
> 
> info registers?
> 
> Not too familiar with the specific message, but it could be that OS_IpcClose() aborts (not highly unlikely) and it only dumps the return address of the current function (shouldn't be referenced as ip though).
> 
> What's rbx? Is the memory at %rbx + 0x30 valid?
> 
> Also, did you by any chance upgrade the binaries while the code was running? Is the code running over NFS?
> 
> Yehuda
> 
>>>
>>> Yehuda
>>>
>>>
>>>>>> libfcgi.so.0.0.0[7ffa06995000+a000] in
>>>>>> libfcgi.so.0.0.0[7ffa06995000+a000]
>>>>>>
>>>>>> Looking at the assembly, it seems crashing at this point -
>>>>>> http://github.com/sknown/fcgi/blob/master/libfcgi/fcgiapp.c#L2035, which
>>>>>> confused me. I tried to see if there is any other reference holding the
>>>>>> FCGX_Request which release the handle without any luck.
>>>>>>
>>>>>> There are also other observations:
>>>>>> 1> Several radosgw daemon across different hosts crashed around the same
>>>>>> time.
>>>>>> 2> Apache's error log has some fcgi error complaining ##idle timeout##
>>>>>> during the time.
>>>>>>
>>>>>> Does anyone experience similar issue?
>>>>>>
>>>>>
>>>>> In the past we've had issues with libfcgi that were related to the number
>>>>> of open fds on the process (> 1024). The issue was a buggy libfcgi that
>>>>> was using select() instead of poll(), so this might be the issue you're
>>>>> noticing.
>>>>>
>>>>> Yehuda
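
As background to the select()/poll() point: select() tracks descriptors in a fixed-size fd_set bitmap (FD_SETSIZE, typically 1024 on Linux), so a descriptor numbered at or above that limit cannot be represented, whereas poll() takes an array of descriptors and has no such ceiling. The sketch below only illustrates a poll-based wait in Python; it is not the libfcgi patch mentioned in this thread, and the helper name wait_readable is an assumption made for the example.

```python
import os
import select


def wait_readable(fd, timeout_ms):
    """Return True if fd becomes readable within timeout_ms milliseconds.

    Uses poll(), which accepts any valid descriptor number, instead of
    select(), which is limited to descriptors below FD_SETSIZE.
    """
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    events = poller.poll(timeout_ms)
    return any(ev_fd == fd and (mask & select.POLLIN) for ev_fd, mask in events)


# Usage: a pipe is readable only after something has been written to it.
r, w = os.pipe()
print(wait_readable(r, 50))   # nothing written yet
os.write(w, b"x")
print(wait_readable(r, 50))   # data is pending
os.close(r)
os.close(w)
```

A process that holds more than about a thousand open descriptors, as a busy radosgw may, would need this kind of poll-based wait wherever a descriptor number can exceed the select() limit.
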
>> 
 		 	   		  

[-- Attachment #1.2: Type: text/html, Size: 7412 bytes --]

[-- Attachment #2: Type: text/plain, Size: 178 bytes --]

_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


end of thread, other threads:[~2015-06-26 17:38 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-24 17:09 radosgw crash within libfcgi GuangYang
2015-06-24 18:40 ` Yehuda Sadeh-Weinraub
2015-06-24 20:53   ` GuangYang
2015-06-24 21:04     ` Yehuda Sadeh-Weinraub
2015-06-24 21:12       ` GuangYang
2015-06-24 21:21         ` Yehuda Sadeh-Weinraub
2015-06-24 21:40           ` Yehuda Sadeh-Weinraub
     [not found]           ` <935875968.19447031.1435180864559.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-06-26 17:38             ` GuangYang
