* OSD crashes (10.2.9)
From: Wyllys Ingersoll @ 2017-09-19 15:28 UTC
  To: Ceph Development

I'm seeing this stack trace in a lot of my OSDs (21 out of 92).  I
suspect it's a corrupt leveldb or journal, but I'm not sure how to debug
it further.  Any suggestions?

 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
 1: (()+0x984c4e) [0x56032b65ec4e]
 2: (()+0x11390) [0x7f89adce8390]
 3: (gsignal()+0x38) [0x7f89abc86428]
 4: (abort()+0x16a) [0x7f89abc8802a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x26b) [0x56032b75f0db]
 6: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char
const*, long)+0x259) [0x56032b69b2d9]
 7: (ceph::HeartbeatMap::is_healthy()+0xe6) [0x56032b69bc06]
 8: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x56032b69c45c]
 9: (CephContextServiceThread::entry()+0x167) [0x56032b777777]
 10: (()+0x76ba) [0x7f89adcde6ba]
 11: (clone()+0x6d) [0x7f89abd5782d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

* Re: OSD crashes (10.2.9)
From: Nathan Cutler @ 2017-09-19 15:57 UTC
  To: Wyllys Ingersoll, Ceph Development

Which version of leveldb?

-- 
Nathan Cutler
Software Engineer Distributed Storage
SUSE LINUX, s.r.o.
Tel.: +420 284 084 037

* Re: OSD crashes (10.2.9)
From: Wyllys Ingersoll @ 2017-09-19 16:01 UTC
  To: Nathan Cutler; +Cc: Ceph Development

The libleveldb library is 1.18-5. Is that what you are looking for?

$ dpkg -l|grep leveldb
ii  libleveldb1v5:amd64  1.18-5  amd64  fast key-value storage library

* Re: OSD crashes (10.2.9)
From: Sage Weil @ 2017-09-19 17:08 UTC
  To: Wyllys Ingersoll; +Cc: Ceph Development

On Tue, 19 Sep 2017, Wyllys Ingersoll wrote:
> I'm seeing this stack trace in a lot of my OSDs (21 out of 92).  I
> suspect it's a corrupt leveldb or journal, but I'm not sure how to debug
> it further.  Any suggestions?
> 
>  ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
>  1: (()+0x984c4e) [0x56032b65ec4e]
>  2: (()+0x11390) [0x7f89adce8390]
>  3: (gsignal()+0x38) [0x7f89abc86428]
>  4: (abort()+0x16a) [0x7f89abc8802a]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x26b) [0x56032b75f0db]

The assertion itself is a few lines earlier in the log. Can you include
that, please?

Thanks!
sage

>  6: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char
> const*, long)+0x259) [0x56032b69b2d9]
>  7: (ceph::HeartbeatMap::is_healthy()+0xe6) [0x56032b69bc06]
>  8: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x56032b69c45c]
>  9: (CephContextServiceThread::entry()+0x167) [0x56032b777777]
>  10: (()+0x76ba) [0x7f89adcde6ba]
>  11: (clone()+0x6d) [0x7f89abd5782d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.

* Re: OSD crashes (10.2.9)
From: Wyllys Ingersoll @ 2017-09-19 17:16 UTC
  To: Sage Weil; +Cc: Ceph Development

It appears to just be getting an abort signal; I don't see any other assertions.



--- begin dump of recent events ---
   -40> 2017-09-19 12:18:26.520895 7f2d927bd700  5 osd.81 pg_epoch:
239987 pg[22.15b( empty lb MIN (bitwise) local-les=194057 n=0 ec=19250
les/c/f 239869/239869/0 239984/239984/233424) [62,81,74]/[62,29,74]
r=-1 lpr=239984 pi=178346-239983/179 crt=0'0 remapped NOTIFY] exit
Started/Stray 7.133544 10 0.000349
   -39> 2017-09-19 12:18:26.520976 7f2d927bd700  5 osd.81 pg_epoch:
239987 pg[22.15b( empty lb MIN (bitwise) local-les=194057 n=0 ec=19250
les/c/f 239869/239869/0 239984/239984/233424) [62,81,74]/[62,29,74]
r=-1 lpr=239984 pi=178346-239983/179 crt=0'0 remapped NOTIFY] exit
Started 7.133652 0 0.000000
   -38> 2017-09-19 12:18:26.520984 7f2d927bd700  5 osd.81 pg_epoch:
239987 pg[22.15b( empty lb MIN (bitwise) local-les=194057 n=0 ec=19250
les/c/f 239869/239869/0 239984/239984/233424) [62,81,74]/[62,29,74]
r=-1 lpr=239984 pi=178346-239983/179 crt=0'0 remapped NOTIFY] enter
Reset
   -37> 2017-09-19 12:18:26.521294 7f2d93fc0700  5 write_log with:
dirty_to: 4294967295'18446744073709551615, dirty_from:
4294967295'18446744073709551615, dirty_divergent_priors: true,
divergent_priors: 0, writeout_from: 4294967295'18446744073709551615,
trimmed:
   -36> 2017-09-19 12:18:26.521885 7f2d937bf700  5 osd.81 pg_epoch:
239989 pg[10.19d( empty lb MIN (bitwise) local-les=194033 n=0 ec=1077
les/c/f 239874/239878/0 239984/239984/236323) [72,81,88]/[72,95,88]
r=-1 lpr=239984 pi=172390-239983/422 crt=0'0 remapped NOTIFY] exit
Started/Stray 7.126071 12 0.000463
   -35> 2017-09-19 12:18:26.521901 7f2d937bf700  5 osd.81 pg_epoch:
239989 pg[10.19d( empty lb MIN (bitwise) local-les=194033 n=0 ec=1077
les/c/f 239874/239878/0 239984/239984/236323) [72,81,88]/[72,95,88]
r=-1 lpr=239984 pi=172390-239983/422 crt=0'0 remapped NOTIFY] exit
Started 7.126112 0 0.000000
   -34> 2017-09-19 12:18:26.521907 7f2d937bf700  5 osd.81 pg_epoch:
239989 pg[10.19d( empty lb MIN (bitwise) local-les=194033 n=0 ec=1077
les/c/f 239874/239878/0 239984/239984/236323) [72,81,88]/[72,95,88]
r=-1 lpr=239984 pi=172390-239983/422 crt=0'0 remapped NOTIFY] enter
Reset
   -33> 2017-09-19 12:18:26.523389 7f2d927bd700  5 osd.81 pg_epoch:
239989 pg[22.15b( empty lb MIN (bitwise) local-les=194057 n=0 ec=19250
les/c/f 239869/239869/0 239984/239987/233424) [62,81,74]/[62,74,29]
r=-1 lpr=239987 pi=178346-239986/180 crt=0'0 remapped NOTIFY] exit
Reset 0.002402 3 0.000578
   -32> 2017-09-19 12:18:26.523499 7f2d927bd700  5 osd.81 pg_epoch:
239989 pg[22.15b( empty lb MIN (bitwise) local-les=194057 n=0 ec=19250
les/c/f 239869/239869/0 239984/239987/233424) [62,81,74]/[62,74,29]
r=-1 lpr=239987 pi=178346-239986/180 crt=0'0 remapped NOTIFY] enter
Started
   -31> 2017-09-19 12:18:26.523537 7f2d927bd700  5 osd.81 pg_epoch:
239989 pg[22.15b( empty lb MIN (bitwise) local-les=194057 n=0 ec=19250
les/c/f 239869/239869/0 239984/239987/233424) [62,81,74]/[62,74,29]
r=-1 lpr=239987 pi=178346-239986/180 crt=0'0 remapped NOTIFY] enter
Start
   -30> 2017-09-19 12:18:26.523572 7f2d927bd700  1 osd.81 pg_epoch:
239989 pg[22.15b( empty lb MIN (bitwise) local-les=194057 n=0 ec=19250
les/c/f 239869/239869/0 239984/239987/233424) [62,81,74]/[62,74,29]
r=-1 lpr=239987 pi=178346-239986/180 crt=0'0 remapped NOTIFY]
state<Start>: transitioning to Stray
   -29> 2017-09-19 12:18:26.523619 7f2d927bd700  5 osd.81 pg_epoch:
239989 pg[22.15b( empty lb MIN (bitwise) local-les=194057 n=0 ec=19250
les/c/f 239869/239869/0 239984/239987/233424) [62,81,74]/[62,74,29]
r=-1 lpr=239987 pi=178346-239986/180 crt=0'0 remapped NOTIFY] exit
Start 0.000081 0 0.000000
   -28> 2017-09-19 12:18:26.523657 7f2d927bd700  5 osd.81 pg_epoch:
239989 pg[22.15b( empty lb MIN (bitwise) local-les=194057 n=0 ec=19250
les/c/f 239869/239869/0 239984/239987/233424) [62,81,74]/[62,74,29]
r=-1 lpr=239987 pi=178346-239986/180 crt=0'0 remapped NOTIFY] enter
Started/Stray
   -27> 2017-09-19 12:18:26.524220 7f2d937bf700  5 osd.81 pg_epoch:
239989 pg[10.19d( empty lb MIN (bitwise) local-les=194033 n=0 ec=1077
les/c/f 239874/239878/0 239984/239989/236323) [72,81,88]/[72,88,95]
r=-1 lpr=239989 pi=172390-239988/423 crt=0'0 remapped NOTIFY] exit
Reset 0.002312 1 0.000056
   -26> 2017-09-19 12:18:26.524230 7f2d937bf700  5 osd.81 pg_epoch:
239989 pg[10.19d( empty lb MIN (bitwise) local-les=194033 n=0 ec=1077
les/c/f 239874/239878/0 239984/239989/236323) [72,81,88]/[72,88,95]
r=-1 lpr=239989 pi=172390-239988/423 crt=0'0 remapped NOTIFY] enter
Started
   -25> 2017-09-19 12:18:26.524235 7f2d937bf700  5 osd.81 pg_epoch:
239989 pg[10.19d( empty lb MIN (bitwise) local-les=194033 n=0 ec=1077
les/c/f 239874/239878/0 239984/239989/236323) [72,81,88]/[72,88,95]
r=-1 lpr=239989 pi=172390-239988/423 crt=0'0 remapped NOTIFY] enter
Start
   -24> 2017-09-19 12:18:26.524258 7f2d937bf700  1 osd.81 pg_epoch:
239989 pg[10.19d( empty lb MIN (bitwise) local-les=194033 n=0 ec=1077
les/c/f 239874/239878/0 239984/239989/236323) [72,81,88]/[72,88,95]
r=-1 lpr=239989 pi=172390-239988/423 crt=0'0 remapped NOTIFY]
state<Start>: transitioning to Stray
   -23> 2017-09-19 12:18:26.524297 7f2d937bf700  5 osd.81 pg_epoch:
239989 pg[10.19d( empty lb MIN (bitwise) local-les=194033 n=0 ec=1077
les/c/f 239874/239878/0 239984/239989/236323) [72,81,88]/[72,88,95]
r=-1 lpr=239989 pi=172390-239988/423 crt=0'0 remapped NOTIFY] exit
Start 0.000060 0 0.000000
   -22> 2017-09-19 12:18:26.524332 7f2d937bf700  5 osd.81 pg_epoch:
239989 pg[10.19d( empty lb MIN (bitwise) local-les=194033 n=0 ec=1077
les/c/f 239874/239878/0 239984/239989/236323) [72,81,88]/[72,88,95]
r=-1 lpr=239989 pi=172390-239988/423 crt=0'0 remapped NOTIFY] enter
Started/Stray
   -21> 2017-09-19 12:18:26.585924 7f2d82937700  1 --
10.3.1.105:6817/45761 <== osd.4 10.16.51.102:0/558150 2 ====
osd_ping(ping e239991 stamp 2017-09-19 12:18:26.584753) v2 ==== 47+0+0
(722370431 0 0) 0x561d02827600 con 0x561d02b49900
   -20> 2017-09-19 12:18:26.585966 7f2d82937700  1 --
10.3.1.105:6817/45761 --> 10.16.51.102:0/558150 -- osd_ping(ping_reply
e239989 stamp 2017-09-19 12:18:26.584753) v2 -- ?+0 0x561d02827c00 con
0x561d02b49900
   -19> 2017-09-19 12:18:26.585926 7f2d82836700  1 --
10.16.51.105:6817/45761 <== osd.4 10.16.51.102:0/558150 2 ====
osd_ping(ping e239991 stamp 2017-09-19 12:18:26.584753) v2 ==== 47+0+0
(722370431 0 0) 0x561d02827800 con 0x561d04b7e000
   -18> 2017-09-19 12:18:26.586004 7f2d82836700  1 --
10.16.51.105:6817/45761 --> 10.16.51.102:0/558150 --
osd_ping(ping_reply e239989 stamp 2017-09-19 12:18:26.584753) v2 --
?+0 0x561d02828000 con 0x561d04b7e000
   -17> 2017-09-19 12:18:26.598246 7f2d61cb1700  1 --
10.3.1.105:6817/45761 <== osd.31 10.3.1.102:0/555749 2 ====
osd_ping(ping e239991 stamp 2017-09-19 12:18:26.597198) v2 ==== 47+0+0
(2473246502 0 0) 0x561d02828200 con 0x561d030e5780
   -16> 2017-09-19 12:18:26.598274 7f2d61cb1700  1 --
10.3.1.105:6817/45761 --> 10.3.1.102:0/555749 -- osd_ping(ping_reply
e239989 stamp 2017-09-19 12:18:26.597198) v2 -- ?+0 0x561d02828800 con
0x561d030e5780
   -15> 2017-09-19 12:18:26.598481 7f2d61db2700  1 --
10.16.51.105:6817/45761 <== osd.31 10.3.1.102:0/555749 2 ====
osd_ping(ping e239991 stamp 2017-09-19 12:18:26.597198) v2 ==== 47+0+0
(2473246502 0 0) 0x561d02828400 con 0x561d02ebac00
   -14> 2017-09-19 12:18:26.598495 7f2d61db2700  1 --
10.16.51.105:6817/45761 --> 10.3.1.102:0/555749 -- osd_ping(ping_reply
e239989 stamp 2017-09-19 12:18:26.597198) v2 -- ?+0 0x561d02828c00 con
0x561d02ebac00
   -13> 2017-09-19 12:18:26.664660 7f2d6b9c9700  1 --
10.3.1.105:6817/45761 <== osd.25 10.16.51.102:0/591839 3 ====
osd_ping(ping e239990 stamp 2017-09-19 12:18:26.663309) v2 ==== 47+0+0
(174834353 0 0) 0x561a9072ae00 con 0x561d01150400
   -12> 2017-09-19 12:18:26.664669 7f2d6b8c8700  1 --
10.16.51.105:6817/45761 <== osd.25 10.16.51.102:0/591839 3 ====
osd_ping(ping e239990 stamp 2017-09-19 12:18:26.663309) v2 ==== 47+0+0
(174834353 0 0) 0x561d02bd2200 con 0x561d01150b80
   -11> 2017-09-19 12:18:26.664685 7f2d6b9c9700  1 --
10.3.1.105:6817/45761 --> 10.16.51.102:0/591839 -- osd_ping(ping_reply
e239989 stamp 2017-09-19 12:18:26.663309) v2 -- ?+0 0x561ac20a0800 con
0x561d01150400
   -10> 2017-09-19 12:18:26.664712 7f2d6b8c8700  1 --
10.16.51.105:6817/45761 --> 10.16.51.102:0/591839 --
osd_ping(ping_reply e239989 stamp 2017-09-19 12:18:26.663309) v2 --
?+0 0x561d261f7c00 con 0x561d01150b80
    -9> 2017-09-19 12:18:26.668533 7f2d63797700  1 --
10.16.51.105:6817/45761 <== osd.10 10.16.51.101:0/314610 4 ====
osd_ping(ping e239991 stamp 2017-09-19 12:18:26.667188) v2 ==== 47+0+0
(968170766 0 0) 0x561d07ced000 con 0x561d02d8f800
    -8> 2017-09-19 12:18:26.668556 7f2d63797700  1 --
10.16.51.105:6817/45761 --> 10.16.51.101:0/314610 --
osd_ping(ping_reply e239989 stamp 2017-09-19 12:18:26.667188) v2 --
?+0 0x561cfd02e800 con 0x561d02d8f800
    -7> 2017-09-19 12:18:26.674422 7f2e07129700  1 --
10.3.1.105:6817/45761 <== osd.10 10.16.51.101:0/314610 4 ====
osd_ping(ping e239991 stamp 2017-09-19 12:18:26.667188) v2 ==== 47+0+0
(968170766 0 0) 0x561d07ceda00 con 0x561a9acff180
    -6> 2017-09-19 12:18:26.674442 7f2e07129700  1 --
10.3.1.105:6817/45761 --> 10.16.51.101:0/314610 -- osd_ping(ping_reply
e239989 stamp 2017-09-19 12:18:26.667188) v2 -- ?+0 0x561cfd02e200 con
0x561a9acff180
    -5> 2017-09-19 12:18:26.682821 7f2d9efd6700  1 --
10.16.51.105:6816/45761 <== mon.2 10.16.51.23:6789/0 20 ====
osd_map(239990..239992 src has 198325..239992) v3 ==== 1217+0+0
(3438528651 0 0) 0x561d04adac80 con 0x561cf9548c00
    -4> 2017-09-19 12:18:26.816837 7f2dccac0700  1 --
10.3.1.105:6817/45761 <== osd.43 10.3.1.103:0/509597 2 ====
osd_ping(ping e239990 stamp 2017-09-19 12:18:26.813161) v2 ==== 47+0+0
(1181431656 0 0) 0x561d2437c400 con 0x561d02ebb080
    -3> 2017-09-19 12:18:26.816862 7f2dccac0700  1 --
10.3.1.105:6817/45761 --> 10.3.1.103:0/509597 -- osd_ping(ping_reply
e239989 stamp 2017-09-19 12:18:26.813161) v2 -- ?+0 0x561d2437c800 con
0x561d02ebb080
    -2> 2017-09-19 12:18:26.816895 7f2dc336d700  1 --
10.16.51.105:6817/45761 <== osd.43 10.3.1.103:0/509597 2 ====
osd_ping(ping e239990 stamp 2017-09-19 12:18:26.813161) v2 ==== 47+0+0
(1181431656 0 0) 0x561d2437be00 con 0x561d030c8880
    -1> 2017-09-19 12:18:26.816904 7f2dc336d700  1 --
10.16.51.105:6817/45761 --> 10.3.1.103:0/509597 -- osd_ping(ping_reply
e239989 stamp 2017-09-19 12:18:26.813161) v2 -- ?+0 0x561d2437c200 con
0x561d030c8880
     0> 2017-09-19 12:18:26.842937 7f2d95fc4700 -1 *** Caught signal
(Aborted) **
 in thread 7f2d95fc4700 thread_name:tp_osd

 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
 1: (()+0x984c4e) [0x561a4df72c4e]
 2: (()+0x11390) [0x7f2e23d10390]
 3: (gsignal()+0x38) [0x7f2e21cae428]
 4: (abort()+0x16a) [0x7f2e21cb002a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x26b) [0x561a4e0730db]
 6: (PG::RecoveryState::Stray::react(PG::MLogRec const&)+0x2e6) [0x561a4da6e706]
 7: (boost::statechart::simple_state<PG::RecoveryState::Stray,
PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0x33e) [0x561a4da9f1ce]
 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
PG::RecoveryState::Initial, std::allocator<void>,
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
const&)+0x69) [0x561a4da7f229]
 9: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>,
PG::RecoveryCtx*)+0x395) [0x561a4da52cb5]
 10: (OSD::process_peering_events(std::__cxx11::list<PG*,
std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2d4)
[0x561a4d99e854]
 11: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*,
ThreadPool::TPHandle&)+0x25) [0x561a4d9e74c5]
 12: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x561a4e0650c1]
 13: (ThreadPool::WorkThread::entry()+0x10) [0x561a4e0661c0]
 14: (()+0x76ba) [0x7f2e23d066ba]
 15: (clone()+0x6d) [0x7f2e21d7f82d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.


--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   0/ 1 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 1 ms
   0/ 1 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   0/ 1 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  99/99 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.81.log
--- end dump of recent events ---




* Re: OSD crashes (10.2.9)
From: Sage Weil @ 2017-09-19 17:18 UTC
  To: Wyllys Ingersoll; +Cc: Ceph Development

On Tue, 19 Sep 2017, Wyllys Ingersoll wrote:
> It appears to just be getting an abort signal; I don't see any other assertions.

It may be a ways up in the log if the OSD was busy.  Search for the thread 
id from this line

>      0> 2017-09-19 12:18:26.842937 7f2d95fc4700 -1 *** Caught signal
> (Aborted) **
>  in thread 7f2d95fc4700 thread_name:tp_osd

(7f2d95fc4700 in this case) backwards to find the failed assertion message.
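
As a minimal sketch of that backwards search (not part of Ceph:
`find_failed_assert` and the sample lines are hypothetical, and it assumes
the thread id appears either on the `FAILED assert` line itself or on the
header line directly above it), something like the following Python could
be run over the OSD log:

```python
def find_failed_assert(log_lines, thread_id):
    """Scan backwards; return (index, text) of the last 'FAILED assert'
    line associated with thread_id, or None if there is no such line."""
    for i in range(len(log_lines) - 1, -1, -1):
        if "FAILED assert" not in log_lines[i]:
            continue
        # Assumption: the thread id is on this line or on the line above it.
        context = log_lines[max(0, i - 1): i + 1]
        if any(thread_id in line for line in context):
            return i, log_lines[i].strip()
    return None


# Hypothetical, abbreviated sample modelled on the dump in this thread.
sample = [
    "-1> 2017-09-19 12:18:26.520223 7f2d93fc0700  5 osd.81 pg_epoch: 239989",
    "0> 2017-09-19 12:18:26.520292 7f2d95fc4700 -1 osd/PGLog.h: In function",
    "osd/PGLog.h: 110: FAILED assert(rollback_info_trimmed_to == head)",
    "0> 2017-09-19 12:18:26.842937 7f2d95fc4700 -1 *** Caught signal (Aborted) **",
]

hit = find_failed_assert(sample, "7f2d95fc4700")
print(hit)
```

Against a real log, the list could be built with something like
`open("/var/log/ceph/ceph-osd.81.log").read().splitlines()`.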

Thanks!
sage

* Re: OSD crashes (10.2.9)
From: Wyllys Ingersoll @ 2017-09-19 17:28 UTC
  To: Sage Weil; +Cc: Ceph Development

How's this?


   -15> 2017-09-19 12:18:26.517719 7f2d9d7d3700  1 --
10.3.1.105:6816/45761 <== osd.45 10.3.1.105:6800/2648 13 ====
pg_log(7.a8 epoch 239985 log log((18780'92006,31991'95006],
crt=31991'95006) query_epoch 239985) v4 ==== 529591+0+0 (953547831 0
0) 0x561d172fe700 con 0x561cffb0a780
   -14> 2017-09-19 12:18:26.517727 7f2d9d7d3700  5 -- op tracker --
seq: 1179, time: 2017-09-19 12:18:26.517727, event: started, op:
pg_log(7.a8 epoch 239985 log log((18780'92006,31991'95006],
crt=31991'95006) query_epoch 239985)
   -13> 2017-09-19 12:18:26.517807 7f2d9d7d3700  5 osd.81 pg_epoch:
239985 pg[7.a8(unlocked)] enter Initial
   -12> 2017-09-19 12:18:26.517854 7f2d9d7d3700  5 write_log with:
dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615,
dirty_divergent_priors: false, divergent_priors: 0, writeout_from:
4294967295'18446744073709551615, trimmed:
   -11> 2017-09-19 12:18:26.517863 7f2d9d7d3700  5 osd.81 pg_epoch:
239985 pg[7.a8( empty local-les=0 n=0 ec=961 les/c/f 239883/232911/0
239984/239984/234230) [83,81,38]/[45,74,80] r=-1 lpr=0
pi=232839-239983/18 crt=0'0 inactive] exit Initial 0.000056 0 0.000000
   -10> 2017-09-19 12:18:26.517873 7f2d9d7d3700  5 osd.81 pg_epoch:
239985 pg[7.a8( empty local-les=0 n=0 ec=961 les/c/f 239883/232911/0
239984/239984/234230) [83,81,38]/[45,74,80] r=-1 lpr=0
pi=232839-239983/18 crt=0'0 inactive] enter Reset
    -9> 2017-09-19 12:18:26.517880 7f2d9d7d3700  5 osd.81 pg_epoch:
239985 pg[7.a8( empty local-les=0 n=0 ec=961 les/c/f 239883/232911/0
239984/239984/234230) [83,81,38]/[45,74,80] r=-1 lpr=239985
pi=232839-239983/18 crt=0'0 inactive] exit Reset 0.000006 1 0.000016
    -8> 2017-09-19 12:18:26.517885 7f2d9d7d3700  5 osd.81 pg_epoch:
239985 pg[7.a8( empty local-les=0 n=0 ec=961 les/c/f 239883/232911/0
239984/239984/234230) [83,81,38]/[45,74,80] r=-1 lpr=239985
pi=232839-239983/18 crt=0'0 inactive] enter Started
    -7> 2017-09-19 12:18:26.517889 7f2d9d7d3700  5 osd.81 pg_epoch:
239985 pg[7.a8( empty local-les=0 n=0 ec=961 les/c/f 239883/232911/0
239984/239984/234230) [83,81,38]/[45,74,80] r=-1 lpr=239985
pi=232839-239983/18 crt=0'0 inactive] enter Start
    -6> 2017-09-19 12:18:26.517893 7f2d9d7d3700  1 osd.81 pg_epoch:
239985 pg[7.a8( empty local-les=0 n=0 ec=961 les/c/f 239883/232911/0
239984/239984/234230) [83,81,38]/[45,74,80] r=-1 lpr=239985
pi=232839-239983/18 crt=0'0 inactive] state<Start>: transitioning to
Stray
    -5> 2017-09-19 12:18:26.517898 7f2d9d7d3700  5 osd.81 pg_epoch:
239985 pg[7.a8( empty local-les=0 n=0 ec=961 les/c/f 239883/232911/0
239984/239984/234230) [83,81,38]/[45,74,80] r=-1 lpr=239985
pi=232839-239983/18 crt=0'0 inactive] exit Start 0.000008 0 0.000000
    -4> 2017-09-19 12:18:26.517903 7f2d9d7d3700  5 osd.81 pg_epoch:
239985 pg[7.a8( empty local-les=0 n=0 ec=961 les/c/f 239883/232911/0
239984/239984/234230) [83,81,38]/[45,74,80] r=-1 lpr=239985
pi=232839-239983/18 crt=0'0 inactive] enter Started/Stray
    -3> 2017-09-19 12:18:26.520197 7f2d93fc0700  5 osd.81 pg_epoch:
239989 pg[1.6b3( v 219971'243654 (215190'240654,219971'243654] lb MIN
(bitwise) local-les=239985 n=0 ec=23117 les/c/f 239868/239869/0
239984/239984/234718) [81,63,73]/[40,39,45] r=-1 lpr=239985
pi=239867-239983/1 crt=0'0 lcod 0'0 inactive] exit Started/Stray
0.009491 6 0.000045
   -2> 2017-09-19 12:18:26.520215 7f2d93fc0700  5 osd.81 pg_epoch:
239989 pg[1.6b3( v 219971'243654 (215190'240654,219971'243654] lb MIN
(bitwise) local-les=239985 n=0 ec=23117 les/c/f 239868/239869/0
239984/239984/234718) [81,63,73]/[40,39,45] r=-1 lpr=239985
pi=239867-239983/1 crt=0'0 lcod 0'0 inactive] enter
Started/ReplicaActive
    -1> 2017-09-19 12:18:26.520223 7f2d93fc0700  5 osd.81 pg_epoch:
239989 pg[1.6b3( v 219971'243654 (215190'240654,219971'243654] lb MIN
(bitwise) local-les=239985 n=0 ec=23117 les/c/f 239868/239869/0
239984/239984/234718) [81,63,73]/[40,39,45] r=-1 lpr=239985
pi=239867-239983/1 crt=0'0 lcod 0'0 inactive] enter
Started/ReplicaActive/RepNotRecovering
     0> 2017-09-19 12:18:26.520292 7f2d95fc4700 -1 osd/PGLog.h: In
function 'void PGLog::IndexedLog::claim_log_and_clear_rollback_info(const
pg_log_t&)' thread 7f2d95fc4700 time 2017-09-19 12:18:26.515587
osd/PGLog.h: 110: FAILED assert(rollback_info_trimmed_to == head)

 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x80) [0x561a4e072ef0]
 2: (PG::RecoveryState::Stray::react(PG::MLogRec const&)+0x2e6) [0x561a4da6e706]
 3: (boost::statechart::simple_state<PG::RecoveryState::Stray,
PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0x33e) [0x561a4da9f1ce]
 4: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
PG::RecoveryState::Initial, std::allocator<void>,
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
const&)+0x69) [0x561a4da7f229]
 5: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>,
PG::RecoveryCtx*)+0x395) [0x561a4da52cb5]
 6: (OSD::process_peering_events(std::__cxx11::list<PG*,
std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x2d4)
[0x561a4d99e854]
 7: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*,
ThreadPool::TPHandle&)+0x25) [0x561a4d9e74c5]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x561a4e0650c1]
 9: (ThreadPool::WorkThread::entry()+0x10) [0x561a4e0661c0]
 10: (()+0x76ba) [0x7f2e23d066ba]
 11: (clone()+0x6d) [0x7f2e21d7f82d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
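
For anyone triaging many OSD logs, the assertion record in the dump above
can be pulled apart mechanically. This is only a sketch:
`parse_failed_assert` is a hypothetical helper, not a Ceph tool, and the
pattern assumes the `<file>: <line>: FAILED assert(<condition>)` layout
seen in this log.

```python
import re

# Assumed layout: "<file>: <line>: FAILED assert(<condition>)"
ASSERT_RE = re.compile(
    r"(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<cond>.*)\)"
)


def parse_failed_assert(text):
    """Return the source file, line number and condition of a
    'FAILED assert' record, or None if the text does not match."""
    m = ASSERT_RE.search(text)
    if m is None:
        return None
    return {
        "file": m.group("file"),
        "line": int(m.group("line")),
        "condition": m.group("cond"),
    }


record = parse_failed_assert(
    "osd/PGLog.h: 110: FAILED assert(rollback_info_trimmed_to == head)"
)
print(record)
```

Grouping the parsed records by file, line and condition across the
affected OSDs would show whether they all hit the same assertion.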


