* OSD crashes
@ 2017-10-10 14:36 Wyllys Ingersoll
  2017-10-10 14:54 ` kefu chai
  0 siblings, 1 reply; 7+ messages in thread
From: Wyllys Ingersoll @ 2017-10-10 14:36 UTC (permalink / raw)
  To: Ceph Development

I'm seeing the following OSD crashes on my system, which is in a heavy
recovery state.

Ceph 10.2.9
Ubuntu 16.04.2
XFS disks with both journal and data on the same dmcrypt-protected devices.


   -13> 2017-10-10 10:33:44.202555 7f49da1158c0  5 osd.78 pg_epoch:
288706 pg[23.3bc(unlocked)] enter Initial
   -12> 2017-10-10 10:33:44.204120 7f49da1158c0  5 osd.78 pg_epoch:
288706 pg[23.3bc( v 29854'429 (0'0,29854'429] local-les=285261 n=4
ec=19254 les/c/f 285261/285281/0 285343/285343/285343) [101,39,100]
r=-1 lpr=0 pi=203138-285342/152 crt=29854'429 lcod 0'0 inactive NOTIFY
NIBBLEWISE] exit Initial 0.001559 0 0.000000
   -11> 2017-10-10 10:33:44.204139 7f49da1158c0  5 osd.78 pg_epoch:
288706 pg[23.3bc( v 29854'429 (0'0,29854'429] local-les=285261 n=4
ec=19254 les/c/f 285261/285281/0 285343/285343/285343) [101,39,100]
r=-1 lpr=0 pi=203138-285342/152 crt=29854'429 lcod 0'0 inactive NOTIFY
NIBBLEWISE] enter Reset
   -10> 2017-10-10 10:33:44.233836 7f49da1158c0  5 osd.78 pg_epoch:
288730 pg[9.8(unlocked)] enter Initial
    -9> 2017-10-10 10:33:44.245781 7f49da1158c0  5 osd.78 pg_epoch:
288730 pg[9.8( v 113941'62509 (35637'59509,113941'62509]
local-les=288727 n=26 ec=1076 les/c/f 288727/288730/0
288719/288725/279537) [78,81,100] r=0 lpr=0 crt=113941'62509 lcod 0'0
mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.011945 0 0.000000
    -8> 2017-10-10 10:33:44.245803 7f49da1158c0  5 osd.78 pg_epoch:
288730 pg[9.8( v 113941'62509 (35637'59509,113941'62509]
local-les=288727 n=26 ec=1076 les/c/f 288727/288730/0
288719/288725/279537) [78,81,100] r=0 lpr=0 crt=113941'62509 lcod 0'0
mlcod 0'0 inactive NIBBLEWISE] enter Reset
    -7> 2017-10-10 10:33:44.509240 7f49da1158c0  5 osd.78 pg_epoch:
288753 pg[1.5e7(unlocked)] enter Initial
    -6> 2017-10-10 10:33:47.185265 7f49da1158c0  5 osd.78 pg_epoch:
288753 pg[1.5e7( v 286018'307337 (208416'292664,286018'307337]
local-les=279555 n=8426 ec=23117 les/c/f 279555/279564/0
279532/279544/279544) [78,34,30] r=0 lpr=0 crt=286018'307337 lcod 0'0
mlcod 0'0 inactive NIBBLEWISE] exit Initial 2.676025 0 0.000000
    -5> 2017-10-10 10:33:47.185302 7f49da1158c0  5 osd.78 pg_epoch:
288753 pg[1.5e7( v 286018'307337 (208416'292664,286018'307337]
local-les=279555 n=8426 ec=23117 les/c/f 279555/279564/0
279532/279544/279544) [78,34,30] r=0 lpr=0 crt=286018'307337 lcod 0'0
mlcod 0'0 inactive NIBBLEWISE] enter Reset
    -4> 2017-10-10 10:33:47.345265 7f49da1158c0  5 osd.78 pg_epoch:
288706 pg[2.36a(unlocked)] enter Initial
    -3> 2017-10-10 10:33:47.360864 7f49da1158c0  5 osd.78 pg_epoch:
288706 pg[2.36a( v 279380'86262 (36401'83241,279380'86262]
local-les=285038 n=56 ec=23131 les/c/f 285038/285160/0
284933/284985/284985) [2,78,59] r=1 lpr=0 pi=284823-284984/2
crt=279380'86262 lcod 0'0 inactive NOTIFY NIBBLEWISE] exit Initial
0.015599 0 0.000000
    -2> 2017-10-10 10:33:47.360893 7f49da1158c0  5 osd.78 pg_epoch:
288706 pg[2.36a( v 279380'86262 (36401'83241,279380'86262]
local-les=285038 n=56 ec=23131 les/c/f 285038/285160/0
284933/284985/284985) [2,78,59] r=1 lpr=0 pi=284823-284984/2
crt=279380'86262 lcod 0'0 inactive NOTIFY NIBBLEWISE] enter Reset
    -1> 2017-10-10 10:33:47.589722 7f49da1158c0  5 osd.78 pg_epoch:
288663 pg[1.2ad(unlocked)] enter Initial
     0> 2017-10-10 10:33:48.931168 7f49da1158c0 -1 *** Caught signal
(Aborted) **
 in thread 7f49da1158c0 thread_name:ceph-osd

 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
 1: (()+0x984c4e) [0x5597b21e7c4e]
 2: (()+0x11390) [0x7f49d8fd3390]
 3: (gsignal()+0x38) [0x7f49d6f71428]
 4: (abort()+0x16a) [0x7f49d6f7302a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7f49d78b384d]
 6: (()+0x8d6b6) [0x7f49d78b16b6]
 7: (()+0x8d701) [0x7f49d78b1701]
 8: (()+0x8d919) [0x7f49d78b1919]
 9: (ceph::buffer::create_aligned(unsigned int, unsigned int)+0x146)
[0x5597b22f0f86]
 10: (ceph::buffer::copy(char const*, unsigned int)+0x15) [0x5597b22f10f5]
 11: (ceph::buffer::ptr::ptr(char const*, unsigned int)+0x18) [0x5597b22f1128]
 12: (LevelDBStore::to_bufferlist(leveldb::Slice)+0x75) [0x5597b20a09b5]
 13: (LevelDBStore::LevelDBWholeSpaceIteratorImpl::value()+0x32)
[0x5597b20a4232]
 14: (KeyValueDB::IteratorImpl::value()+0x22) [0x5597b1c843f2]
 15: (DBObjectMap::DBObjectMapIteratorImpl::value()+0x25) [0x5597b204cbd5]
 16: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t,
pg_info_t const&, std::map<eversion_t, hobject_t,
std::less<eversion_t>, std::allocator<std::pair<eversion_t const,
hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&,
std::__cxx11::basic_ostringstream<char, std::char_traits<char>,
std::allocator<char> >&, bool, DoutPrefixProvider const*,
std::set<std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >, std::less<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > >,
std::allocator<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > > >*)+0xb99)
[0x5597b1e92a19]
 17: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x313) [0x5597b1cc0fb3]
 18: (OSD::load_pgs()+0x87a) [0x5597b1bfb96a]
 19: (OSD::init()+0x2026) [0x5597b1c06c56]
 20: (main()+0x2ef1) [0x5597b1b78391]
 21: (__libc_start_main()+0xf0) [0x7f49d6f5c830]
 22: (_start()+0x29) [0x5597b1bb9b99]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.


* Re: OSD crashes
  2017-10-10 14:36 OSD crashes Wyllys Ingersoll
@ 2017-10-10 14:54 ` kefu chai
  2017-10-10 15:00   ` Wyllys Ingersoll
  0 siblings, 1 reply; 7+ messages in thread
From: kefu chai @ 2017-10-10 14:54 UTC (permalink / raw)
  To: Wyllys Ingersoll; +Cc: Ceph Development

On Tue, Oct 10, 2017 at 10:36 PM, Wyllys Ingersoll
<wyllys.ingersoll@keepertech.com> wrote:
> I'm seeing the following OSD crashes on my system, which is in a heavy
> recovery state.
>
> Ceph 10.2.9
> Ubuntu 16.04.2
> XFS disks with both journal and data on the same dmcrypt-protected devices.
>
>
>    -13> 2017-10-10 10:33:44.202555 7f49da1158c0  5 osd.78 pg_epoch:
> 288706 pg[23.3bc(unlocked)] enter Initial
>    -12> 2017-10-10 10:33:44.204120 7f49da1158c0  5 osd.78 pg_epoch:
> 288706 pg[23.3bc( v 29854'429 (0'0,29854'429] local-les=285261 n=4
> ec=19254 les/c/f 285261/285281/0 285343/285343/285343) [101,39,100]
> r=-1 lpr=0 pi=203138-285342/152 crt=29854'429 lcod 0'0 inactive NOTIFY
> NIBBLEWISE] exit Initial 0.001559 0 0.000000
>    -11> 2017-10-10 10:33:44.204139 7f49da1158c0  5 osd.78 pg_epoch:
> 288706 pg[23.3bc( v 29854'429 (0'0,29854'429] local-les=285261 n=4
> ec=19254 les/c/f 285261/285281/0 285343/285343/285343) [101,39,100]
> r=-1 lpr=0 pi=203138-285342/152 crt=29854'429 lcod 0'0 inactive NOTIFY
> NIBBLEWISE] enter Reset
>    -10> 2017-10-10 10:33:44.233836 7f49da1158c0  5 osd.78 pg_epoch:
> 288730 pg[9.8(unlocked)] enter Initial
>     -9> 2017-10-10 10:33:44.245781 7f49da1158c0  5 osd.78 pg_epoch:
> 288730 pg[9.8( v 113941'62509 (35637'59509,113941'62509]
> local-les=288727 n=26 ec=1076 les/c/f 288727/288730/0
> 288719/288725/279537) [78,81,100] r=0 lpr=0 crt=113941'62509 lcod 0'0
> mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.011945 0 0.000000
>     -8> 2017-10-10 10:33:44.245803 7f49da1158c0  5 osd.78 pg_epoch:
> 288730 pg[9.8( v 113941'62509 (35637'59509,113941'62509]
> local-les=288727 n=26 ec=1076 les/c/f 288727/288730/0
> 288719/288725/279537) [78,81,100] r=0 lpr=0 crt=113941'62509 lcod 0'0
> mlcod 0'0 inactive NIBBLEWISE] enter Reset
>     -7> 2017-10-10 10:33:44.509240 7f49da1158c0  5 osd.78 pg_epoch:
> 288753 pg[1.5e7(unlocked)] enter Initial
>     -6> 2017-10-10 10:33:47.185265 7f49da1158c0  5 osd.78 pg_epoch:
> 288753 pg[1.5e7( v 286018'307337 (208416'292664,286018'307337]
> local-les=279555 n=8426 ec=23117 les/c/f 279555/279564/0
> 279532/279544/279544) [78,34,30] r=0 lpr=0 crt=286018'307337 lcod 0'0
> mlcod 0'0 inactive NIBBLEWISE] exit Initial 2.676025 0 0.000000
>     -5> 2017-10-10 10:33:47.185302 7f49da1158c0  5 osd.78 pg_epoch:
> 288753 pg[1.5e7( v 286018'307337 (208416'292664,286018'307337]
> local-les=279555 n=8426 ec=23117 les/c/f 279555/279564/0
> 279532/279544/279544) [78,34,30] r=0 lpr=0 crt=286018'307337 lcod 0'0
> mlcod 0'0 inactive NIBBLEWISE] enter Reset
>     -4> 2017-10-10 10:33:47.345265 7f49da1158c0  5 osd.78 pg_epoch:
> 288706 pg[2.36a(unlocked)] enter Initial
>     -3> 2017-10-10 10:33:47.360864 7f49da1158c0  5 osd.78 pg_epoch:
> 288706 pg[2.36a( v 279380'86262 (36401'83241,279380'86262]
> local-les=285038 n=56 ec=23131 les/c/f 285038/285160/0
> 284933/284985/284985) [2,78,59] r=1 lpr=0 pi=284823-284984/2
> crt=279380'86262 lcod 0'0 inactive NOTIFY NIBBLEWISE] exit Initial
> 0.015599 0 0.000000
>     -2> 2017-10-10 10:33:47.360893 7f49da1158c0  5 osd.78 pg_epoch:
> 288706 pg[2.36a( v 279380'86262 (36401'83241,279380'86262]
> local-les=285038 n=56 ec=23131 les/c/f 285038/285160/0
> 284933/284985/284985) [2,78,59] r=1 lpr=0 pi=284823-284984/2
> crt=279380'86262 lcod 0'0 inactive NOTIFY NIBBLEWISE] enter Reset
>     -1> 2017-10-10 10:33:47.589722 7f49da1158c0  5 osd.78 pg_epoch:
> 288663 pg[1.2ad(unlocked)] enter Initial
>      0> 2017-10-10 10:33:48.931168 7f49da1158c0 -1 *** Caught signal
> (Aborted) **
>  in thread 7f49da1158c0 thread_name:ceph-osd
>
>  ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
>  1: (()+0x984c4e) [0x5597b21e7c4e]
>  2: (()+0x11390) [0x7f49d8fd3390]
>  3: (gsignal()+0x38) [0x7f49d6f71428]
>  4: (abort()+0x16a) [0x7f49d6f7302a]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7f49d78b384d]

Looks like an uncaught bad_alloc exception when we are reading the pg
log from the underlying omap backed by leveldb. It would be weird if
leveldb returned a slice so huge that it failed the allocator. Wyllys,
what did the memory usage look like when the ceph-osd crashed?
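
For reference, here is a rough sketch of the copy path those frames
describe -- not the actual Jewel source; Slice and the buffer type are
simplified stand-ins -- showing where a bad_alloc would surface and why
nothing on the load path catches it:

  // Hedged sketch -- not the actual Jewel source -- of the copy path implied
  // by frames 9-16: PGLog::read_log() walks the PG's omap through a leveldb
  // iterator, and every value() is copied from the leveldb::Slice into a
  // freshly heap-allocated ceph buffer (to_bufferlist -> buffer::copy ->
  // buffer::create_aligned).
  #include <cstddef>
  #include <cstdio>
  #include <cstring>
  #include <vector>

  // Stand-in for leveldb::Slice: a read-only view into leveldb's block cache.
  struct Slice {
    const char* data;
    std::size_t size;
  };

  // Stand-in for the per-value copy done by LevelDBStore::to_bufferlist():
  // one allocation plus one memcpy per omap entry of the pg log.
  std::vector<char> slice_to_buffer(const Slice& s) {
    std::vector<char> buf(s.size);   // operator new -- may throw std::bad_alloc
    std::memcpy(buf.data(), s.data, s.size);
    return buf;
  }

  // Rough shape of the read loop: thousands of log entries per PG, one copy
  // each.  A bad_alloc thrown here unwinds through PG::read_state() and
  // OSD::load_pgs() with no handler in between, so the daemon aborts.
  std::size_t read_pg_log(const std::vector<Slice>& omap_values) {
    std::size_t total = 0;
    for (const Slice& v : omap_values)
      total += slice_to_buffer(v).size();
    return total;
  }

  int main() {
    const char* entries[] = {"pg_log_entry_1", "pg_log_entry_2"};
    std::vector<Slice> values;
    for (const char* e : entries)
      values.push_back(Slice{e, std::strlen(e)});
    std::printf("copied %zu bytes of pg log\n", read_pg_log(values));
    return 0;
  }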

>  6: (()+0x8d6b6) [0x7f49d78b16b6]
>  7: (()+0x8d701) [0x7f49d78b1701]
>  8: (()+0x8d919) [0x7f49d78b1919]
>  9: (ceph::buffer::create_aligned(unsigned int, unsigned int)+0x146)
> [0x5597b22f0f86]
>  10: (ceph::buffer::copy(char const*, unsigned int)+0x15) [0x5597b22f10f5]
>  11: (ceph::buffer::ptr::ptr(char const*, unsigned int)+0x18) [0x5597b22f1128]
>  12: (LevelDBStore::to_bufferlist(leveldb::Slice)+0x75) [0x5597b20a09b5]
>  13: (LevelDBStore::LevelDBWholeSpaceIteratorImpl::value()+0x32)
> [0x5597b20a4232]
>  14: (KeyValueDB::IteratorImpl::value()+0x22) [0x5597b1c843f2]
>  15: (DBObjectMap::DBObjectMapIteratorImpl::value()+0x25) [0x5597b204cbd5]
>  16: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t,
> pg_info_t const&, std::map<eversion_t, hobject_t,
> std::less<eversion_t>, std::allocator<std::pair<eversion_t const,
> hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&,
> std::__cxx11::basic_ostringstream<char, std::char_traits<char>,
> std::allocator<char> >&, bool, DoutPrefixProvider const*,
> std::set<std::__cxx11::basic_string<char, std::char_traits<char>,
> std::allocator<char> >, std::less<std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > >,
> std::allocator<std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > > >*)+0xb99)
> [0x5597b1e92a19]
>  17: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x313) [0x5597b1cc0fb3]
>  18: (OSD::load_pgs()+0x87a) [0x5597b1bfb96a]
>  19: (OSD::init()+0x2026) [0x5597b1c06c56]
>  20: (main()+0x2ef1) [0x5597b1b78391]
>  21: (__libc_start_main()+0xf0) [0x7f49d6f5c830]
>  22: (_start()+0x29) [0x5597b1bb9b99]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.



-- 
Regards
Kefu Chai


* Re: OSD crashes
  2017-10-10 14:54 ` kefu chai
@ 2017-10-10 15:00   ` Wyllys Ingersoll
  0 siblings, 0 replies; 7+ messages in thread
From: Wyllys Ingersoll @ 2017-10-10 15:00 UTC (permalink / raw)
  To: kefu chai; +Cc: Ceph Development

Memory usage is through the roof because the cluster is in a heavy
rebalance operation, and all of the OSDs are gobbling up as much memory
as they can.

The storage server in this case has 128 GB of RAM and 11 OSDs, each
about 3-4 TB in size.

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           125G        117G        1.2G         94M        6.6G        5.2G
Swap:          975M        975M        360K
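
With the host that deep into swap, an allocation failure somewhere on
the OSD startup path seems plausible. As a minimal standalone
illustration of the mechanism (not Ceph code): an allocation request the
kernel refuses becomes an uncaught std::bad_alloc, which on libstdc++
goes through __verbose_terminate_handler() to abort() -- the same
frames 3-8 as in the trace.

  // Standalone illustration (not Ceph code) of the failure mode: an uncaught
  // std::bad_alloc escapes, std::terminate() invokes libstdc++'s
  // __gnu_cxx::__verbose_terminate_handler(), and abort() raises SIGABRT.
  #include <cstddef>

  int main(int argc, char**) {
    // Request far more memory than the host can provide (the size is derived
    // from argc only so the compiler cannot fold the allocation away).  On a
    // typical Linux configuration the request is refused and operator new
    // throws std::bad_alloc; nothing here catches it, just as nothing on the
    // OSD::load_pgs() path does.
    std::size_t huge = static_cast<std::size_t>(argc) << 45;  // tens of TB
    char* p = new char[huge];
    p[0] = 0;        // never reached when the allocation fails
    delete[] p;
    return 0;
  }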



On Tue, Oct 10, 2017 at 10:54 AM, kefu chai <tchaikov@gmail.com> wrote:
> On Tue, Oct 10, 2017 at 10:36 PM, Wyllys Ingersoll
> <wyllys.ingersoll@keepertech.com> wrote:
>> I'm seeing the following OSD crashes on my system, which is in a heavy
>> recovery state.
>>
>> Ceph 10.2.9
>> Ubuntu 16.04.2
>> XFS disks with both journal and data on the same dmcrypt-protected devices.
>>
>>
>>    -13> 2017-10-10 10:33:44.202555 7f49da1158c0  5 osd.78 pg_epoch:
>> 288706 pg[23.3bc(unlocked)] enter Initial
>>    -12> 2017-10-10 10:33:44.204120 7f49da1158c0  5 osd.78 pg_epoch:
>> 288706 pg[23.3bc( v 29854'429 (0'0,29854'429] local-les=285261 n=4
>> ec=19254 les/c/f 285261/285281/0 285343/285343/285343) [101,39,100]
>> r=-1 lpr=0 pi=203138-285342/152 crt=29854'429 lcod 0'0 inactive NOTIFY
>> NIBBLEWISE] exit Initial 0.001559 0 0.000000
>>    -11> 2017-10-10 10:33:44.204139 7f49da1158c0  5 osd.78 pg_epoch:
>> 288706 pg[23.3bc( v 29854'429 (0'0,29854'429] local-les=285261 n=4
>> ec=19254 les/c/f 285261/285281/0 285343/285343/285343) [101,39,100]
>> r=-1 lpr=0 pi=203138-285342/152 crt=29854'429 lcod 0'0 inactive NOTIFY
>> NIBBLEWISE] enter Reset
>>    -10> 2017-10-10 10:33:44.233836 7f49da1158c0  5 osd.78 pg_epoch:
>> 288730 pg[9.8(unlocked)] enter Initial
>>     -9> 2017-10-10 10:33:44.245781 7f49da1158c0  5 osd.78 pg_epoch:
>> 288730 pg[9.8( v 113941'62509 (35637'59509,113941'62509]
>> local-les=288727 n=26 ec=1076 les/c/f 288727/288730/0
>> 288719/288725/279537) [78,81,100] r=0 lpr=0 crt=113941'62509 lcod 0'0
>> mlcod 0'0 inactive NIBBLEWISE] exit Initial 0.011945 0 0.000000
>>     -8> 2017-10-10 10:33:44.245803 7f49da1158c0  5 osd.78 pg_epoch:
>> 288730 pg[9.8( v 113941'62509 (35637'59509,113941'62509]
>> local-les=288727 n=26 ec=1076 les/c/f 288727/288730/0
>> 288719/288725/279537) [78,81,100] r=0 lpr=0 crt=113941'62509 lcod 0'0
>> mlcod 0'0 inactive NIBBLEWISE] enter Reset
>>     -7> 2017-10-10 10:33:44.509240 7f49da1158c0  5 osd.78 pg_epoch:
>> 288753 pg[1.5e7(unlocked)] enter Initial
>>     -6> 2017-10-10 10:33:47.185265 7f49da1158c0  5 osd.78 pg_epoch:
>> 288753 pg[1.5e7( v 286018'307337 (208416'292664,286018'307337]
>> local-les=279555 n=8426 ec=23117 les/c/f 279555/279564/0
>> 279532/279544/279544) [78,34,30] r=0 lpr=0 crt=286018'307337 lcod 0'0
>> mlcod 0'0 inactive NIBBLEWISE] exit Initial 2.676025 0 0.000000
>>     -5> 2017-10-10 10:33:47.185302 7f49da1158c0  5 osd.78 pg_epoch:
>> 288753 pg[1.5e7( v 286018'307337 (208416'292664,286018'307337]
>> local-les=279555 n=8426 ec=23117 les/c/f 279555/279564/0
>> 279532/279544/279544) [78,34,30] r=0 lpr=0 crt=286018'307337 lcod 0'0
>> mlcod 0'0 inactive NIBBLEWISE] enter Reset
>>     -4> 2017-10-10 10:33:47.345265 7f49da1158c0  5 osd.78 pg_epoch:
>> 288706 pg[2.36a(unlocked)] enter Initial
>>     -3> 2017-10-10 10:33:47.360864 7f49da1158c0  5 osd.78 pg_epoch:
>> 288706 pg[2.36a( v 279380'86262 (36401'83241,279380'86262]
>> local-les=285038 n=56 ec=23131 les/c/f 285038/285160/0
>> 284933/284985/284985) [2,78,59] r=1 lpr=0 pi=284823-284984/2
>> crt=279380'86262 lcod 0'0 inactive NOTIFY NIBBLEWISE] exit Initial
>> 0.015599 0 0.000000
>>     -2> 2017-10-10 10:33:47.360893 7f49da1158c0  5 osd.78 pg_epoch:
>> 288706 pg[2.36a( v 279380'86262 (36401'83241,279380'86262]
>> local-les=285038 n=56 ec=23131 les/c/f 285038/285160/0
>> 284933/284985/284985) [2,78,59] r=1 lpr=0 pi=284823-284984/2
>> crt=279380'86262 lcod 0'0 inactive NOTIFY NIBBLEWISE] enter Reset
>>     -1> 2017-10-10 10:33:47.589722 7f49da1158c0  5 osd.78 pg_epoch:
>> 288663 pg[1.2ad(unlocked)] enter Initial
>>      0> 2017-10-10 10:33:48.931168 7f49da1158c0 -1 *** Caught signal
>> (Aborted) **
>>  in thread 7f49da1158c0 thread_name:ceph-osd
>>
>>  ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
>>  1: (()+0x984c4e) [0x5597b21e7c4e]
>>  2: (()+0x11390) [0x7f49d8fd3390]
>>  3: (gsignal()+0x38) [0x7f49d6f71428]
>>  4: (abort()+0x16a) [0x7f49d6f7302a]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7f49d78b384d]
>
> Looks like an uncaught bad_alloc exception when we are reading the pg
> log from the underlying omap backed by leveldb. It would be weird if
> leveldb returned a slice so huge that it failed the allocator. Wyllys,
> what did the memory usage look like when the ceph-osd crashed?
>
>>  6: (()+0x8d6b6) [0x7f49d78b16b6]
>>  7: (()+0x8d701) [0x7f49d78b1701]
>>  8: (()+0x8d919) [0x7f49d78b1919]
>>  9: (ceph::buffer::create_aligned(unsigned int, unsigned int)+0x146)
>> [0x5597b22f0f86]
>>  10: (ceph::buffer::copy(char const*, unsigned int)+0x15) [0x5597b22f10f5]
>>  11: (ceph::buffer::ptr::ptr(char const*, unsigned int)+0x18) [0x5597b22f1128]
>>  12: (LevelDBStore::to_bufferlist(leveldb::Slice)+0x75) [0x5597b20a09b5]
>>  13: (LevelDBStore::LevelDBWholeSpaceIteratorImpl::value()+0x32)
>> [0x5597b20a4232]
>>  14: (KeyValueDB::IteratorImpl::value()+0x22) [0x5597b1c843f2]
>>  15: (DBObjectMap::DBObjectMapIteratorImpl::value()+0x25) [0x5597b204cbd5]
>>  16: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t,
>> pg_info_t const&, std::map<eversion_t, hobject_t,
>> std::less<eversion_t>, std::allocator<std::pair<eversion_t const,
>> hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&,
>> std::__cxx11::basic_ostringstream<char, std::char_traits<char>,
>> std::allocator<char> >&, bool, DoutPrefixProvider const*,
>> std::set<std::__cxx11::basic_string<char, std::char_traits<char>,
>> std::allocator<char> >, std::less<std::__cxx11::basic_string<char,
>> std::char_traits<char>, std::allocator<char> > >,
>> std::allocator<std::__cxx11::basic_string<char,
>> std::char_traits<char>, std::allocator<char> > > >*)+0xb99)
>> [0x5597b1e92a19]
>>  17: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x313) [0x5597b1cc0fb3]
>>  18: (OSD::load_pgs()+0x87a) [0x5597b1bfb96a]
>>  19: (OSD::init()+0x2026) [0x5597b1c06c56]
>>  20: (main()+0x2ef1) [0x5597b1b78391]
>>  21: (__libc_start_main()+0xf0) [0x7f49d6f5c830]
>>  22: (_start()+0x29) [0x5597b1bb9b99]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> needed to interpret this.
>
>
>
> --
> Regards
> Kefu Chai


* OSD crashes
@ 2017-09-18 18:24 Wyllys Ingersoll
  0 siblings, 0 replies; 7+ messages in thread
From: Wyllys Ingersoll @ 2017-09-18 18:24 UTC (permalink / raw)
  To: Ceph Development

We have a cluster going through a heavy rebalance operation, but it's
hampered by several OSDs that keep crashing and restarting.

Jewel 10.2.7
Ubuntu 16.04.2

Here is a dump of the log from one of the crashing OSDs:



     0> 2017-09-18 14:08:18.631931 7f481207d8c0 -1 *** Caught signal
(Aborted) **
 in thread 7f481207d8c0 thread_name:ceph-osd

 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
 1: (()+0x9770ae) [0x557bbc5fc0ae]
 2: (()+0x11390) [0x7f4810f3b390]
 3: (gsignal()+0x38) [0x7f480eed9428]
 4: (abort()+0x16a) [0x7f480eedb02a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7f480f81b84d]
 6: (()+0x8d6b6) [0x7f480f8196b6]
 7: (()+0x8d701) [0x7f480f819701]
 8: (()+0x8d919) [0x7f480f819919]
 9: (()+0x1230f) [0x7f4811c1330f]
 10: (operator new[](unsigned long)+0x4e7) [0x7f4811c374b7]
 11: (leveldb::ReadBlock(leveldb::RandomAccessFile*,
leveldb::ReadOptions const&, leveldb::BlockHandle const&,
leveldb::BlockContents*)+0x313) [0x7f48115c4e63]
 12: (leveldb::Table::BlockReader(void*, leveldb::ReadOptions const&,
leveldb::Slice const&)+0x276) [0x7f48115c9426]
 13: (()+0x421be) [0x7f48115cd1be]
 14: (()+0x42240) [0x7f48115cd240]
 15: (()+0x4261e) [0x7f48115cd61e]
 16: (()+0x3d835) [0x7f48115c8835]
 17: (()+0x1fffb) [0x7f48115aaffb]
 18: (_ZN12LevelDBStore29LevelDBWholeSpaceIteratorImpl4nextEv()+0x8f)
[0x557bbc4b7a3f]
 19: (_ZN11DBObjectMap23DBObjectMapIteratorImpl4nextEb()+0x34) [0x557bbc46bb24]
 20: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t,
pg_info_t const&, std::map<eversion_t, hobject_t,
std::less<eversion_t>, std::allocator<std::pair<eversion_t const,
hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&,
std::__cxx11::basic_ostringstream<char, std::char_traits<char>,
std::allocator<char> >&, DoutPrefixProvider const*,
std::set<std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >, std::less<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > >,
std::allocator<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > > >*)+0xac3)
[0x557bbc2a9653]
 21: (_ZN2PG10read_stateEP11ObjectStoreRN4ceph6buffer4listE()+0x2f6)
[0x557bbc0db306]
 22: (OSD::load_pgs()+0x87a) [0x557bbc016f0a]
 23: (OSD::init()+0x2026) [0x557bbc0221f6]
 24: (main()+0x2ea5) [0x557bbbf93dc5]
 25: (__libc_start_main()+0xf0) [0x7f480eec4830]
 26: (_start()+0x29) [0x557bbbfd5459]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 1 ms
   0/ 1 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   0/ 1 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  99/99 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.70.log
--- end dump of recent events ---


* OSD crashes
@ 2012-03-06  6:54 Borodin Vladimir
  0 siblings, 0 replies; 7+ messages in thread
From: Borodin Vladimir @ 2012-03-06  6:54 UTC (permalink / raw)
  To: ceph-devel

Hi all.

One of my OSDs crashes when I try to start it. I've turned on "debug
osd = 20" in ceph.conf for this node and put the log here:
http://simply.name/osd.47.log. The ceph.conf file is here:
http://simply.name/ceph.conf. Is there any other information I should
provide?

I updated to 0.43 recently, and there were no problems right after the
update. Actually, I don't know when this problem appeared.

Regards,
Vladimir.


* Re: OSD crashes
  2011-10-11 16:02 Christian Brunner
@ 2011-10-11 18:09 ` Gregory Farnum
  0 siblings, 0 replies; 7+ messages in thread
From: Gregory Farnum @ 2011-10-11 18:09 UTC (permalink / raw)
  To: chb; +Cc: ceph-devel

On Tue, Oct 11, 2011 at 9:02 AM, Christian Brunner <chb@muc.de> wrote:
> Here is another one...
>
> I've now run mkcephfs and started importing all our data from a
> backup. However, after two days, two of our OSDs are crashing right
> after startup again.
>
> It all started with a "hit suicide timeout". Now I can't start it any
> longer. Here is what I have in the logs. I'm sending the complete log
> because I'm getting different messages.
This has a few interesting things, but it looks like it doesn't have
all the entries -- are you using syslog over UDP by any chance?
-Greg


* OSD crashes
@ 2011-10-11 16:02 Christian Brunner
  2011-10-11 18:09 ` Gregory Farnum
  0 siblings, 1 reply; 7+ messages in thread
From: Christian Brunner @ 2011-10-11 16:02 UTC (permalink / raw)
  To: ceph-devel

Here is another one...

I've now run mkcephfs and started importing all our data from a
backup. However, after two days, two of our OSDs are crashing right
after startup again.

It all started with a "hit suicide timeout". Now I can't start it any
longer. Here is what I have in the logs. I'm sending the complete log
because I'm getting different messages.

Thanks,
Christian

[-- Attachment #2: log.txt.gz --]
[-- Type: application/x-gzip, Size: 34679 bytes --]



Thread overview: 7+ messages
2017-10-10 14:36 OSD crashes Wyllys Ingersoll
2017-10-10 14:54 ` kefu chai
2017-10-10 15:00   ` Wyllys Ingersoll
  -- strict thread matches above, loose matches on Subject: below --
2017-09-18 18:24 Wyllys Ingersoll
2012-03-06  6:54 Borodin Vladimir
2011-10-11 16:02 Christian Brunner
2011-10-11 18:09 ` Gregory Farnum
