* OSD crash
@ 2012-08-22 20:31 Andrey Korolyov
  2012-08-22 22:33 ` Sage Weil
  0 siblings, 1 reply; 28+ messages in thread
From: Andrey Korolyov @ 2012-08-22 20:31 UTC (permalink / raw)
  To: ceph-devel

Hi,

today during a heavy test a pair of osds and one mon died, resulting
in a hard lockup of some kvm processes - they became unresponsive and
were killed, leaving zombie processes ([kvm] <defunct>). The entire
cluster contains sixteen osds on eight nodes and three mons: on the
first and last nodes, and on a vm outside the cluster.

osd bt:
(gdb) bt
#0  0x00007fc37d490be3 in
tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
unsigned long, int) () from /usr/lib/libtcmalloc.so.4
#1  0x00007fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
/usr/lib/libtcmalloc.so.4
#2  0x00007fc37d4a2287 in tc_delete () from /usr/lib/libtcmalloc.so.4
#3  0x00000000008b1224 in _M_dispose (__a=..., this=0x6266d80) at
/usr/include/c++/4.7/bits/basic_string.h:246
#4  ~basic_string (this=0x7fc3736639d0, __in_chrg=<optimized out>) at
/usr/include/c++/4.7/bits/basic_string.h:536
#5  ~basic_stringbuf (this=0x7fc373663988, __in_chrg=<optimized out>)
at /usr/include/c++/4.7/sstream:60
#6  ~basic_ostringstream (this=0x7fc373663980, __in_chrg=<optimized
out>, __vtt_parm=<optimized out>) at /usr/include/c++/4.7/sstream:439
#7  pretty_version_to_str () at common/version.cc:40
#8  0x0000000000791630 in ceph::BackTrace::print (this=0x7fc373663d10,
out=...) at common/BackTrace.cc:19
#9  0x000000000078f450 in handle_fatal_signal (signum=11) at
global/signal_handler.cc:91
#10 <signal handler called>
#11 0x00007fc37d490be3 in
tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
unsigned long, int) () from /usr/lib/libtcmalloc.so.4
#12 0x00007fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
/usr/lib/libtcmalloc.so.4
#13 0x00007fc37d49eb97 in tc_free () from /usr/lib/libtcmalloc.so.4
#14 0x00007fc37d1c6670 in __gnu_cxx::__verbose_terminate_handler() ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#15 0x00007fc37d1c4796 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#16 0x00007fc37d1c47c3 in std::terminate() () from
/usr/lib/x86_64-linux-gnu/libstdc++.so.6
#17 0x00007fc37d1c49ee in __cxa_throw () from
/usr/lib/x86_64-linux-gnu/libstdc++.so.6
#18 0x0000000000844e11 in ceph::__ceph_assert_fail (assertion=0x90c01c
"0 == \"unexpected error\"", file=<optimized out>, line=3007,
    func=0x90ef80 "unsigned int
FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)")
at common/assert.cc:77
#19 0x000000000073148f in FileStore::_do_transaction
(this=this@entry=0x2cde000, t=..., op_seq=op_seq@entry=429545,
trans_num=trans_num@entry=0) at os/FileStore.cc:3007
#20 0x000000000073484e in FileStore::do_transactions (this=0x2cde000,
tls=..., op_seq=429545) at os/FileStore.cc:2436
#21 0x000000000070c680 in FileStore::_do_op (this=0x2cde000,
osr=<optimized out>) at os/FileStore.cc:2259
#22 0x000000000083ce01 in ThreadPool::worker (this=0x2cde828) at
common/WorkQueue.cc:54
#23 0x00000000006823ed in ThreadPool::WorkThread::entry
(this=<optimized out>) at ./common/WorkQueue.h:126
#24 0x00007fc37e3eee9a in start_thread () from
/lib/x86_64-linux-gnu/libpthread.so.0
#25 0x00007fc37c9864cd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#26 0x0000000000000000 in ?? ()
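
For what it's worth, the assertion text in frame #18 - 0 ==
"unexpected error" - is the usual C/C++ idiom for an always-failing
assert that carries a message. A minimal standalone illustration of
the idiom (plain assert() here; ceph routes it through
ceph::__ceph_assert_fail, as the frame shows):

#include <cassert>

int main() {
  // A string literal converts to a non-null pointer, so the comparison
  // below is always false and the assert fires whenever this line is
  // reached, carrying the message in the printed expression.
  assert(0 == "unexpected error");
  return 0;
}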

mon bt was exactly the same as in http://tracker.newdream.net/issues/2762

* osd crash
@ 2020-09-07 16:42 Kaarlo Lahtela
  0 siblings, 0 replies; 28+ messages in thread
From: Kaarlo Lahtela @ 2020-09-07 16:42 UTC (permalink / raw)
  To: ceph-devel

Hi,
two of my osds on different nodes do not start, so I now have one pg
that is down. This happened on ceph version 14.2.10 and still happens
after upgrading to 14.2.11. I get this error when starting the osd:

===============8<========================
root@prox:~# /usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph
--setgroup ceph
2020-09-05 13:53:15.077 7fd0fca43c80 -1 osd.1 6645 log_to_monitors
{default=true}
2020-09-05 13:53:15.189 7fd0f5d4b700 -1 osd.1 6687 set_numa_affinity
unable to identify public interface 'vmbr0' numa node: (2) No such
file or directory
/build/ceph-JY24tx/ceph-14.2.11/src/osd/osd_types.cc: In function
'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread
7fd0e2524700 time 2020-09-05 13:53:17.980687
/build/ceph-JY24tx/ceph-14.2.11/src/osd/osd_types.cc: 5450: FAILED
ceph_assert(clone_overlap.count(clone))
 ceph version 14.2.11 (21626754f4563baadc6ba5d50b9cbc48a5730a94)
nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x152) [0x562cc5ea83c8]
 2: (()+0x5115a0) [0x562cc5ea85a0]
 3: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x562cc61dc432]
 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>,
pg_stat_t*)+0x297) [0x562cc6107fb7]
 5: (PrimaryLogPG::recover_backfill(unsigned long,
ThreadPool::TPHandle&, bool*)+0xfdc) [0x562cc6136a3c]
 6: (PrimaryLogPG::start_recovery_ops(unsigned long,
ThreadPool::TPHandle&, unsigned long*)+0x1173) [0x562cc613ab43]
 7: (OSD::do_recovery(PG*, unsigned int, unsigned long,
ThreadPool::TPHandle&)+0x302) [0x562cc5f8b622]
 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&,
ThreadPool::TPHandle&)+0x19) [0x562cc622fac9]
 9: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x7d7) [0x562cc5fa7ba7]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4)
[0x562cc65740c4]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x562cc6576ad0]
 12: (()+0x7fa3) [0x7fd0fd487fa3]
 13: (clone()+0x3f) [0x7fd0fd0374cf]
*** Caught signal (Aborted) **
 in thread 7fd0e2524700 thread_name:tp_osd_tp
2020-09-05 13:53:17.977 7fd0e2524700 -1
/build/ceph-JY24tx/ceph-14.2.11/src/osd/osd_types.cc: In function
'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread
7fd0e2524700 time 2020-09-05 13:53:17.980687
/build/ceph-JY24tx/ceph-14.2.11/src/osd/osd_types.cc: 5450: FAILED
ceph_assert(clone_overlap.count(clone))

 ceph version 14.2.11 (21626754f4563baadc6ba5d50b9cbc48a5730a94)
nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x152) [0x562cc5ea83c8]
 2: (()+0x5115a0) [0x562cc5ea85a0]
 3: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x562cc61dc432]
 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>,
pg_stat_t*)+0x297) [0x562cc6107fb7]
 5: (PrimaryLogPG::recover_backfill(unsigned long,
ThreadPool::TPHandle&, bool*)+0xfdc) [0x562cc6136a3c]
 6: (PrimaryLogPG::start_recovery_ops(unsigned long,
ThreadPool::TPHandle&, unsigned long*)+0x1173) [0x562cc613ab43]
 7: (OSD::do_recovery(PG*, unsigned int, unsigned long,
ThreadPool::TPHandle&)+0x302) [0x562cc5f8b622]
 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&,
ThreadPool::TPHandle&)+0x19) [0x562cc622fac9]
 9: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x7d7) [0x562cc5fa7ba7]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4)
[0x562cc65740c4]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x562cc6576ad0]
 12: (()+0x7fa3) [0x7fd0fd487fa3]
 13: (clone()+0x3f) [0x7fd0fd0374cf]

 ceph version 14.2.11 (21626754f4563baadc6ba5d50b9cbc48a5730a94)
nautilus (stable)
 1: (()+0x12730) [0x7fd0fd492730]
 2: (gsignal()+0x10b) [0x7fd0fcf757bb]
 3: (abort()+0x121) [0x7fd0fcf60535]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1a3) [0x562cc5ea8419]
 5: (()+0x5115a0) [0x562cc5ea85a0]
 6: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x562cc61dc432]
 7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>,
pg_stat_t*)+0x297) [0x562cc6107fb7]
 8: (PrimaryLogPG::recover_backfill(unsigned long,
ThreadPool::TPHandle&, bool*)+0xfdc) [0x562cc6136a3c]
 9: (PrimaryLogPG::start_recovery_ops(unsigned long,
ThreadPool::TPHandle&, unsigned long*)+0x1173) [0x562cc613ab43]
 10: (OSD::do_recovery(PG*, unsigned int, unsigned long,
ThreadPool::TPHandle&)+0x302) [0x562cc5f8b622]
 11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&,
ThreadPool::TPHandle&)+0x19) [0x562cc622fac9]
 12: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x7d7) [0x562cc5fa7ba7]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4)
[0x562cc65740c4]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x562cc6576ad0]
 15: (()+0x7fa3) [0x7fd0fd487fa3]
 16: (clone()+0x3f) [0x7fd0fd0374cf]
2020-09-05 13:53:17.981 7fd0e2524700 -1 *** Caught signal (Aborted) **
 in thread 7fd0e2524700 thread_name:tp_osd_tp

 ceph version 14.2.11 (21626754f4563baadc6ba5d50b9cbc48a5730a94)
nautilus (stable)
 1: (()+0x12730) [0x7fd0fd492730]
 2: (gsignal()+0x10b) [0x7fd0fcf757bb]
 3: (abort()+0x121) [0x7fd0fcf60535]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1a3) [0x562cc5ea8419]
 5: (()+0x5115a0) [0x562cc5ea85a0]
 6: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x562cc61dc432]
 7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>,
pg_stat_t*)+0x297) [0x562cc6107fb7]
 8: (PrimaryLogPG::recover_backfill(unsigned long,
ThreadPool::TPHandle&, bool*)+0xfdc) [0x562cc6136a3c]
 9: (PrimaryLogPG::start_recovery_ops(unsigned long,
ThreadPool::TPHandle&, unsigned long*)+0x1173) [0x562cc613ab43]
 10: (OSD::do_recovery(PG*, unsigned int, unsigned long,
ThreadPool::TPHandle&)+0x302) [0x562cc5f8b622]
 11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&,
ThreadPool::TPHandle&)+0x19) [0x562cc622fac9]
 12: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x7d7) [0x562cc5fa7ba7]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4)
[0x562cc65740c4]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x562cc6576ad0]
 15: (()+0x7fa3) [0x7fd0fd487fa3]
 16: (clone()+0x3f) [0x7fd0fd0374cf]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

  -469> 2020-09-05 13:53:15.077 7fd0fca43c80 -1 osd.1 6645
log_to_monitors {default=true}
  -195> 2020-09-05 13:53:15.189 7fd0f5d4b700 -1 osd.1 6687
set_numa_affinity unable to identify public interface 'vmbr0' numa
node: (2) No such file or directory
    -1> 2020-09-05 13:53:17.977 7fd0e2524700 -1
/build/ceph-JY24tx/ceph-14.2.11/src/osd/osd_types.cc: In function
'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread
7fd0e2524700 time 2020-09-05 13:53:17.980687
/build/ceph-JY24tx/ceph-14.2.11/src/osd/osd_types.cc: 5450: FAILED
ceph_assert(clone_overlap.count(clone))

 ceph version 14.2.11 (21626754f4563baadc6ba5d50b9cbc48a5730a94)
nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x152) [0x562cc5ea83c8]
 2: (()+0x5115a0) [0x562cc5ea85a0]
 3: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x562cc61dc432]
 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>,
pg_stat_t*)+0x297) [0x562cc6107fb7]
 5: (PrimaryLogPG::recover_backfill(unsigned long,
ThreadPool::TPHandle&, bool*)+0xfdc) [0x562cc6136a3c]
 6: (PrimaryLogPG::start_recovery_ops(unsigned long,
ThreadPool::TPHandle&, unsigned long*)+0x1173) [0x562cc613ab43]
 7: (OSD::do_recovery(PG*, unsigned int, unsigned long,
ThreadPool::TPHandle&)+0x302) [0x562cc5f8b622]
 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&,
ThreadPool::TPHandle&)+0x19) [0x562cc622fac9]
 9: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x7d7) [0x562cc5fa7ba7]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4)
[0x562cc65740c4]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x562cc6576ad0]
 12: (()+0x7fa3) [0x7fd0fd487fa3]
 13: (clone()+0x3f) [0x7fd0fd0374cf]

     0> 2020-09-05 13:53:17.981 7fd0e2524700 -1 *** Caught signal (Aborted) **
 in thread 7fd0e2524700 thread_name:tp_osd_tp

 ceph version 14.2.11 (21626754f4563baadc6ba5d50b9cbc48a5730a94)
nautilus (stable)
 1: (()+0x12730) [0x7fd0fd492730]
 2: (gsignal()+0x10b) [0x7fd0fcf757bb]
 3: (abort()+0x121) [0x7fd0fcf60535]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1a3) [0x562cc5ea8419]
 5: (()+0x5115a0) [0x562cc5ea85a0]
 6: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x562cc61dc432]
 7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>,
pg_stat_t*)+0x297) [0x562cc6107fb7]
 8: (PrimaryLogPG::recover_backfill(unsigned long,
ThreadPool::TPHandle&, bool*)+0xfdc) [0x562cc6136a3c]
 9: (PrimaryLogPG::start_recovery_ops(unsigned long,
ThreadPool::TPHandle&, unsigned long*)+0x1173) [0x562cc613ab43]
 10: (OSD::do_recovery(PG*, unsigned int, unsigned long,
ThreadPool::TPHandle&)+0x302) [0x562cc5f8b622]
 11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&,
ThreadPool::TPHandle&)+0x19) [0x562cc622fac9]
 12: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x7d7) [0x562cc5fa7ba7]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4)
[0x562cc65740c4]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x562cc6576ad0]
 15: (()+0x7fa3) [0x7fd0fd487fa3]
 16: (clone()+0x3f) [0x7fd0fd0374cf]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

Aborted
===============8<========================
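
For reference, ceph_assert(clone_overlap.count(clone)) is a
map-membership check: std::map::count() returns 0 when the key is
absent, so the osd aborts because the clone being accounted in
SnapSet::get_clone_bytes() has no entry in the snapset's clone_overlap
map. A minimal sketch of the shape of that check, with simplified
stand-in types (illustrative only, not the actual osd_types.cc code):

#include <cassert>
#include <cstdint>
#include <map>

using snapid_t = uint64_t;  // stand-in; ceph wraps this in its own type

int main() {
  // Maps each clone to its overlap with the adjacent snapshot; the real
  // mapped values are interval sets, simplified here to byte counts.
  std::map<snapid_t, uint64_t> clone_overlap;
  clone_overlap[4] = 4096;

  snapid_t clone = 5;                  // listed in the snapset but absent here
  assert(clone_overlap.count(clone));  // count() == 0 -> abort, as in the log
  return 0;
}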

What can I do to recover my osds?

-- 
</kaarlo>

* OSD crash
@ 2012-06-16 12:57 Stefan Priebe
  2012-06-16 13:34 ` Stefan Priebe
  0 siblings, 1 reply; 28+ messages in thread
From: Stefan Priebe @ 2012-06-16 12:57 UTC (permalink / raw)
  To: ceph-devel

Hi,

today I got another osd crash ;-( Strangely, the osd logs are all
empty; it seems logrotate hasn't reloaded the daemons. I still have
the core dump file, though. What's next?

Stefan


* OSD crash
@ 2011-05-27  0:12 Fyodor Ustinov
  2011-05-27 15:16 ` Gregory Farnum
  0 siblings, 1 reply; 28+ messages in thread
From: Fyodor Ustinov @ 2011-05-27  0:12 UTC (permalink / raw)
  To: ceph-devel

Hi!

2011-05-27 02:35:22.046798 7fa8ff058700 journal check_for_full at 
837623808 : JOURNAL FULL 837623808 >= 147455 (max_size 996147200 start 
837771264)
2011-05-27 02:35:23.479379 7fa8f7f49700 journal throttle: waited for bytes
2011-05-27 02:35:34.730418 7fa8ff058700 journal check_for_full at 
836984832 : JOURNAL FULL 836984832 >= 638975 (max_size 996147200 start 
837623808)
2011-05-27 02:35:36.050384 7fa8f7f49700 journal throttle: waited for bytes
2011-05-27 02:35:47.226789 7fa8ff058700 journal check_for_full at 
836882432 : JOURNAL FULL 836882432 >= 102399 (max_size 996147200 start 
836984832)
2011-05-27 02:35:48.937259 7fa8f874a700 journal throttle: waited for bytes
2011-05-27 02:35:59.985040 7fa8ff058700 journal check_for_full at 
836685824 : JOURNAL FULL 836685824 >= 196607 (max_size 996147200 start 
836882432)
2011-05-27 02:36:01.654955 7fa8f874a700 journal throttle: waited for bytes
2011-05-27 02:36:12.362896 7fa8ff058700 journal check_for_full at 
835723264 : JOURNAL FULL 835723264 >= 962559 (max_size 996147200 start 
836685824)
2011-05-27 02:36:14.375435 7fa8f7f49700 journal throttle: waited for bytes
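
Reading the numbers in the first check_for_full line: with start
837771264 and write position 837623808, the free room is 837771264 -
837623808 - 1 = 147455, the value printed after ">=", and the same
relation holds for the other four lines; each "journal throttle:
waited for bytes" line is a writer blocking until trimming frees room.
A tiny check of that arithmetic (my reading of the log format; the
wrap-around branch is an assumption):

#include <cstdint>
#include <cstdio>

int main() {
  uint64_t max_size = 996147200, start = 837771264, pos = 837623808;
  // Room between the write position and the start of the oldest live
  // entry in a ring-buffer journal; pos > start would mean wrap-around.
  uint64_t room = (start > pos) ? start - pos - 1
                                : max_size - pos + start - 1;
  std::printf("room = %llu\n", (unsigned long long)room);  // prints 147455
  return 0;
}
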
./include/xlist.h: In function 'void xlist<T>::remove(xlist<T>::item*) 
[with T = PG*]', in thread '0x7fa8f7748700'
./include/xlist.h: 107: FAILED assert(i->_list == this)
  ceph version 0.28.1 (commit:d66c6ca19bbde3c363b135b66072de44e67c6632)
  1: (xlist<PG*>::pop_front()+0xbb) [0x54f28b]
  2: (OSD::RecoveryWQ::_dequeue()+0x73) [0x56bcc3]
  3: (ThreadPool::worker()+0x10a) [0x65799a]
  4: (ThreadPool::WorkThread::entry()+0xd) [0x548c8d]
  5: (()+0x6d8c) [0x7fa904294d8c]
  6: (clone()+0x6d) [0x7fa90314704d]
  ceph version 0.28.1 (commit:d66c6ca19bbde3c363b135b66072de44e67c6632)
  1: (xlist<PG*>::pop_front()+0xbb) [0x54f28b]
  2: (OSD::RecoveryWQ::_dequeue()+0x73) [0x56bcc3]
  3: (ThreadPool::worker()+0x10a) [0x65799a]
  4: (ThreadPool::WorkThread::entry()+0xd) [0x548c8d]
  5: (()+0x6d8c) [0x7fa904294d8c]
  6: (clone()+0x6d) [0x7fa90314704d]
*** Caught signal (Aborted) **
  in thread 0x7fa8f7748700
  ceph version 0.28.1 (commit:d66c6ca19bbde3c363b135b66072de44e67c6632)
  1: /usr/bin/cosd() [0x6729f9]
  2: (()+0xfc60) [0x7fa90429dc60]
  3: (gsignal()+0x35) [0x7fa903094d05]
  4: (abort()+0x186) [0x7fa903098ab6]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fa90394b6dd]
  6: (()+0xb9926) [0x7fa903949926]
  7: (()+0xb9953) [0x7fa903949953]
  8: (()+0xb9a5e) [0x7fa903949a5e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x362) [0x655e32]
  10: (xlist<PG*>::pop_front()+0xbb) [0x54f28b]
  11: (OSD::RecoveryWQ::_dequeue()+0x73) [0x56bcc3]
  12: (ThreadPool::worker()+0x10a) [0x65799a]
  13: (ThreadPool::WorkThread::entry()+0xd) [0x548c8d]
  14: (()+0x6d8c) [0x7fa904294d8c]
  15: (clone()+0x6d) [0x7fa90314704d]
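
The failed assert says an xlist item was removed through a list it no
longer believes it belongs to: each item keeps a back-pointer to its
owning list, and remove() checks i->_list == this. A minimal sketch of
that invariant with a simplified intrusive list (illustrative only,
not ceph's xlist):

#include <cassert>

struct List;

struct Item {
  List* _list = nullptr;  // back-pointer set on insert, checked on remove
  Item* next = nullptr;
};

struct List {
  Item* head = nullptr;
  void push_front(Item* i) {
    i->_list = this;
    i->next = head;
    head = i;
  }
  void remove(Item* i) {
    assert(i->_list == this);       // the invariant that failed above
    if (head == i) head = i->next;  // unlink (head case only, for brevity)
    i->_list = nullptr;
  }
  void pop_front() { remove(head); }
};

int main() {
  List a, b;
  Item x;
  a.push_front(&x);
  b.push_front(&x);  // x now claims membership in b...
  a.pop_front();     // ...so popping it from a fires the assert
  return 0;
}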

WBR,
     Fyodor.

* RE: OSD Crash
@ 2011-05-11 20:47 Mark Nigh
  2011-05-11 21:06 ` Sage Weil
  2011-05-11 21:39 ` Colin McCabe
  0 siblings, 2 replies; 28+ messages in thread
From: Mark Nigh @ 2011-05-11 20:47 UTC (permalink / raw)
  To: Mark Nigh, ceph-devel

Some additional testing shows that the underlying btrfs filesystem does fail, and thus the daemon appropriately fails with it.

The way I am simulating a failed HDD is by removing the HDD. The failure handling works, but the problem comes when I reinsert the HDD. I think I see the btrfs filesystem recover (btrfs filesystem show), and I can start the osd daemon that corresponds to the mount point, but I do not see the osd come up and in (ceph -s). The log is limited to:

 ceph version 0.27.commit: 793034c62c8e9ffab4af675ca97135fd1b193c9c. process: cosd. pid: 2702
2011-05-11 15:13:58.650515 7fc6a349d760 filestore(/mnt/osd2) mount FIEMAP ioctl is NOT supported
2011-05-11 15:13:58.650754 7fc6a349d760 filestore(/mnt/osd2) mount detected btrfs
2011-05-11 15:13:58.650768 7fc6a349d760 filestore(/mnt/osd2) mount btrfs CLONE_RANGE ioctl is supported

If I try to restart the osd daemon, it is unable to kill the process and keeps retrying the kill.

Is the underlying filesystem not recovering the way I think it should? I guess removing and reinserting the HDD isn't the correct way to simulate a dead HDD? Should I follow the process of removing the osd, initializing the osd data dir, and then restarting the osd daemon?

Thanks.

Mark Nigh
Systems Architect
Netelligent Corporation
mnigh@netelligent.com



-----Original Message-----
From: Mark Nigh
Sent: Wednesday, May 11, 2011 8:12 AM
To: 'ceph-devel@vger.kernel.org'
Subject: OSD Crash

I was performing a few failure tests with the osd by removing an HDD from one of the osd hosts. All was well; the cluster noticed the failure and rebalanced the data, but when I replaced the HDD in the host, cosd crashed.

Here is my setup: 6 osd hosts with 4 HDDs each (4 cosd daemons running on each host), plus 1 mon and 2 mds (on separate hosts).

Here is the log from osd0:

2011-05-10 16:25:02.776151 7f9e16d36700 -- 10.6.1.92:6800/15566 >> 10.6.1.63:0/2322371038 pipe(0x4315a00 sd=14 pgs=0 cs=0 l=0).accept peer addr is really 10.6.1.63:0/2322371038 (socket is 10.6.1.63:42299/0)
os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&)', in thread '0x7f9e22577700'
os/FileStore.cc: 2120: FAILED assert(0 == "EIO handling not implemented")
 ceph version 0.27 (commit:793034c62c8e9ffab4af675ca97135fd1b193c9c)
 1: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x194) [0x5a0c84]
 2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x156) [0x5a3536]
 3: (FileStore::_do_op(FileStore::OpSequencer*)+0x13e) [0x598ebe]
 4: (ThreadPool::worker()+0x2a2) [0x626fa2]
 5: (ThreadPool::WorkThread::entry()+0xd) [0x529f1d]
 6: (()+0x6d8c) [0x7f9e29434d8c]
 7: (clone()+0x6d) [0x7f9e2808204d]
 ceph version 0.27 (commit:793034c62c8e9ffab4af675ca97135fd1b193c9c)
 1: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x194) [0x5a0c84]
 2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x156) [0x5a3536]
 3: (FileStore::_do_op(FileStore::OpSequencer*)+0x13e) [0x598ebe]
 4: (ThreadPool::worker()+0x2a2) [0x626fa2]
 5: (ThreadPool::WorkThread::entry()+0xd) [0x529f1d]
 6: (()+0x6d8c) [0x7f9e29434d8c]
 7: (clone()+0x6d) [0x7f9e2808204d]
os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&)', in thread '0x7f9e21d76700'
os/FileStore.cc: 2120: FAILED assert(0 == "EIO handling not implemented")
 ceph version 0.27 (commit:793034c62c8e9ffab4af675ca97135fd1b193c9c)
 1: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x194) [0x5a0c84]
 2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x156) [0x5a3536]
 3: (FileStore::_do_op(FileStore::OpSequencer*)+0x13e) [0x598ebe]
 4: (ThreadPool::worker()+0x2a2) [0x626fa2]
 5: (ThreadPool::WorkThread::entry()+0xd) [0x529f1d]
 6: (()+0x6d8c) [0x7f9e29434d8c]
 7: (clone()+0x6d) [0x7f9e2808204d]
 ceph version 0.27 (commit:793034c62c8e9ffab4af675ca97135fd1b193c9c)
 1: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x194) [0x5a0c84]
 2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x156) [0x5a3536]
 3: (FileStore::_do_op(FileStore::OpSequencer*)+0x13e) [0x598ebe]
 4: (ThreadPool::worker()+0x2a2) [0x626fa2]
 5: (ThreadPool::WorkThread::entry()+0xd) [0x529f1d]
 6: (()+0x6d8c) [0x7f9e29434d8c]
 7: (clone()+0x6d) [0x7f9e2808204d]
*** Caught signal (Aborted) **
 in thread 0x7f9e22577700
ceph version 0.27.commit: 793034c62c8e9ffab4af675ca97135fd1b193c9c. process: cosd. pid: 1414
2011-05-10 22:01:13.762083 7f0620492760 filestore(/mnt/osd0) mount FIEMAP ioctl is NOT supported
2011-05-10 22:01:13.762276 7f0620492760 filestore(/mnt/osd0) mount detected btrfs
2011-05-10 22:01:13.762288 7f0620492760 filestore(/mnt/osd0) mount btrfs CLONE_RANGE ioctl is supported
*** Caught signal (Terminated) **
 in thread 0x7f061e7b4700. Shutting down.

As you can see in the attached log, I tried to restart cosd at 22:01. The service started, but ceph -s doesn't include the osd.

Thanks for your help.

Mark Nigh
Systems Architect
Netelligent Corporation
mnigh@netelligent.com




* OSD Crash
@ 2011-05-11 13:12 Mark Nigh
  0 siblings, 0 replies; 28+ messages in thread
From: Mark Nigh @ 2011-05-11 13:12 UTC (permalink / raw)
  To: ceph-devel

I was performing a few failure tests with the osd by removing an HDD from one of the osd hosts. All was well; the cluster noticed the failure and rebalanced the data, but when I replaced the HDD in the host, cosd crashed.

Here is my setup: 6 osd hosts with 4 HDDs each (4 cosd daemons running on each host), plus 1 mon and 2 mds (on separate hosts).

Here is the log from osd0:

2011-05-10 16:25:02.776151 7f9e16d36700 -- 10.6.1.92:6800/15566 >> 10.6.1.63:0/2322371038 pipe(0x4315a00 sd=14 pgs=0 cs=0 l=0).accept peer addr is really 10.6.1.63:0/2322371038 (socket is 10.6.1.63:42299/0)
os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&)', in thread '0x7f9e22577700'
os/FileStore.cc: 2120: FAILED assert(0 == "EIO handling not implemented")
 ceph version 0.27 (commit:793034c62c8e9ffab4af675ca97135fd1b193c9c)
 1: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x194) [0x5a0c84]
 2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x156) [0x5a3536]
 3: (FileStore::_do_op(FileStore::OpSequencer*)+0x13e) [0x598ebe]
 4: (ThreadPool::worker()+0x2a2) [0x626fa2]
 5: (ThreadPool::WorkThread::entry()+0xd) [0x529f1d]
 6: (()+0x6d8c) [0x7f9e29434d8c]
 7: (clone()+0x6d) [0x7f9e2808204d]
 ceph version 0.27 (commit:793034c62c8e9ffab4af675ca97135fd1b193c9c)
 1: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x194) [0x5a0c84]
 2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x156) [0x5a3536]
 3: (FileStore::_do_op(FileStore::OpSequencer*)+0x13e) [0x598ebe]
 4: (ThreadPool::worker()+0x2a2) [0x626fa2]
 5: (ThreadPool::WorkThread::entry()+0xd) [0x529f1d]
 6: (()+0x6d8c) [0x7f9e29434d8c]
 7: (clone()+0x6d) [0x7f9e2808204d]
os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&)', in thread '0x7f9e21d76700'
os/FileStore.cc: 2120: FAILED assert(0 == "EIO handling not implemented")
 ceph version 0.27 (commit:793034c62c8e9ffab4af675ca97135fd1b193c9c)
 1: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x194) [0x5a0c84]
 2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x156) [0x5a3536]
 3: (FileStore::_do_op(FileStore::OpSequencer*)+0x13e) [0x598ebe]
 4: (ThreadPool::worker()+0x2a2) [0x626fa2]
 5: (ThreadPool::WorkThread::entry()+0xd) [0x529f1d]
 6: (()+0x6d8c) [0x7f9e29434d8c]
 7: (clone()+0x6d) [0x7f9e2808204d]
 ceph version 0.27 (commit:793034c62c8e9ffab4af675ca97135fd1b193c9c)
 1: (FileStore::_do_transaction(ObjectStore::Transaction&)+0x194) [0x5a0c84]
 2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x156) [0x5a3536]
 3: (FileStore::_do_op(FileStore::OpSequencer*)+0x13e) [0x598ebe]
 4: (ThreadPool::worker()+0x2a2) [0x626fa2]
 5: (ThreadPool::WorkThread::entry()+0xd) [0x529f1d]
 6: (()+0x6d8c) [0x7f9e29434d8c]
 7: (clone()+0x6d) [0x7f9e2808204d]
*** Caught signal (Aborted) **
 in thread 0x7f9e22577700
ceph version 0.27.commit: 793034c62c8e9ffab4af675ca97135fd1b193c9c. process: cosd. pid: 1414
2011-05-10 22:01:13.762083 7f0620492760 filestore(/mnt/osd0) mount FIEMAP ioctl is NOT supported
2011-05-10 22:01:13.762276 7f0620492760 filestore(/mnt/osd0) mount detected btrfs
2011-05-10 22:01:13.762288 7f0620492760 filestore(/mnt/osd0) mount btrfs CLONE_RANGE ioctl is supported
*** Caught signal (Terminated) **
 in thread 0x7f061e7b4700. Shutting down.

As you can see in the attached log, I tried to restart cosd at 22:01. The service started, but ceph -s doesn't include the osd.

Thanks for your help.

Mark Nigh
Systems Architect
Netelligent Corporation
mnigh@netelligent.com




