* BUG: crash in mds after recovery
@ 2011-07-23  5:28 Marcus Sorensen
  2011-07-23 15:01 ` Marcus Sorensen
  2011-07-23 21:55 ` Sage Weil
  0 siblings, 2 replies; 5+ messages in thread
From: Marcus Sorensen @ 2011-07-23  5:28 UTC (permalink / raw)
  To: ceph-devel

Should these be reported here or perhaps through the tracker? Just a
warning, I'm going to purposely be trying to break things and figuring
out how to recover (or at least what never to do :), so now's probably
a good time to set me straight on procedure.

I was fiddling with my test cluster tonight and ran into this.
Basically I decided to reboot the cluster, and after the reboot I had to
play a bit with the journals to get things going again. I didn't
realize that putting them on tmpfs would require me to touch the journal file,
pull down the monmap, and run 'cosd -i <num> --monmap <monmap>
--mkjournal' before the OSDs could come online. For some reason I had
the impression it would create the journal... I did something like
'service ceph start', saw that the OSD was mounted but cosd failed,
pulled the monmap, recreated the journal, then ran 'service ceph start
osd.<num>'. After getting these online I noticed the mds services were
crashing on all three nodes like so. Note also that there were some
clients connected beforehand.
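
Roughly, the per-osd sequence was something like this (the journal path
and the monmap-fetch command here are approximations for illustration,
not exactly what I typed):

  id=0
  touch /journal/osd.$id.journal                # journal file on the tmpfs has to exist first
  ceph mon getmap -o /tmp/monmap                # pull down the current monmap
  cosd -i $id --monmap /tmp/monmap --mkjournal  # recreate the journal for this osd
  service ceph start osd.$id                    # bring the osd back up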

[root@ceph3 ~]# 2011-07-22 23:03:57.640115 7f27d02cd700 mds-1.0
handle_mds_map standby
2011-07-22 23:04:21.928388 7f27d02cd700 mds-1.0 handle_mds_map standby
2011-07-22 23:04:22.164395 7f27d02cd700 mds-1.0 handle_mds_map standby
2011-07-22 23:04:37.634939 7f27d02cd700 mds0.22 handle_mds_map i am now mds0.22
2011-07-22 23:04:37.634967 7f27d02cd700 mds0.22 handle_mds_map state
change up:standby --> up:replay
2011-07-22 23:04:37.634975 7f27d02cd700 mds0.22 replay_start
2011-07-22 23:04:37.634984 7f27d02cd700 mds0.22  recovery set is
2011-07-22 23:04:37.634990 7f27d02cd700 mds0.22  need osdmap epoch 131, have 128
2011-07-22 23:04:37.634995 7f27d02cd700 mds0.22  waiting for osdmap
131 (which blacklists prior instance)
2011-07-22 23:04:37.635027 7f27d02cd700 mds0.cache handle_mds_failure
mds0 : recovery peers are
2011-07-22 23:04:37.644053 7f27d02cd700 mds0.22 ms_handle_connect on
10.10.10.93:6801/3468
2011-07-22 23:04:37.649460 7f27d02cd700 mds0.22 ms_handle_connect on
10.10.10.91:6801/4314
2011-07-22 23:04:37.654568 7f27d02cd700 mds0.22 ms_handle_connect on
10.10.10.92:6803/4193
2011-07-22 23:04:37.725419 7f27d02cd700 mds0.cache creating system
inode with ino:100
2011-07-22 23:04:37.725542 7f27d02cd700 mds0.cache creating system
inode with ino:1
2011-07-22 23:04:39.223574 7f27cdbc2700 mds0.22 replay_done
2011-07-22 23:04:39.223636 7f27cdbc2700 mds0.22 making mds journal writeable
2011-07-22 23:04:39.239472 7f27d02cd700 mds0.22 handle_mds_map i am now mds0.22
2011-07-22 23:04:39.239768 7f27d02cd700 mds0.22 handle_mds_map state
change up:replay --> up:reconnect
2011-07-22 23:04:39.239981 7f27d02cd700 mds0.22 reconnect_start
2011-07-22 23:04:39.240187 7f27d02cd700 mds0.22 reopen_log
2011-07-22 23:04:39.240399 7f27d02cd700 mds0.server reconnect_clients
-- 3 sessions
2011-07-22 23:05:27.040962 7f27cf0c9700 mds0.server reconnect gave up
on client4108 10.10.10.99:0/1929168255
2011-07-22 23:05:27.041021 7f27cf0c9700 mds0.server reconnect gave up
on client4906 10.10.10.99:0/3243338282
2011-07-22 23:05:27.041038 7f27cf0c9700 mds0.server reconnect gave up
on client5114 10.10.10.99:0/2910973772
2011-07-22 23:05:27.041104 7f27cf0c9700 mds0.22 reconnect_done
2011-07-22 23:05:27.061458 7f27d02cd700 mds0.22 handle_mds_map i am now mds0.22
2011-07-22 23:05:27.061480 7f27d02cd700 mds0.22 handle_mds_map state
change up:reconnect --> up:rejoin
2011-07-22 23:05:27.061490 7f27d02cd700 mds0.22 rejoin_joint_start
2011-07-22 23:05:27.161091 7f27d02cd700 mds0.22 rejoin_done
2011-07-22 23:05:27.181033 7f27d02cd700 mds0.22 handle_mds_map i am now mds0.22
2011-07-22 23:05:27.181092 7f27d02cd700 mds0.22 handle_mds_map state
change up:rejoin --> up:active
2011-07-22 23:05:27.181107 7f27d02cd700 mds0.22 recovery_done --
successful recovery!
2011-07-22 23:05:27.188878 7f27d02cd700 mds0.22 active_start
2011-07-22 23:05:27.191328 7f27d02cd700 mds0.22 cluster recovered.
mds/Locker.cc: In function 'bool Locker::issue_caps(CInode*,
Capability*)', in thread '0x7f27d02cd700'
mds/Locker.cc: 1459: FAILED assert(in->is_head())
 ceph version 0.31 (commit:9019c6ce64053ad515a493e912e2e63ba9b8e278)
 1: (Locker::issue_caps(CInode*, Capability*)+0x1a0c) [0x5b25dc]
 2: (Locker::issue_caps_set(std::set<CInode*, std::less<CInode*>,
std::allocator<CInode*> >&)+0x26) [0x5b3ff6]
 3: (Locker::file_update_finish(CInode*, Mutation*, bool, client_t,
Capability*, MClientCaps*)+0x665) [0x5c2a45]
 4: (C_Locker_FileUpdate_finish::finish(int)+0x2c) [0x5d6b1c]
 5: (finish_contexts(CephContext*, std::list<Context*,
std::allocator<Context*> >&, int)+0xd0) [0x591250]
 6: (Journaler::_finish_flush(int, unsigned long, utime_t)+0x20e) [0x6ac6be]
 7: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xa8e) [0x69290e]
 8: (MDS::handle_core_message(Message*)+0x86f) [0x49e67f]
 9: (MDS::_dispatch(Message*)+0x2aa) [0x49e99a]
 10: (MDS::ms_dispatch(Message*)+0x6d) [0x4a060d]
 11: (SimpleMessenger::dispatch_entry()+0x8a1) [0x6f45b1]
 12: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4806ac]
 13: (()+0x77e1) [0x7f27d354a7e1]
 14: (clone()+0x6d) [0x7f27d2b048ed]
 ceph version 0.31 (commit:9019c6ce64053ad515a493e912e2e63ba9b8e278)
 1: (Locker::issue_caps(CInode*, Capability*)+0x1a0c) [0x5b25dc]
 2: (Locker::issue_caps_set(std::set<CInode*, std::less<CInode*>,
std::allocator<CInode*> >&)+0x26) [0x5b3ff6]
 3: (Locker::file_update_finish(CInode*, Mutation*, bool, client_t,
Capability*, MClientCaps*)+0x665) [0x5c2a45]
 4: (C_Locker_FileUpdate_finish::finish(int)+0x2c) [0x5d6b1c]
 5: (finish_contexts(CephContext*, std::list<Context*,
std::allocator<Context*> >&, int)+0xd0) [0x591250]
 6: (Journaler::_finish_flush(int, unsigned long, utime_t)+0x20e) [0x6ac6be]
 7: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xa8e) [0x69290e]
 8: (MDS::handle_core_message(Message*)+0x86f) [0x49e67f]
 9: (MDS::_dispatch(Message*)+0x2aa) [0x49e99a]
 10: (MDS::ms_dispatch(Message*)+0x6d) [0x4a060d]
 11: (SimpleMessenger::dispatch_entry()+0x8a1) [0x6f45b1]
 12: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4806ac]
 13: (()+0x77e1) [0x7f27d354a7e1]
 14: (clone()+0x6d) [0x7f27d2b048ed]
*** Caught signal (Aborted) **
 in thread 0x7f27d02cd700
 ceph version 0.31 (commit:9019c6ce64053ad515a493e912e2e63ba9b8e278)
 1: /usr/bin/cmds() [0x747ea9]
 2: (()+0xf4c0) [0x7f27d35524c0]
 3: (gsignal()+0x35) [0x7f27d2a559a5]
 4: (abort()+0x175) [0x7f27d2a57185]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7f27d32faaad]
 6: (()+0xbcc36) [0x7f27d32f8c36]
 7: (()+0xbcc63) [0x7f27d32f8c63]
 8: (()+0xbcd5e) [0x7f27d32f8d5e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x3a7) [0x6c6a47]
 10: (Locker::issue_caps(CInode*, Capability*)+0x1a0c) [0x5b25dc]
 11: (Locker::issue_caps_set(std::set<CInode*, std::less<CInode*>,
std::allocator<CInode*> >&)+0x26) [0x5b3ff6]
 12: (Locker::file_update_finish(CInode*, Mutation*, bool, client_t,
Capability*, MClientCaps*)+0x665) [0x5c2a45]
 13: (C_Locker_FileUpdate_finish::finish(int)+0x2c) [0x5d6b1c]
 14: (finish_contexts(CephContext*, std::list<Context*,
std::allocator<Context*> >&, int)+0xd0) [0x591250]
 15: (Journaler::_finish_flush(int, unsigned long, utime_t)+0x20e) [0x6ac6be]
 16: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xa8e) [0x69290e]
 17: (MDS::handle_core_message(Message*)+0x86f) [0x49e67f]
 18: (MDS::_dispatch(Message*)+0x2aa) [0x49e99a]
 19: (MDS::ms_dispatch(Message*)+0x6d) [0x4a060d]
 20: (SimpleMessenger::dispatch_entry()+0x8a1) [0x6f45b1]
 21: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4806ac]
 22: (()+0x77e1) [0x7f27d354a7e1]
 23: (clone()+0x6d) [0x7f27d2b048ed]



2011-07-22 23:18:55.727738   mds e196: 1/1/1 up
{0=ceph3=up:active(laggy or crashed)}, 1 up:standby
2011-07-22 23:18:55.742299   osd e140: 3 osds: 3 up, 3 in
2011-07-22 23:18:55.743943   mds e197: 1/1/1 up {0=ceph3=up:replay}
2011-07-22 23:18:55.758649    pg v957: 594 pgs: 594 active+clean; 3760
MB data, 13164 MB used, 18063 MB / 35997 MB avail
2011-07-22 23:18:55.792549   log 2011-07-22 23:18:58.459865 mon0
10.10.10.91:6789/0 26 : [INF] mds? 10.10.10.93:6800/4702 up:boot
2011-07-22 23:18:57.304880   mds e198: 1/1/1 up {0=ceph3=up:reconnect}
2011-07-22 23:18:57.367480   log 2011-07-22 23:19:00.037792 mon0
10.10.10.91:6789/0 27 : [INF] mds0 10.10.10.93:6800/4702 up:reconnect
2011-07-22 23:18:58.493288    pg v958: 594 pgs: 594 active+clean; 3760
MB data, 13164 MB used, 18063 MB / 35997 MB avail
2011-07-22 23:19:01.139243    pg v959: 594 pgs: 594 active+clean; 3760
MB data, 13163 MB used, 18063 MB / 35997 MB avail
2011-07-22 23:19:21.118647   osd e141: 3 osds: 3 up, 3 in
2011-07-22 23:19:21.141306    pg v960: 594 pgs: 594 active+clean; 3760
MB data, 13163 MB used, 18063 MB / 35997 MB avail
2011-07-22 23:19:23.488866    pg v961: 594 pgs: 594 active+clean; 3760
MB data, 13165 MB used, 18063 MB / 35997 MB avail
2011-07-22 23:19:24.516985    pg v962: 594 pgs: 594 active+clean; 3760
MB data, 13167 MB used, 18063 MB / 35997 MB avail
2011-07-22 23:19:46.738637   mds e199: 1/1/1 up {0=ceph3=up:rejoin}
2011-07-22 23:19:46.800581   log 2011-07-22 23:19:48.667174 mon0
10.10.10.91:6789/0 28 : [INF] mds0 10.10.10.93:6800/4702 up:rejoin
2011-07-22 23:19:46.859582   mds e200: 1/1/1 up {0=ceph3=up:active}
2011-07-22 23:19:47.814926   log 2011-07-22 23:19:48.787984 mon0
10.10.10.91:6789/0 29 : [INF] mds0 10.10.10.93:6800/4702 up:active
2011-07-22 23:19:49.584346    pg v963: 594 pgs: 594 active+clean; 3761
MB data, 13168 MB used, 18063 MB / 35997 MB avail
2011-07-22 23:19:50.598374    pg v964: 594 pgs: 594 active+clean; 3761
MB data, 13166 MB used, 18062 MB / 35997 MB avail
2011-07-22 23:20:01.943931   mds e201: 1/1/1 up
{0=ceph3=up:active(laggy or crashed)}


* Re: BUG: crash in mds after recovery
  2011-07-23  5:28 BUG: crash in mds after recovery Marcus Sorensen
@ 2011-07-23 15:01 ` Marcus Sorensen
  2011-07-23 21:55 ` Sage Weil
  1 sibling, 0 replies; 5+ messages in thread
From: Marcus Sorensen @ 2011-07-23 15:01 UTC (permalink / raw)
  To: ceph-devel

While browsing this code it became obvious to me that this had something
to do with the snapshots, so I should mention that I did have two
snapshots created on my ceph filesystem. I started fresh last night,
this time with a hard disk journal, and ran through the same steps:
mount the filesystem, copy a few files, snapshot, delete a few files,
take another snapshot, delete a few more, restart an mds, and bam, the
mds won't stay up.

After looking through the code this morning and seeing that it was
checking for the snapshot head, I continually restarted all of the mds
nodes until eventually the client stayed connected long enough to delete
the snapshots. After that everything was fine.
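
For what it's worth, the cleanup on the client boiled down to something
like this once an mds finally stayed up long enough (the mount point and
snapshot name are only examples):

  ls /mnt/.snap            # list the snapshots on the mounted ceph filesystem
  rmdir /mnt/.snap/daily   # ceph snapshots are removed by rmdir'ing the .snap entry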

On Fri, Jul 22, 2011 at 11:28 PM, Marcus Sorensen <shadowsor@gmail.com> wrote:
> Should these be reported here or perhaps through the tracker? Just a
> warning, I'm going to purposely be trying to break things and figuring
> out how to recover (or at least what never to do :), so now's probably
> a good time to set me straight on procedure.
>
> I was fiddling with my test cluster tonight and ran into this.
> Basically I decided to reboot the cluster, and after the reboot I had to
> play a bit with the journals to get things going again. I didn't
> realize that putting them on tmpfs would require me to touch the journal file,
> pull down the monmap, and run 'cosd -i <num> --monmap <monmap>
> --mkjournal' before the OSDs could come online. For some reason I had
> the impression it would create the journal... I did something like
> 'service ceph start', saw that the OSD was mounted but cosd failed,
> pulled the monmap, recreated the journal, then ran 'service ceph start
> osd.<num>'. After getting these online I noticed the mds services were
> crashing on all three nodes like so. Note also that there were some
> clients connected beforehand.
>
> [root@ceph3 ~]# 2011-07-22 23:03:57.640115 7f27d02cd700 mds-1.0
> handle_mds_map standby
> 2011-07-22 23:04:21.928388 7f27d02cd700 mds-1.0 handle_mds_map standby
> 2011-07-22 23:04:22.164395 7f27d02cd700 mds-1.0 handle_mds_map standby
> 2011-07-22 23:04:37.634939 7f27d02cd700 mds0.22 handle_mds_map i am now mds0.22
> 2011-07-22 23:04:37.634967 7f27d02cd700 mds0.22 handle_mds_map state
> change up:standby --> up:replay
> 2011-07-22 23:04:37.634975 7f27d02cd700 mds0.22 replay_start
> 2011-07-22 23:04:37.634984 7f27d02cd700 mds0.22  recovery set is
> 2011-07-22 23:04:37.634990 7f27d02cd700 mds0.22  need osdmap epoch 131, have 128
> 2011-07-22 23:04:37.634995 7f27d02cd700 mds0.22  waiting for osdmap
> 131 (which blacklists prior instance)
> 2011-07-22 23:04:37.635027 7f27d02cd700 mds0.cache handle_mds_failure
> mds0 : recovery peers are
> 2011-07-22 23:04:37.644053 7f27d02cd700 mds0.22 ms_handle_connect on
> 10.10.10.93:6801/3468
> 2011-07-22 23:04:37.649460 7f27d02cd700 mds0.22 ms_handle_connect on
> 10.10.10.91:6801/4314
> 2011-07-22 23:04:37.654568 7f27d02cd700 mds0.22 ms_handle_connect on
> 10.10.10.92:6803/4193
> 2011-07-22 23:04:37.725419 7f27d02cd700 mds0.cache creating system
> inode with ino:100
> 2011-07-22 23:04:37.725542 7f27d02cd700 mds0.cache creating system
> inode with ino:1
> 2011-07-22 23:04:39.223574 7f27cdbc2700 mds0.22 replay_done
> 2011-07-22 23:04:39.223636 7f27cdbc2700 mds0.22 making mds journal writeable
> 2011-07-22 23:04:39.239472 7f27d02cd700 mds0.22 handle_mds_map i am now mds0.22
> 2011-07-22 23:04:39.239768 7f27d02cd700 mds0.22 handle_mds_map state
> change up:replay --> up:reconnect
> 2011-07-22 23:04:39.239981 7f27d02cd700 mds0.22 reconnect_start
> 2011-07-22 23:04:39.240187 7f27d02cd700 mds0.22 reopen_log
> 2011-07-22 23:04:39.240399 7f27d02cd700 mds0.server reconnect_clients
> -- 3 sessions
> 2011-07-22 23:05:27.040962 7f27cf0c9700 mds0.server reconnect gave up
> on client4108 10.10.10.99:0/1929168255
> 2011-07-22 23:05:27.041021 7f27cf0c9700 mds0.server reconnect gave up
> on client4906 10.10.10.99:0/3243338282
> 2011-07-22 23:05:27.041038 7f27cf0c9700 mds0.server reconnect gave up
> on client5114 10.10.10.99:0/2910973772
> 2011-07-22 23:05:27.041104 7f27cf0c9700 mds0.22 reconnect_done
> 2011-07-22 23:05:27.061458 7f27d02cd700 mds0.22 handle_mds_map i am now mds0.22
> 2011-07-22 23:05:27.061480 7f27d02cd700 mds0.22 handle_mds_map state
> change up:reconnect --> up:rejoin
> 2011-07-22 23:05:27.061490 7f27d02cd700 mds0.22 rejoin_joint_start
> 2011-07-22 23:05:27.161091 7f27d02cd700 mds0.22 rejoin_done
> 2011-07-22 23:05:27.181033 7f27d02cd700 mds0.22 handle_mds_map i am now mds0.22
> 2011-07-22 23:05:27.181092 7f27d02cd700 mds0.22 handle_mds_map state
> change up:rejoin --> up:active
> 2011-07-22 23:05:27.181107 7f27d02cd700 mds0.22 recovery_done --
> successful recovery!
> 2011-07-22 23:05:27.188878 7f27d02cd700 mds0.22 active_start
> 2011-07-22 23:05:27.191328 7f27d02cd700 mds0.22 cluster recovered.
> mds/Locker.cc: In function 'bool Locker::issue_caps(CInode*,
> Capability*)', in thread '0x7f27d02cd700'
> mds/Locker.cc: 1459: FAILED assert(in->is_head())
>  ceph version 0.31 (commit:9019c6ce64053ad515a493e912e2e63ba9b8e278)
>  1: (Locker::issue_caps(CInode*, Capability*)+0x1a0c) [0x5b25dc]
>  2: (Locker::issue_caps_set(std::set<CInode*, std::less<CInode*>,
> std::allocator<CInode*> >&)+0x26) [0x5b3ff6]
>  3: (Locker::file_update_finish(CInode*, Mutation*, bool, client_t,
> Capability*, MClientCaps*)+0x665) [0x5c2a45]
>  4: (C_Locker_FileUpdate_finish::finish(int)+0x2c) [0x5d6b1c]
>  5: (finish_contexts(CephContext*, std::list<Context*,
> std::allocator<Context*> >&, int)+0xd0) [0x591250]
>  6: (Journaler::_finish_flush(int, unsigned long, utime_t)+0x20e) [0x6ac6be]
>  7: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xa8e) [0x69290e]
>  8: (MDS::handle_core_message(Message*)+0x86f) [0x49e67f]
>  9: (MDS::_dispatch(Message*)+0x2aa) [0x49e99a]
>  10: (MDS::ms_dispatch(Message*)+0x6d) [0x4a060d]
>  11: (SimpleMessenger::dispatch_entry()+0x8a1) [0x6f45b1]
>  12: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4806ac]
>  13: (()+0x77e1) [0x7f27d354a7e1]
>  14: (clone()+0x6d) [0x7f27d2b048ed]
>  ceph version 0.31 (commit:9019c6ce64053ad515a493e912e2e63ba9b8e278)
>  1: (Locker::issue_caps(CInode*, Capability*)+0x1a0c) [0x5b25dc]
>  2: (Locker::issue_caps_set(std::set<CInode*, std::less<CInode*>,
> std::allocator<CInode*> >&)+0x26) [0x5b3ff6]
>  3: (Locker::file_update_finish(CInode*, Mutation*, bool, client_t,
> Capability*, MClientCaps*)+0x665) [0x5c2a45]
>  4: (C_Locker_FileUpdate_finish::finish(int)+0x2c) [0x5d6b1c]
>  5: (finish_contexts(CephContext*, std::list<Context*,
> std::allocator<Context*> >&, int)+0xd0) [0x591250]
>  6: (Journaler::_finish_flush(int, unsigned long, utime_t)+0x20e) [0x6ac6be]
>  7: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xa8e) [0x69290e]
>  8: (MDS::handle_core_message(Message*)+0x86f) [0x49e67f]
>  9: (MDS::_dispatch(Message*)+0x2aa) [0x49e99a]
>  10: (MDS::ms_dispatch(Message*)+0x6d) [0x4a060d]
>  11: (SimpleMessenger::dispatch_entry()+0x8a1) [0x6f45b1]
>  12: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4806ac]
>  13: (()+0x77e1) [0x7f27d354a7e1]
>  14: (clone()+0x6d) [0x7f27d2b048ed]
> *** Caught signal (Aborted) **
>  in thread 0x7f27d02cd700
>  ceph version 0.31 (commit:9019c6ce64053ad515a493e912e2e63ba9b8e278)
>  1: /usr/bin/cmds() [0x747ea9]
>  2: (()+0xf4c0) [0x7f27d35524c0]
>  3: (gsignal()+0x35) [0x7f27d2a559a5]
>  4: (abort()+0x175) [0x7f27d2a57185]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7f27d32faaad]
>  6: (()+0xbcc36) [0x7f27d32f8c36]
>  7: (()+0xbcc63) [0x7f27d32f8c63]
>  8: (()+0xbcd5e) [0x7f27d32f8d5e]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x3a7) [0x6c6a47]
>  10: (Locker::issue_caps(CInode*, Capability*)+0x1a0c) [0x5b25dc]
>  11: (Locker::issue_caps_set(std::set<CInode*, std::less<CInode*>,
> std::allocator<CInode*> >&)+0x26) [0x5b3ff6]
>  12: (Locker::file_update_finish(CInode*, Mutation*, bool, client_t,
> Capability*, MClientCaps*)+0x665) [0x5c2a45]
>  13: (C_Locker_FileUpdate_finish::finish(int)+0x2c) [0x5d6b1c]
>  14: (finish_contexts(CephContext*, std::list<Context*,
> std::allocator<Context*> >&, int)+0xd0) [0x591250]
>  15: (Journaler::_finish_flush(int, unsigned long, utime_t)+0x20e) [0x6ac6be]
>  16: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xa8e) [0x69290e]
>  17: (MDS::handle_core_message(Message*)+0x86f) [0x49e67f]
>  18: (MDS::_dispatch(Message*)+0x2aa) [0x49e99a]
>  19: (MDS::ms_dispatch(Message*)+0x6d) [0x4a060d]
>  20: (SimpleMessenger::dispatch_entry()+0x8a1) [0x6f45b1]
>  21: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4806ac]
>  22: (()+0x77e1) [0x7f27d354a7e1]
>  23: (clone()+0x6d) [0x7f27d2b048ed]
>
>
>
> 2011-07-22 23:18:55.727738   mds e196: 1/1/1 up
> {0=ceph3=up:active(laggy or crashed)}, 1 up:standby
> 2011-07-22 23:18:55.742299   osd e140: 3 osds: 3 up, 3 in
> 2011-07-22 23:18:55.743943   mds e197: 1/1/1 up {0=ceph3=up:replay}
> 2011-07-22 23:18:55.758649    pg v957: 594 pgs: 594 active+clean; 3760
> MB data, 13164 MB used, 18063 MB / 35997 MB avail
> 2011-07-22 23:18:55.792549   log 2011-07-22 23:18:58.459865 mon0
> 10.10.10.91:6789/0 26 : [INF] mds? 10.10.10.93:6800/4702 up:boot
> 2011-07-22 23:18:57.304880   mds e198: 1/1/1 up {0=ceph3=up:reconnect}
> 2011-07-22 23:18:57.367480   log 2011-07-22 23:19:00.037792 mon0
> 10.10.10.91:6789/0 27 : [INF] mds0 10.10.10.93:6800/4702 up:reconnect
> 2011-07-22 23:18:58.493288    pg v958: 594 pgs: 594 active+clean; 3760
> MB data, 13164 MB used, 18063 MB / 35997 MB avail
> 2011-07-22 23:19:01.139243    pg v959: 594 pgs: 594 active+clean; 3760
> MB data, 13163 MB used, 18063 MB / 35997 MB avail
> 2011-07-22 23:19:21.118647   osd e141: 3 osds: 3 up, 3 in
> 2011-07-22 23:19:21.141306    pg v960: 594 pgs: 594 active+clean; 3760
> MB data, 13163 MB used, 18063 MB / 35997 MB avail
> 2011-07-22 23:19:23.488866    pg v961: 594 pgs: 594 active+clean; 3760
> MB data, 13165 MB used, 18063 MB / 35997 MB avail
> 2011-07-22 23:19:24.516985    pg v962: 594 pgs: 594 active+clean; 3760
> MB data, 13167 MB used, 18063 MB / 35997 MB avail
> 2011-07-22 23:19:46.738637   mds e199: 1/1/1 up {0=ceph3=up:rejoin}
> 2011-07-22 23:19:46.800581   log 2011-07-22 23:19:48.667174 mon0
> 10.10.10.91:6789/0 28 : [INF] mds0 10.10.10.93:6800/4702 up:rejoin
> 2011-07-22 23:19:46.859582   mds e200: 1/1/1 up {0=ceph3=up:active}
> 2011-07-22 23:19:47.814926   log 2011-07-22 23:19:48.787984 mon0
> 10.10.10.91:6789/0 29 : [INF] mds0 10.10.10.93:6800/4702 up:active
> 2011-07-22 23:19:49.584346    pg v963: 594 pgs: 594 active+clean; 3761
> MB data, 13168 MB used, 18063 MB / 35997 MB avail
> 2011-07-22 23:19:50.598374    pg v964: 594 pgs: 594 active+clean; 3761
> MB data, 13166 MB used, 18062 MB / 35997 MB avail
> 2011-07-22 23:20:01.943931   mds e201: 1/1/1 up
> {0=ceph3=up:active(laggy or crashed)}
>


* Re: BUG: crash in mds after recovery
  2011-07-23  5:28 BUG: crash in mds after recovery Marcus Sorensen
  2011-07-23 15:01 ` Marcus Sorensen
@ 2011-07-23 21:55 ` Sage Weil
  2011-07-23 22:28   ` Marcus Sorensen
  1 sibling, 1 reply; 5+ messages in thread
From: Sage Weil @ 2011-07-23 21:55 UTC (permalink / raw)
  To: Marcus Sorensen; +Cc: ceph-devel

On Fri, 22 Jul 2011, Marcus Sorensen wrote:
> Should these be reported here or perhaps through the tracker?

It would be great if you could add this to the tracker with a detailed
description of how you hit the bug and whatever logs you have available.  
On-list is fine too, just a bit more work on our end.

We haven't done a lot of MDS restart testing with snapshot workloads, so 
this isn't terribly surprising.  I'm happy to hear you were able to work 
around it, though!

> Just a
> warning, I'm going to purposely be trying to break things and figuring
> out how to recover (or at least what never to do :), so now's probably
> a good time to set me straight on procedure.
> 
> I was fiddling with my test cluster tonight and ran into this.
> Basically I decided to reboot the cluster, and after the reboot I had to
> play a bit with the journals to get things going again. I didn't
> realize that putting them on tmpfs would require me to touch the journal file,

Putting the journals on tmpfs is not a good idea.  The OSD considers 
anything journaled safe and durable, so you're exposing yourself to data 
loss in the event of a power failure.  Hopefully this is a temporary thing 
to simulate an SSD or NVRAM journal...? :)
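
For reference, the journal location is just the 'osd journal' option in
ceph.conf; something along these lines (the path below is only an
example) keeps it on real storage:

  [osd]
          ; point the journal at a real partition or file, ideally SSD/NVRAM
          osd journal = /srv/osd.$id/journal
          ; journal size in MB, used when the journal is a plain file
          osd journal size = 512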

sage


> pull down the monmap, and run 'cosd -i <num> --monmap <monmap>
> --mkjournal' before the OSDs could come online. For some reason I had
> the impression it would create the journal... I did something like
> 'service ceph start', saw that the OSD was mounted but cosd failed,
> pulled the monmap, recreated the journal, then ran 'service ceph start
> osd.<num>'. After getting these online I noticed the mds services were
> crashing on all three nodes like so. Note also that there were some
> clients connected beforehand.
> 
> [root@ceph3 ~]# 2011-07-22 23:03:57.640115 7f27d02cd700 mds-1.0
> handle_mds_map standby
> 2011-07-22 23:04:21.928388 7f27d02cd700 mds-1.0 handle_mds_map standby
> 2011-07-22 23:04:22.164395 7f27d02cd700 mds-1.0 handle_mds_map standby
> 2011-07-22 23:04:37.634939 7f27d02cd700 mds0.22 handle_mds_map i am now mds0.22
> 2011-07-22 23:04:37.634967 7f27d02cd700 mds0.22 handle_mds_map state
> change up:standby --> up:replay
> 2011-07-22 23:04:37.634975 7f27d02cd700 mds0.22 replay_start
> 2011-07-22 23:04:37.634984 7f27d02cd700 mds0.22  recovery set is
> 2011-07-22 23:04:37.634990 7f27d02cd700 mds0.22  need osdmap epoch 131, have 128
> 2011-07-22 23:04:37.634995 7f27d02cd700 mds0.22  waiting for osdmap
> 131 (which blacklists prior instance)
> 2011-07-22 23:04:37.635027 7f27d02cd700 mds0.cache handle_mds_failure
> mds0 : recovery peers are
> 2011-07-22 23:04:37.644053 7f27d02cd700 mds0.22 ms_handle_connect on
> 10.10.10.93:6801/3468
> 2011-07-22 23:04:37.649460 7f27d02cd700 mds0.22 ms_handle_connect on
> 10.10.10.91:6801/4314
> 2011-07-22 23:04:37.654568 7f27d02cd700 mds0.22 ms_handle_connect on
> 10.10.10.92:6803/4193
> 2011-07-22 23:04:37.725419 7f27d02cd700 mds0.cache creating system
> inode with ino:100
> 2011-07-22 23:04:37.725542 7f27d02cd700 mds0.cache creating system
> inode with ino:1
> 2011-07-22 23:04:39.223574 7f27cdbc2700 mds0.22 replay_done
> 2011-07-22 23:04:39.223636 7f27cdbc2700 mds0.22 making mds journal writeable
> 2011-07-22 23:04:39.239472 7f27d02cd700 mds0.22 handle_mds_map i am now mds0.22
> 2011-07-22 23:04:39.239768 7f27d02cd700 mds0.22 handle_mds_map state
> change up:replay --> up:reconnect
> 2011-07-22 23:04:39.239981 7f27d02cd700 mds0.22 reconnect_start
> 2011-07-22 23:04:39.240187 7f27d02cd700 mds0.22 reopen_log
> 2011-07-22 23:04:39.240399 7f27d02cd700 mds0.server reconnect_clients
> -- 3 sessions
> 2011-07-22 23:05:27.040962 7f27cf0c9700 mds0.server reconnect gave up
> on client4108 10.10.10.99:0/1929168255
> 2011-07-22 23:05:27.041021 7f27cf0c9700 mds0.server reconnect gave up
> on client4906 10.10.10.99:0/3243338282
> 2011-07-22 23:05:27.041038 7f27cf0c9700 mds0.server reconnect gave up
> on client5114 10.10.10.99:0/2910973772
> 2011-07-22 23:05:27.041104 7f27cf0c9700 mds0.22 reconnect_done
> 2011-07-22 23:05:27.061458 7f27d02cd700 mds0.22 handle_mds_map i am now mds0.22
> 2011-07-22 23:05:27.061480 7f27d02cd700 mds0.22 handle_mds_map state
> change up:reconnect --> up:rejoin
> 2011-07-22 23:05:27.061490 7f27d02cd700 mds0.22 rejoin_joint_start
> 2011-07-22 23:05:27.161091 7f27d02cd700 mds0.22 rejoin_done
> 2011-07-22 23:05:27.181033 7f27d02cd700 mds0.22 handle_mds_map i am now mds0.22
> 2011-07-22 23:05:27.181092 7f27d02cd700 mds0.22 handle_mds_map state
> change up:rejoin --> up:active
> 2011-07-22 23:05:27.181107 7f27d02cd700 mds0.22 recovery_done --
> successful recovery!
> 2011-07-22 23:05:27.188878 7f27d02cd700 mds0.22 active_start
> 2011-07-22 23:05:27.191328 7f27d02cd700 mds0.22 cluster recovered.
> mds/Locker.cc: In function 'bool Locker::issue_caps(CInode*,
> Capability*)', in thread '0x7f27d02cd700'
> mds/Locker.cc: 1459: FAILED assert(in->is_head())
>  ceph version 0.31 (commit:9019c6ce64053ad515a493e912e2e63ba9b8e278)
>  1: (Locker::issue_caps(CInode*, Capability*)+0x1a0c) [0x5b25dc]
>  2: (Locker::issue_caps_set(std::set<CInode*, std::less<CInode*>,
> std::allocator<CInode*> >&)+0x26) [0x5b3ff6]
>  3: (Locker::file_update_finish(CInode*, Mutation*, bool, client_t,
> Capability*, MClientCaps*)+0x665) [0x5c2a45]
>  4: (C_Locker_FileUpdate_finish::finish(int)+0x2c) [0x5d6b1c]
>  5: (finish_contexts(CephContext*, std::list<Context*,
> std::allocator<Context*> >&, int)+0xd0) [0x591250]
>  6: (Journaler::_finish_flush(int, unsigned long, utime_t)+0x20e) [0x6ac6be]
>  7: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xa8e) [0x69290e]
>  8: (MDS::handle_core_message(Message*)+0x86f) [0x49e67f]
>  9: (MDS::_dispatch(Message*)+0x2aa) [0x49e99a]
>  10: (MDS::ms_dispatch(Message*)+0x6d) [0x4a060d]
>  11: (SimpleMessenger::dispatch_entry()+0x8a1) [0x6f45b1]
>  12: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4806ac]
>  13: (()+0x77e1) [0x7f27d354a7e1]
>  14: (clone()+0x6d) [0x7f27d2b048ed]
>  ceph version 0.31 (commit:9019c6ce64053ad515a493e912e2e63ba9b8e278)
>  1: (Locker::issue_caps(CInode*, Capability*)+0x1a0c) [0x5b25dc]
>  2: (Locker::issue_caps_set(std::set<CInode*, std::less<CInode*>,
> std::allocator<CInode*> >&)+0x26) [0x5b3ff6]
>  3: (Locker::file_update_finish(CInode*, Mutation*, bool, client_t,
> Capability*, MClientCaps*)+0x665) [0x5c2a45]
>  4: (C_Locker_FileUpdate_finish::finish(int)+0x2c) [0x5d6b1c]
>  5: (finish_contexts(CephContext*, std::list<Context*,
> std::allocator<Context*> >&, int)+0xd0) [0x591250]
>  6: (Journaler::_finish_flush(int, unsigned long, utime_t)+0x20e) [0x6ac6be]
>  7: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xa8e) [0x69290e]
>  8: (MDS::handle_core_message(Message*)+0x86f) [0x49e67f]
>  9: (MDS::_dispatch(Message*)+0x2aa) [0x49e99a]
>  10: (MDS::ms_dispatch(Message*)+0x6d) [0x4a060d]
>  11: (SimpleMessenger::dispatch_entry()+0x8a1) [0x6f45b1]
>  12: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4806ac]
>  13: (()+0x77e1) [0x7f27d354a7e1]
>  14: (clone()+0x6d) [0x7f27d2b048ed]
> *** Caught signal (Aborted) **
>  in thread 0x7f27d02cd700
>  ceph version 0.31 (commit:9019c6ce64053ad515a493e912e2e63ba9b8e278)
>  1: /usr/bin/cmds() [0x747ea9]
>  2: (()+0xf4c0) [0x7f27d35524c0]
>  3: (gsignal()+0x35) [0x7f27d2a559a5]
>  4: (abort()+0x175) [0x7f27d2a57185]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7f27d32faaad]
>  6: (()+0xbcc36) [0x7f27d32f8c36]
>  7: (()+0xbcc63) [0x7f27d32f8c63]
>  8: (()+0xbcd5e) [0x7f27d32f8d5e]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x3a7) [0x6c6a47]
>  10: (Locker::issue_caps(CInode*, Capability*)+0x1a0c) [0x5b25dc]
>  11: (Locker::issue_caps_set(std::set<CInode*, std::less<CInode*>,
> std::allocator<CInode*> >&)+0x26) [0x5b3ff6]
>  12: (Locker::file_update_finish(CInode*, Mutation*, bool, client_t,
> Capability*, MClientCaps*)+0x665) [0x5c2a45]
>  13: (C_Locker_FileUpdate_finish::finish(int)+0x2c) [0x5d6b1c]
>  14: (finish_contexts(CephContext*, std::list<Context*,
> std::allocator<Context*> >&, int)+0xd0) [0x591250]
>  15: (Journaler::_finish_flush(int, unsigned long, utime_t)+0x20e) [0x6ac6be]
>  16: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xa8e) [0x69290e]
>  17: (MDS::handle_core_message(Message*)+0x86f) [0x49e67f]
>  18: (MDS::_dispatch(Message*)+0x2aa) [0x49e99a]
>  19: (MDS::ms_dispatch(Message*)+0x6d) [0x4a060d]
>  20: (SimpleMessenger::dispatch_entry()+0x8a1) [0x6f45b1]
>  21: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4806ac]
>  22: (()+0x77e1) [0x7f27d354a7e1]
>  23: (clone()+0x6d) [0x7f27d2b048ed]
> 
> 
> 
> 2011-07-22 23:18:55.727738   mds e196: 1/1/1 up
> {0=ceph3=up:active(laggy or crashed)}, 1 up:standby
> 2011-07-22 23:18:55.742299   osd e140: 3 osds: 3 up, 3 in
> 2011-07-22 23:18:55.743943   mds e197: 1/1/1 up {0=ceph3=up:replay}
> 2011-07-22 23:18:55.758649    pg v957: 594 pgs: 594 active+clean; 3760
> MB data, 13164 MB used, 18063 MB / 35997 MB avail
> 2011-07-22 23:18:55.792549   log 2011-07-22 23:18:58.459865 mon0
> 10.10.10.91:6789/0 26 : [INF] mds? 10.10.10.93:6800/4702 up:boot
> 2011-07-22 23:18:57.304880   mds e198: 1/1/1 up {0=ceph3=up:reconnect}
> 2011-07-22 23:18:57.367480   log 2011-07-22 23:19:00.037792 mon0
> 10.10.10.91:6789/0 27 : [INF] mds0 10.10.10.93:6800/4702 up:reconnect
> 2011-07-22 23:18:58.493288    pg v958: 594 pgs: 594 active+clean; 3760
> MB data, 13164 MB used, 18063 MB / 35997 MB avail
> 2011-07-22 23:19:01.139243    pg v959: 594 pgs: 594 active+clean; 3760
> MB data, 13163 MB used, 18063 MB / 35997 MB avail
> 2011-07-22 23:19:21.118647   osd e141: 3 osds: 3 up, 3 in
> 2011-07-22 23:19:21.141306    pg v960: 594 pgs: 594 active+clean; 3760
> MB data, 13163 MB used, 18063 MB / 35997 MB avail
> 2011-07-22 23:19:23.488866    pg v961: 594 pgs: 594 active+clean; 3760
> MB data, 13165 MB used, 18063 MB / 35997 MB avail
> 2011-07-22 23:19:24.516985    pg v962: 594 pgs: 594 active+clean; 3760
> MB data, 13167 MB used, 18063 MB / 35997 MB avail
> 2011-07-22 23:19:46.738637   mds e199: 1/1/1 up {0=ceph3=up:rejoin}
> 2011-07-22 23:19:46.800581   log 2011-07-22 23:19:48.667174 mon0
> 10.10.10.91:6789/0 28 : [INF] mds0 10.10.10.93:6800/4702 up:rejoin
> 2011-07-22 23:19:46.859582   mds e200: 1/1/1 up {0=ceph3=up:active}
> 2011-07-22 23:19:47.814926   log 2011-07-22 23:19:48.787984 mon0
> 10.10.10.91:6789/0 29 : [INF] mds0 10.10.10.93:6800/4702 up:active
> 2011-07-22 23:19:49.584346    pg v963: 594 pgs: 594 active+clean; 3761
> MB data, 13168 MB used, 18063 MB / 35997 MB avail
> 2011-07-22 23:19:50.598374    pg v964: 594 pgs: 594 active+clean; 3761
> MB data, 13166 MB used, 18062 MB / 35997 MB avail
> 2011-07-22 23:20:01.943931   mds e201: 1/1/1 up
> {0=ceph3=up:active(laggy or crashed)}


* Re: BUG: crash in mds after recovery
  2011-07-23 21:55 ` Sage Weil
@ 2011-07-23 22:28   ` Marcus Sorensen
  2011-07-25  4:31     ` Marcus Sorensen
  0 siblings, 1 reply; 5+ messages in thread
From: Marcus Sorensen @ 2011-07-23 22:28 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Yeah, I've got a few VMs set up to play with. Sort of painful having
multiple VMs write twice with the hard disk journal.

I'm certain now that I can duplicate this, so I'll start fresh again
once more and keep track of the steps.

Replying-all this time...

On Sat, Jul 23, 2011 at 3:55 PM, Sage Weil <sage@newdream.net> wrote:
> On Fri, 22 Jul 2011, Marcus Sorensen wrote:
>> Should these be reported here or perhaps through the tracker?
>
> It would be great if you could add this to the tracker with a detailed
> description of how you hit the bug and whatever logs you have available.
> On-list is fine too, just a bit more work on our end.
>
> We haven't done a lot of MDS restart testing with snapshot workloads, so
> this isn't terribly surprising.  I'm happy to hear you were able to work
> around it, though!
>
>> Just a
>> warning, I'm going to purposely be trying to break things and figuring
>> out how to recover (or at least what never to do :), so now's probably
>> a good time to set me straight on procedure.
>>
>> I was fiddling with my test cluster tonight and ran into this.
>> Basically I decided to reboot the cluster, and after the reboot I had to
>> play a bit with the journals to get things going again. I didn't
>> realize that putting them on tmpfs would require me to touch the journal file,
>
> Putting the journals on tmpfs is not a good idea.  The OSD considers
> anything journaled safe and durable, so you're exposing yourself to data
> loss in the event of a power failure.  Hopefully this is a temporary thing
> to simulate an SSD or NVRAM journal...? :)
>
> sage
>
>
>> pull down the monmap, and run 'cosd -i <num> --monmap <monmap>
>> --mkjournal' before the OSDs could come online. For some reason I had
>> the impression it would create the journal... I did something like
>> 'service ceph start', saw that the OSD was mounted but cosd failed,
>> pulled the monmap, recreated the journal, then ran 'service ceph start
>> osd.<num>'. After getting these online I noticed the mds services were
>> crashing on all three nodes like so. Note also that there were some
>> clients connected beforehand.
>>
>> [root@ceph3 ~]# 2011-07-22 23:03:57.640115 7f27d02cd700 mds-1.0
>> handle_mds_map standby
>> 2011-07-22 23:04:21.928388 7f27d02cd700 mds-1.0 handle_mds_map standby
>> 2011-07-22 23:04:22.164395 7f27d02cd700 mds-1.0 handle_mds_map standby
>> 2011-07-22 23:04:37.634939 7f27d02cd700 mds0.22 handle_mds_map i am now mds0.22
>> 2011-07-22 23:04:37.634967 7f27d02cd700 mds0.22 handle_mds_map state
>> change up:standby --> up:replay
>> 2011-07-22 23:04:37.634975 7f27d02cd700 mds0.22 replay_start
>> 2011-07-22 23:04:37.634984 7f27d02cd700 mds0.22  recovery set is
>> 2011-07-22 23:04:37.634990 7f27d02cd700 mds0.22  need osdmap epoch 131, have 128
>> 2011-07-22 23:04:37.634995 7f27d02cd700 mds0.22  waiting for osdmap
>> 131 (which blacklists prior instance)
>> 2011-07-22 23:04:37.635027 7f27d02cd700 mds0.cache handle_mds_failure
>> mds0 : recovery peers are
>> 2011-07-22 23:04:37.644053 7f27d02cd700 mds0.22 ms_handle_connect on
>> 10.10.10.93:6801/3468
>> 2011-07-22 23:04:37.649460 7f27d02cd700 mds0.22 ms_handle_connect on
>> 10.10.10.91:6801/4314
>> 2011-07-22 23:04:37.654568 7f27d02cd700 mds0.22 ms_handle_connect on
>> 10.10.10.92:6803/4193
>> 2011-07-22 23:04:37.725419 7f27d02cd700 mds0.cache creating system
>> inode with ino:100
>> 2011-07-22 23:04:37.725542 7f27d02cd700 mds0.cache creating system
>> inode with ino:1
>> 2011-07-22 23:04:39.223574 7f27cdbc2700 mds0.22 replay_done
>> 2011-07-22 23:04:39.223636 7f27cdbc2700 mds0.22 making mds journal writeable
>> 2011-07-22 23:04:39.239472 7f27d02cd700 mds0.22 handle_mds_map i am now mds0.22
>> 2011-07-22 23:04:39.239768 7f27d02cd700 mds0.22 handle_mds_map state
>> change up:replay --> up:reconnect
>> 2011-07-22 23:04:39.239981 7f27d02cd700 mds0.22 reconnect_start
>> 2011-07-22 23:04:39.240187 7f27d02cd700 mds0.22 reopen_log
>> 2011-07-22 23:04:39.240399 7f27d02cd700 mds0.server reconnect_clients
>> -- 3 sessions
>> 2011-07-22 23:05:27.040962 7f27cf0c9700 mds0.server reconnect gave up
>> on client4108 10.10.10.99:0/1929168255
>> 2011-07-22 23:05:27.041021 7f27cf0c9700 mds0.server reconnect gave up
>> on client4906 10.10.10.99:0/3243338282
>> 2011-07-22 23:05:27.041038 7f27cf0c9700 mds0.server reconnect gave up
>> on client5114 10.10.10.99:0/2910973772
>> 2011-07-22 23:05:27.041104 7f27cf0c9700 mds0.22 reconnect_done
>> 2011-07-22 23:05:27.061458 7f27d02cd700 mds0.22 handle_mds_map i am now mds0.22
>> 2011-07-22 23:05:27.061480 7f27d02cd700 mds0.22 handle_mds_map state
>> change up:reconnect --> up:rejoin
>> 2011-07-22 23:05:27.061490 7f27d02cd700 mds0.22 rejoin_joint_start
>> 2011-07-22 23:05:27.161091 7f27d02cd700 mds0.22 rejoin_done
>> 2011-07-22 23:05:27.181033 7f27d02cd700 mds0.22 handle_mds_map i am now mds0.22
>> 2011-07-22 23:05:27.181092 7f27d02cd700 mds0.22 handle_mds_map state
>> change up:rejoin --> up:active
>> 2011-07-22 23:05:27.181107 7f27d02cd700 mds0.22 recovery_done --
>> successful recovery!
>> 2011-07-22 23:05:27.188878 7f27d02cd700 mds0.22 active_start
>> 2011-07-22 23:05:27.191328 7f27d02cd700 mds0.22 cluster recovered.
>> mds/Locker.cc: In function 'bool Locker::issue_caps(CInode*,
>> Capability*)', in thread '0x7f27d02cd700'
>> mds/Locker.cc: 1459: FAILED assert(in->is_head())
>>  ceph version 0.31 (commit:9019c6ce64053ad515a493e912e2e63ba9b8e278)
>>  1: (Locker::issue_caps(CInode*, Capability*)+0x1a0c) [0x5b25dc]
>>  2: (Locker::issue_caps_set(std::set<CInode*, std::less<CInode*>,
>> std::allocator<CInode*> >&)+0x26) [0x5b3ff6]
>>  3: (Locker::file_update_finish(CInode*, Mutation*, bool, client_t,
>> Capability*, MClientCaps*)+0x665) [0x5c2a45]
>>  4: (C_Locker_FileUpdate_finish::finish(int)+0x2c) [0x5d6b1c]
>>  5: (finish_contexts(CephContext*, std::list<Context*,
>> std::allocator<Context*> >&, int)+0xd0) [0x591250]
>>  6: (Journaler::_finish_flush(int, unsigned long, utime_t)+0x20e) [0x6ac6be]
>>  7: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xa8e) [0x69290e]
>>  8: (MDS::handle_core_message(Message*)+0x86f) [0x49e67f]
>>  9: (MDS::_dispatch(Message*)+0x2aa) [0x49e99a]
>>  10: (MDS::ms_dispatch(Message*)+0x6d) [0x4a060d]
>>  11: (SimpleMessenger::dispatch_entry()+0x8a1) [0x6f45b1]
>>  12: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4806ac]
>>  13: (()+0x77e1) [0x7f27d354a7e1]
>>  14: (clone()+0x6d) [0x7f27d2b048ed]
>>  ceph version 0.31 (commit:9019c6ce64053ad515a493e912e2e63ba9b8e278)
>>  1: (Locker::issue_caps(CInode*, Capability*)+0x1a0c) [0x5b25dc]
>>  2: (Locker::issue_caps_set(std::set<CInode*, std::less<CInode*>,
>> std::allocator<CInode*> >&)+0x26) [0x5b3ff6]
>>  3: (Locker::file_update_finish(CInode*, Mutation*, bool, client_t,
>> Capability*, MClientCaps*)+0x665) [0x5c2a45]
>>  4: (C_Locker_FileUpdate_finish::finish(int)+0x2c) [0x5d6b1c]
>>  5: (finish_contexts(CephContext*, std::list<Context*,
>> std::allocator<Context*> >&, int)+0xd0) [0x591250]
>>  6: (Journaler::_finish_flush(int, unsigned long, utime_t)+0x20e) [0x6ac6be]
>>  7: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xa8e) [0x69290e]
>>  8: (MDS::handle_core_message(Message*)+0x86f) [0x49e67f]
>>  9: (MDS::_dispatch(Message*)+0x2aa) [0x49e99a]
>>  10: (MDS::ms_dispatch(Message*)+0x6d) [0x4a060d]
>>  11: (SimpleMessenger::dispatch_entry()+0x8a1) [0x6f45b1]
>>  12: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4806ac]
>>  13: (()+0x77e1) [0x7f27d354a7e1]
>>  14: (clone()+0x6d) [0x7f27d2b048ed]
>> *** Caught signal (Aborted) **
>>  in thread 0x7f27d02cd700
>>  ceph version 0.31 (commit:9019c6ce64053ad515a493e912e2e63ba9b8e278)
>>  1: /usr/bin/cmds() [0x747ea9]
>>  2: (()+0xf4c0) [0x7f27d35524c0]
>>  3: (gsignal()+0x35) [0x7f27d2a559a5]
>>  4: (abort()+0x175) [0x7f27d2a57185]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7f27d32faaad]
>>  6: (()+0xbcc36) [0x7f27d32f8c36]
>>  7: (()+0xbcc63) [0x7f27d32f8c63]
>>  8: (()+0xbcd5e) [0x7f27d32f8d5e]
>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x3a7) [0x6c6a47]
>>  10: (Locker::issue_caps(CInode*, Capability*)+0x1a0c) [0x5b25dc]
>>  11: (Locker::issue_caps_set(std::set<CInode*, std::less<CInode*>,
>> std::allocator<CInode*> >&)+0x26) [0x5b3ff6]
>>  12: (Locker::file_update_finish(CInode*, Mutation*, bool, client_t,
>> Capability*, MClientCaps*)+0x665) [0x5c2a45]
>>  13: (C_Locker_FileUpdate_finish::finish(int)+0x2c) [0x5d6b1c]
>>  14: (finish_contexts(CephContext*, std::list<Context*,
>> std::allocator<Context*> >&, int)+0xd0) [0x591250]
>>  15: (Journaler::_finish_flush(int, unsigned long, utime_t)+0x20e) [0x6ac6be]
>>  16: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xa8e) [0x69290e]
>>  17: (MDS::handle_core_message(Message*)+0x86f) [0x49e67f]
>>  18: (MDS::_dispatch(Message*)+0x2aa) [0x49e99a]
>>  19: (MDS::ms_dispatch(Message*)+0x6d) [0x4a060d]
>>  20: (SimpleMessenger::dispatch_entry()+0x8a1) [0x6f45b1]
>>  21: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4806ac]
>>  22: (()+0x77e1) [0x7f27d354a7e1]
>>  23: (clone()+0x6d) [0x7f27d2b048ed]
>>
>>
>>
>> 2011-07-22 23:18:55.727738   mds e196: 1/1/1 up
>> {0=ceph3=up:active(laggy or crashed)}, 1 up:standby
>> 2011-07-22 23:18:55.742299   osd e140: 3 osds: 3 up, 3 in
>> 2011-07-22 23:18:55.743943   mds e197: 1/1/1 up {0=ceph3=up:replay}
>> 2011-07-22 23:18:55.758649    pg v957: 594 pgs: 594 active+clean; 3760
>> MB data, 13164 MB used, 18063 MB / 35997 MB avail
>> 2011-07-22 23:18:55.792549   log 2011-07-22 23:18:58.459865 mon0
>> 10.10.10.91:6789/0 26 : [INF] mds? 10.10.10.93:6800/4702 up:boot
>> 2011-07-22 23:18:57.304880   mds e198: 1/1/1 up {0=ceph3=up:reconnect}
>> 2011-07-22 23:18:57.367480   log 2011-07-22 23:19:00.037792 mon0
>> 10.10.10.91:6789/0 27 : [INF] mds0 10.10.10.93:6800/4702 up:reconnect
>> 2011-07-22 23:18:58.493288    pg v958: 594 pgs: 594 active+clean; 3760
>> MB data, 13164 MB used, 18063 MB / 35997 MB avail
>> 2011-07-22 23:19:01.139243    pg v959: 594 pgs: 594 active+clean; 3760
>> MB data, 13163 MB used, 18063 MB / 35997 MB avail
>> 2011-07-22 23:19:21.118647   osd e141: 3 osds: 3 up, 3 in
>> 2011-07-22 23:19:21.141306    pg v960: 594 pgs: 594 active+clean; 3760
>> MB data, 13163 MB used, 18063 MB / 35997 MB avail
>> 2011-07-22 23:19:23.488866    pg v961: 594 pgs: 594 active+clean; 3760
>> MB data, 13165 MB used, 18063 MB / 35997 MB avail
>> 2011-07-22 23:19:24.516985    pg v962: 594 pgs: 594 active+clean; 3760
>> MB data, 13167 MB used, 18063 MB / 35997 MB avail
>> 2011-07-22 23:19:46.738637   mds e199: 1/1/1 up {0=ceph3=up:rejoin}
>> 2011-07-22 23:19:46.800581   log 2011-07-22 23:19:48.667174 mon0
>> 10.10.10.91:6789/0 28 : [INF] mds0 10.10.10.93:6800/4702 up:rejoin
>> 2011-07-22 23:19:46.859582   mds e200: 1/1/1 up {0=ceph3=up:active}
>> 2011-07-22 23:19:47.814926   log 2011-07-22 23:19:48.787984 mon0
>> 10.10.10.91:6789/0 29 : [INF] mds0 10.10.10.93:6800/4702 up:active
>> 2011-07-22 23:19:49.584346    pg v963: 594 pgs: 594 active+clean; 3761
>> MB data, 13168 MB used, 18063 MB / 35997 MB avail
>> 2011-07-22 23:19:50.598374    pg v964: 594 pgs: 594 active+clean; 3761
>> MB data, 13166 MB used, 18062 MB / 35997 MB avail
>> 2011-07-22 23:20:01.943931   mds e201: 1/1/1 up
>> {0=ceph3=up:active(laggy or crashed)}


* Re: BUG: crash in mds after recovery
  2011-07-23 22:28   ` Marcus Sorensen
@ 2011-07-25  4:31     ` Marcus Sorensen
  0 siblings, 0 replies; 5+ messages in thread
From: Marcus Sorensen @ 2011-07-25  4:31 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

I'm having trouble submitting an issue in the redmine tracker at the
moment... it just spits back 'Internal Error, see administrator'.
Below is the problem description and how to reproduce it.

Applicable files are here:
http://learnitwithme.com/share/ceph1-Locker.cc-1459-failed-assert.tar.bz2
http://learnitwithme.com/share/ceph2-Locker.cc-1459-failed-assert.tar.bz2
http://learnitwithme.com/share/ceph3-Locker.cc-1459-failed-assert.tar.bz2

***********************************

A Ceph filesystem with a snapshot crashes the mds when the active mds is
shut down and a standby takes over. This is a basic 3-node cluster with
mon, mds, and osd running on each node, along with a single client
connected. Core files, logs, and conf are included.

ceph 1:
  mkcephfs -a -c /etc/ceph/ceph.conf --mkbtrfs

ceph1, ceph2, ceph3:
  service ceph start


Client:
  mount -t ceph 10.10.10.91,10.10.10.92,10.10.10.93:/ /mnt
  cd /mnt
  rsync -a /usr/lib .
  mkdir .snap/daily
  rm -rf lib
  rsync -a /lib/modules .

ceph 2:
  [root@ceph2 ~]# ceph mds stat
  2011-07-24 21:59:19.326608 mon <- [mds,stat]
  2011-07-24 21:59:19.328114 mon0 -> 'e6: 1/1/1 up
{0=ceph2=up:active}, 2 up:standby' (0)

  [root@ceph2 ~]# service ceph stop mds
  === mds.ceph2 ===
  Stopping Ceph mds.ceph2 on ceph2...kill 2984...done

  [root@ceph2 ~]# ceph mds stat
  2011-07-24 21:59:55.377053 mon <- [mds,stat]
  2011-07-24 21:59:55.377720 mon1 -> 'e10: 1/1/1 up
{0=ceph1=up:active}, 1 up:standby' (0)

ceph 1:
  [root@ceph1 ~]# service ceph status
  === mon.ceph1 ===
  running...
  === mds.ceph1 ===
  dead.
  === osd.0 ===
  running...

  [root@ceph1 ~]# ceph mds stat
  2011-07-24 22:01:58.770243 mon <- [mds,stat]
  2011-07-24 22:01:58.771013 mon0 -> 'e19: 1/1/1 up
{0=ceph3=up:active(laggy or crashed)}' (0)

ceph 2:
  [root@ceph2 ~]# service ceph start mds
  === mds.ceph2 ===
  Starting Ceph mds.ceph2 on ceph2...
   ** WARNING: Ceph is still under development.  Any feedback can be directed  **
   **          at ceph-devel@vger.kernel.org or http://ceph.newdream.net/.     **
  starting mds.ceph2 at 0.0.0.0:6800/3462

  [root@ceph2 ~]# ceph mds stat
  2011-07-24 22:04:15.111478 mon <- [mds,stat]
  2011-07-24 22:04:15.112092 mon1 -> 'e29: 1/1/1 up
{0=ceph2=up:active(laggy or crashed)}' (0)




On Sat, Jul 23, 2011 at 4:28 PM, Marcus Sorensen <shadowsor@gmail.com> wrote:
> Yeah, I've got a few VMs set up to play with. Sort of painful having
> multiple VMs write twice with the hard disk journal.
>
> I'm certain now that I can duplicate this, so I'll start fresh again
> once more and keep track of the steps.
>
> Replying-all this time...
>
> On Sat, Jul 23, 2011 at 3:55 PM, Sage Weil <sage@newdream.net> wrote:
>> On Fri, 22 Jul 2011, Marcus Sorensen wrote:
>>> Should these be reported here or perhaps through the tracker?
>>
>> It would be great if you could add this to the tracker with a detailed
>> description of how you hit the bug and whatever logs you have available.
>> On-list is fine too, just a bit more work on our end.
>>
>> We haven't done a lot of MDS restart testing with snapshot workloads, so
>> this isn't terribly surprising.  I'm happy to hear you were able to work
>> around it, though!
>>
>>> Just a
>>> warning, I'm going to purposely be trying to break things and figuring
>>> out how to recover (or at least what never to do :), so now's probably
>>> a good time to set me straight on procedure.
>>>
>>> I was fiddling with my test cluster tonight and ran into this.
>>> Basically I decided to reboot the cluster, and after the reboot I had to
>>> play a bit with the journals to get things going again. I didn't
>>> realize that putting them on tmpfs would require me to touch the journal file,
>>
>> Putting the journals on tmpfs is not a good idea.  The OSD considers
>> anything journaled safe and durable, so you're exposing yourself to data
>> loss in the event of a power failure.  Hopefully this is a temporary thing
>> to simulate an SSD or NVRAM journal...? :)
>>
>> sage
>>
>>
>>> pull down the monmap, and run 'cosd -i <num> --monmap <monmap>
>>> --mkjournal' before the OSDs could come online. For some reason I had
>>> the impression it would create the journal... I did something like
>>> 'service ceph start', saw that the OSD was mounted but cosd failed,
>>> pulled the monmap, recreated the journal, then ran 'service ceph start
>>> osd.<num>'. After getting these online I noticed the mds services were
>>> crashing on all three nodes like so. Note also that there were some
>>> clients connected beforehand.
>>>
>>> [root@ceph3 ~]# 2011-07-22 23:03:57.640115 7f27d02cd700 mds-1.0
>>> handle_mds_map standby
>>> 2011-07-22 23:04:21.928388 7f27d02cd700 mds-1.0 handle_mds_map standby
>>> 2011-07-22 23:04:22.164395 7f27d02cd700 mds-1.0 handle_mds_map standby
>>> 2011-07-22 23:04:37.634939 7f27d02cd700 mds0.22 handle_mds_map i am now mds0.22
>>> 2011-07-22 23:04:37.634967 7f27d02cd700 mds0.22 handle_mds_map state
>>> change up:standby --> up:replay
>>> 2011-07-22 23:04:37.634975 7f27d02cd700 mds0.22 replay_start
>>> 2011-07-22 23:04:37.634984 7f27d02cd700 mds0.22  recovery set is
>>> 2011-07-22 23:04:37.634990 7f27d02cd700 mds0.22  need osdmap epoch 131, have 128
>>> 2011-07-22 23:04:37.634995 7f27d02cd700 mds0.22  waiting for osdmap
>>> 131 (which blacklists prior instance)
>>> 2011-07-22 23:04:37.635027 7f27d02cd700 mds0.cache handle_mds_failure
>>> mds0 : recovery peers are
>>> 2011-07-22 23:04:37.644053 7f27d02cd700 mds0.22 ms_handle_connect on
>>> 10.10.10.93:6801/3468
>>> 2011-07-22 23:04:37.649460 7f27d02cd700 mds0.22 ms_handle_connect on
>>> 10.10.10.91:6801/4314
>>> 2011-07-22 23:04:37.654568 7f27d02cd700 mds0.22 ms_handle_connect on
>>> 10.10.10.92:6803/4193
>>> 2011-07-22 23:04:37.725419 7f27d02cd700 mds0.cache creating system
>>> inode with ino:100
>>> 2011-07-22 23:04:37.725542 7f27d02cd700 mds0.cache creating system
>>> inode with ino:1
>>> 2011-07-22 23:04:39.223574 7f27cdbc2700 mds0.22 replay_done
>>> 2011-07-22 23:04:39.223636 7f27cdbc2700 mds0.22 making mds journal writeable
>>> 2011-07-22 23:04:39.239472 7f27d02cd700 mds0.22 handle_mds_map i am now mds0.22
>>> 2011-07-22 23:04:39.239768 7f27d02cd700 mds0.22 handle_mds_map state
>>> change up:replay --> up:reconnect
>>> 2011-07-22 23:04:39.239981 7f27d02cd700 mds0.22 reconnect_start
>>> 2011-07-22 23:04:39.240187 7f27d02cd700 mds0.22 reopen_log
>>> 2011-07-22 23:04:39.240399 7f27d02cd700 mds0.server reconnect_clients
>>> -- 3 sessions
>>> 2011-07-22 23:05:27.040962 7f27cf0c9700 mds0.server reconnect gave up
>>> on client4108 10.10.10.99:0/1929168255
>>> 2011-07-22 23:05:27.041021 7f27cf0c9700 mds0.server reconnect gave up
>>> on client4906 10.10.10.99:0/3243338282
>>> 2011-07-22 23:05:27.041038 7f27cf0c9700 mds0.server reconnect gave up
>>> on client5114 10.10.10.99:0/2910973772
>>> 2011-07-22 23:05:27.041104 7f27cf0c9700 mds0.22 reconnect_done
>>> 2011-07-22 23:05:27.061458 7f27d02cd700 mds0.22 handle_mds_map i am now mds0.22
>>> 2011-07-22 23:05:27.061480 7f27d02cd700 mds0.22 handle_mds_map state
>>> change up:reconnect --> up:rejoin
>>> 2011-07-22 23:05:27.061490 7f27d02cd700 mds0.22 rejoin_joint_start
>>> 2011-07-22 23:05:27.161091 7f27d02cd700 mds0.22 rejoin_done
>>> 2011-07-22 23:05:27.181033 7f27d02cd700 mds0.22 handle_mds_map i am now mds0.22
>>> 2011-07-22 23:05:27.181092 7f27d02cd700 mds0.22 handle_mds_map state
>>> change up:rejoin --> up:active
>>> 2011-07-22 23:05:27.181107 7f27d02cd700 mds0.22 recovery_done --
>>> successful recovery!
>>> 2011-07-22 23:05:27.188878 7f27d02cd700 mds0.22 active_start
>>> 2011-07-22 23:05:27.191328 7f27d02cd700 mds0.22 cluster recovered.
>>> mds/Locker.cc: In function 'bool Locker::issue_caps(CInode*,
>>> Capability*)', in thread '0x7f27d02cd700'
>>> mds/Locker.cc: 1459: FAILED assert(in->is_head())
>>>  ceph version 0.31 (commit:9019c6ce64053ad515a493e912e2e63ba9b8e278)
>>>  1: (Locker::issue_caps(CInode*, Capability*)+0x1a0c) [0x5b25dc]
>>>  2: (Locker::issue_caps_set(std::set<CInode*, std::less<CInode*>,
>>> std::allocator<CInode*> >&)+0x26) [0x5b3ff6]
>>>  3: (Locker::file_update_finish(CInode*, Mutation*, bool, client_t,
>>> Capability*, MClientCaps*)+0x665) [0x5c2a45]
>>>  4: (C_Locker_FileUpdate_finish::finish(int)+0x2c) [0x5d6b1c]
>>>  5: (finish_contexts(CephContext*, std::list<Context*,
>>> std::allocator<Context*> >&, int)+0xd0) [0x591250]
>>>  6: (Journaler::_finish_flush(int, unsigned long, utime_t)+0x20e) [0x6ac6be]
>>>  7: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xa8e) [0x69290e]
>>>  8: (MDS::handle_core_message(Message*)+0x86f) [0x49e67f]
>>>  9: (MDS::_dispatch(Message*)+0x2aa) [0x49e99a]
>>>  10: (MDS::ms_dispatch(Message*)+0x6d) [0x4a060d]
>>>  11: (SimpleMessenger::dispatch_entry()+0x8a1) [0x6f45b1]
>>>  12: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4806ac]
>>>  13: (()+0x77e1) [0x7f27d354a7e1]
>>>  14: (clone()+0x6d) [0x7f27d2b048ed]
>>>  ceph version 0.31 (commit:9019c6ce64053ad515a493e912e2e63ba9b8e278)
>>>  1: (Locker::issue_caps(CInode*, Capability*)+0x1a0c) [0x5b25dc]
>>>  2: (Locker::issue_caps_set(std::set<CInode*, std::less<CInode*>,
>>> std::allocator<CInode*> >&)+0x26) [0x5b3ff6]
>>>  3: (Locker::file_update_finish(CInode*, Mutation*, bool, client_t,
>>> Capability*, MClientCaps*)+0x665) [0x5c2a45]
>>>  4: (C_Locker_FileUpdate_finish::finish(int)+0x2c) [0x5d6b1c]
>>>  5: (finish_contexts(CephContext*, std::list<Context*,
>>> std::allocator<Context*> >&, int)+0xd0) [0x591250]
>>>  6: (Journaler::_finish_flush(int, unsigned long, utime_t)+0x20e) [0x6ac6be]
>>>  7: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xa8e) [0x69290e]
>>>  8: (MDS::handle_core_message(Message*)+0x86f) [0x49e67f]
>>>  9: (MDS::_dispatch(Message*)+0x2aa) [0x49e99a]
>>>  10: (MDS::ms_dispatch(Message*)+0x6d) [0x4a060d]
>>>  11: (SimpleMessenger::dispatch_entry()+0x8a1) [0x6f45b1]
>>>  12: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4806ac]
>>>  13: (()+0x77e1) [0x7f27d354a7e1]
>>>  14: (clone()+0x6d) [0x7f27d2b048ed]
>>> *** Caught signal (Aborted) **
>>>  in thread 0x7f27d02cd700
>>>  ceph version 0.31 (commit:9019c6ce64053ad515a493e912e2e63ba9b8e278)
>>>  1: /usr/bin/cmds() [0x747ea9]
>>>  2: (()+0xf4c0) [0x7f27d35524c0]
>>>  3: (gsignal()+0x35) [0x7f27d2a559a5]
>>>  4: (abort()+0x175) [0x7f27d2a57185]
>>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7f27d32faaad]
>>>  6: (()+0xbcc36) [0x7f27d32f8c36]
>>>  7: (()+0xbcc63) [0x7f27d32f8c63]
>>>  8: (()+0xbcd5e) [0x7f27d32f8d5e]
>>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x3a7) [0x6c6a47]
>>>  10: (Locker::issue_caps(CInode*, Capability*)+0x1a0c) [0x5b25dc]
>>>  11: (Locker::issue_caps_set(std::set<CInode*, std::less<CInode*>,
>>> std::allocator<CInode*> >&)+0x26) [0x5b3ff6]
>>>  12: (Locker::file_update_finish(CInode*, Mutation*, bool, client_t,
>>> Capability*, MClientCaps*)+0x665) [0x5c2a45]
>>>  13: (C_Locker_FileUpdate_finish::finish(int)+0x2c) [0x5d6b1c]
>>>  14: (finish_contexts(CephContext*, std::list<Context*,
>>> std::allocator<Context*> >&, int)+0xd0) [0x591250]
>>>  15: (Journaler::_finish_flush(int, unsigned long, utime_t)+0x20e) [0x6ac6be]
>>>  16: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xa8e) [0x69290e]
>>>  17: (MDS::handle_core_message(Message*)+0x86f) [0x49e67f]
>>>  18: (MDS::_dispatch(Message*)+0x2aa) [0x49e99a]
>>>  19: (MDS::ms_dispatch(Message*)+0x6d) [0x4a060d]
>>>  20: (SimpleMessenger::dispatch_entry()+0x8a1) [0x6f45b1]
>>>  21: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x4806ac]
>>>  22: (()+0x77e1) [0x7f27d354a7e1]
>>>  23: (clone()+0x6d) [0x7f27d2b048ed]
>>>
>>>
>>>
>>> 2011-07-22 23:18:55.727738   mds e196: 1/1/1 up
>>> {0=ceph3=up:active(laggy or crashed)}, 1 up:standby
>>> 2011-07-22 23:18:55.742299   osd e140: 3 osds: 3 up, 3 in
>>> 2011-07-22 23:18:55.743943   mds e197: 1/1/1 up {0=ceph3=up:replay}
>>> 2011-07-22 23:18:55.758649    pg v957: 594 pgs: 594 active+clean; 3760
>>> MB data, 13164 MB used, 18063 MB / 35997 MB avail
>>> 2011-07-22 23:18:55.792549   log 2011-07-22 23:18:58.459865 mon0
>>> 10.10.10.91:6789/0 26 : [INF] mds? 10.10.10.93:6800/4702 up:boot
>>> 2011-07-22 23:18:57.304880   mds e198: 1/1/1 up {0=ceph3=up:reconnect}
>>> 2011-07-22 23:18:57.367480   log 2011-07-22 23:19:00.037792 mon0
>>> 10.10.10.91:6789/0 27 : [INF] mds0 10.10.10.93:6800/4702 up:reconnect
>>> 2011-07-22 23:18:58.493288    pg v958: 594 pgs: 594 active+clean; 3760
>>> MB data, 13164 MB used, 18063 MB / 35997 MB avail
>>> 2011-07-22 23:19:01.139243    pg v959: 594 pgs: 594 active+clean; 3760
>>> MB data, 13163 MB used, 18063 MB / 35997 MB avail
>>> 2011-07-22 23:19:21.118647   osd e141: 3 osds: 3 up, 3 in
>>> 2011-07-22 23:19:21.141306    pg v960: 594 pgs: 594 active+clean; 3760
>>> MB data, 13163 MB used, 18063 MB / 35997 MB avail
>>> 2011-07-22 23:19:23.488866    pg v961: 594 pgs: 594 active+clean; 3760
>>> MB data, 13165 MB used, 18063 MB / 35997 MB avail
>>> 2011-07-22 23:19:24.516985    pg v962: 594 pgs: 594 active+clean; 3760
>>> MB data, 13167 MB used, 18063 MB / 35997 MB avail
>>> 2011-07-22 23:19:46.738637   mds e199: 1/1/1 up {0=ceph3=up:rejoin}
>>> 2011-07-22 23:19:46.800581   log 2011-07-22 23:19:48.667174 mon0
>>> 10.10.10.91:6789/0 28 : [INF] mds0 10.10.10.93:6800/4702 up:rejoin
>>> 2011-07-22 23:19:46.859582   mds e200: 1/1/1 up {0=ceph3=up:active}
>>> 2011-07-22 23:19:47.814926   log 2011-07-22 23:19:48.787984 mon0
>>> 10.10.10.91:6789/0 29 : [INF] mds0 10.10.10.93:6800/4702 up:active
>>> 2011-07-22 23:19:49.584346    pg v963: 594 pgs: 594 active+clean; 3761
>>> MB data, 13168 MB used, 18063 MB / 35997 MB avail
>>> 2011-07-22 23:19:50.598374    pg v964: 594 pgs: 594 active+clean; 3761
>>> MB data, 13166 MB used, 18062 MB / 35997 MB avail
>>> 2011-07-22 23:20:01.943931   mds e201: 1/1/1 up
>>> {0=ceph3=up:active(laggy or crashed)}

