From: Yann Dupont
Subject: Re: domino-style OSD crash
Date: Tue, 03 Jul 2012 10:40:11 +0200
To: Sam Just
Cc: Tommi Virtanen, ceph-devel

On 04/06/2012 19:40, Sam Just wrote:
> Can you send the osd logs? The merge_log crashes are probably fixable
> if I can see the logs.
>

Well, I'm sorry - as I said in a private mail, I was away from my computer
for a long time. I can't send those logs anymore; they have been rotated
by now.

Anyway, now that I'm back, I picked up where I stopped and tried to
restart the failed nodes.

I upgraded the kernel to 3.5.0-rc4 + some patches; btrfs seems OK right
now.

I tried to restart the osds with 0.47.3, then the next branch, and today
with 0.48.

4 of 8 nodes fail with the same message:

ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
 1: /usr/bin/ceph-osd() [0x701929]
 2: (()+0xf030) [0x7fe5b4777030]
 3: (gsignal()+0x35) [0x7fe5b33fc4f5]
 4: (abort()+0x180) [0x7fe5b33ff770]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fe5b3c4f68d]
 6: (()+0x63796) [0x7fe5b3c4d796]
 7: (()+0x637c3) [0x7fe5b3c4d7c3]
 8: (()+0x639ee) [0x7fe5b3c4d9ee]
 9: (std::__throw_length_error(char const*)+0x5d) [0x7fe5b3c9f5ed]
 10: (()+0xbfad2) [0x7fe5b3ca9ad2]
 11: (char* std::string::_S_construct<char const*>(char const*, char
 const*, std::allocator<char> const&, std::forward_iterator_tag)+0x35)
 [0x7fe5b3cab4a5]
 12: (std::basic_string<char, std::char_traits<char>,
 std::allocator<char> >::basic_string(char const*, unsigned long,
 std::allocator<char> const&)+0x1d) [0x7fe5b3cab5bd]
 13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
 leveldb::Slice const&) const+0x4d) [0x6e811d]
 14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice
 const&)+0x9f) [0x6f681f]
 15: (leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)+0x4d3)
 [0x6e3643]
 16: (leveldb::DBImpl::BackgroundCompaction()+0x222) [0x6e45a2]
 17: (leveldb::DBImpl::BackgroundCall()+0x68) [0x6e4e18]
 18: /usr/bin/ceph-osd() [0x6fd401]
 19: (()+0x6b50) [0x7fe5b476eb50]
 20: (clone()+0x6d) [0x7fe5b34a278d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
 needed to interpret this.

ceph-osd is from the debian package (64 bits).

I have a core dump, but I'm afraid it won't help much:

gdb /usr/bin/ceph-osd core
GNU gdb (GDB) 7.0.1-debian
....
Core was generated by `/usr/bin/ceph-osd -i 2 --pid-file
/var/run/ceph/osd.2.pid -c /etc/ceph/ceph.con'.
Program terminated with signal 6, Aborted.
---Type <return> to continue, or q <return> to quit---
#0  0x00007fe5b4776efb in raise () from
/lib/x86_64-linux-gnu/libpthread.so.0

This time I REALLY CAN (knock on wood) furnish logs & core.

Granted, this crash was very probably caused by corruption on btrfs, but
it would be great if there were a way to recover the crashed osd node.
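In case it helps, a fuller backtrace can usually be pulled out of that
core once the matching debug symbols are loaded; a minimal sketch below,
assuming a ceph-dbg debug-symbols package exists for this build (the
package name is an assumption and may differ):

  # install debug symbols matching the exact ceph version (package name
  # is an assumption for this Debian build)
  apt-get install ceph-dbg

  # dump backtraces for every thread from the core, non-interactively
  gdb -batch \
      -ex "set pagination off" \
      -ex "thread apply all bt" \
      /usr/bin/ceph-osd core > ceph-osd-backtrace.txt

That should show which thread actually hit the abort (the leveldb
compaction thread, going by the crash message above) rather than just
the raise() frame gdb stops on interactively.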
Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr