* domino-style OSD crash
@ 2012-06-04 8:44 Yann Dupont
2012-06-04 16:16 ` Tommi Virtanen
0 siblings, 1 reply; 25+ messages in thread
From: Yann Dupont @ 2012-06-04 8:44 UTC (permalink / raw)
To: ceph-devel
Hello,
Besides the performance inconsistency (see the other thread titled 'poor OSD
performance using kernel 3.4'), for which I promised some tests (will run this
afternoon), we tried this week-end to stress-test ceph, making backups
with bacula on a 15T rbd volume (8 OSD nodes, using 8 physical machines).
Results: it worked like a charm for two days, apart from btrfs warning
messages; then the OSDs began to crash one after another, 'domino style'.
This morning, only 2 OSDs of 8 are left.
One of the physical machines was in a kernel oops state. Nothing was
remote-logged, so I don't know what happened; there was no clear stack
message. I suspect btrfs, but I have no proof.
This node (OSD.7) seems to have been the first one to crash; it triggered
reconstruction between OSDs and then led to the cascading OSD crashes.
The other physical machines are still up, but with no OSD running. Here
are some traces found in the OSD logs:
-3> 2012-06-03 12:43:32.524671 7ff1352b8700 0 log [WRN] : slow request 30.506952 seconds old, received at 2012-06-03 12:43:01.997386: osd_sub_op(osd.0.0:1842628 2.57 ea8d5657/label5_17606_object7068/head [push] v 191'628 snapset=0=[]:[] snapc=0=[]) v6 currently queued for pg
-2> 2012-06-03 12:44:32.869852 7ff1352b8700 0 log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.073136 secs
-1> 2012-06-03 12:44:32.869886 7ff1352b8700 0 log [WRN] : slow request 30.073136 seconds old, received at 2012-06-03 12:44:02.796651: osd_sub_op(osd.6.0:1837430 2.59 97e62059/rb.0.1.0000000a2cdf/head [push] v 1438'9416 snapset=0=[]:[] snapc=0=[]) v6 currently started
0> 2012-06-03 12:55:33.088034 7ff1237f6700 -1 *** Caught signal
(Aborted) **
in thread 7ff1237f6700
ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
1: /usr/bin/ceph-osd() [0x708ea9]
2: (()+0xeff0) [0x7ff13af2cff0]
3: (gsignal()+0x35) [0x7ff13950b1b5]
4: (abort()+0x180) [0x7ff13950dfc0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7ff139d9fdc5]
6: (()+0xcb166) [0x7ff139d9e166]
7: (()+0xcb193) [0x7ff139d9e193]
8: (()+0xcb28e) [0x7ff139d9e28e]
9: (std::__throw_length_error(char const*)+0x67) [0x7ff139d39307]
10: (std::string::_Rep::_S_create(unsigned long, unsigned long,
std::allocator<char> const&)+0x72) [0x7ff139d7ab42]
11: (()+0xa8565) [0x7ff139d7b565]
12: (std::basic_string<char, std::char_traits<char>,
std::allocator<char> >::basic_string(char const*, unsigned long,
std::allocator<char> const&)+0x1b) [0x7ff139d7b7ab]
13:
(leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
leveldb::Slice const&) const+0x4d) [0x6ef69d]
14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice
const&)+0x9f) [0x6fdd9f]
15:
(leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)+0x4d3)
[0x6eaba3]
16: (leveldb::DBImpl::BackgroundCompaction()+0x222) [0x6ebb02]
17: (leveldb::DBImpl::BackgroundCall()+0x68) [0x6ec378]
18: /usr/bin/ceph-osd() [0x704981]
19: (()+0x68ca) [0x7ff13af248ca]
20: (clone()+0x6d) [0x7ff1395a892d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
2 OSDs exhibit similar traces.
---
4 others had traces like this one:
-5> 2012-06-03 13:31:39.393489 7f74fd9c7700 -1 osd.3 1513 heartbeat_check: no reply from osd.5 since 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:19.393488)
-4> 2012-06-03 13:31:40.393689 7f74fd9c7700 -1 osd.3 1513 heartbeat_check: no reply from osd.5 since 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:20.393687)
-3> 2012-06-03 13:31:41.402873 7f74fd9c7700 -1 osd.3 1513 heartbeat_check: no reply from osd.5 since 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:21.402872)
-2> 2012-06-03 13:31:42.363270 7f74f08ac700 -1 osd.3 1513 heartbeat_check: no reply from osd.5 since 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:22.363269)
-1> 2012-06-03 13:31:42.416968 7f74fd9c7700 -1 osd.3 1513 heartbeat_check: no reply from osd.5 since 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:22.416966)
0> 2012-06-03 13:36:48.147020 7f74f58b6700 -1 osd/PG.cc: In
function 'void PG::merge_log(ObjectStore::Transaction&, pg_info_t&,
pg_log_t&, int)' thread 7f74f58b6700 time 2012-06-03 13:36:48.100157
osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >=
log.tail)
ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
1: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&,
int)+0x1eae) [0x649cce]
2: (PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec
const&)+0x2b1) [0x649fc1]
3: (boost::statechart::simple_state<PG::RecoveryState::Stray,
PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0x203) [0x660343]
4:
(boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
PG::RecoveryState::Initial, std::allocator<void>,
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
const&)+0x6b) [0x6580eb]
5: (PG::RecoveryState::handle_log(int, MOSDPGLog*,
PG::RecoveryCtx*)+0x190) [0x6139d0]
6: (OSD::handle_pg_log(std::tr1::shared_ptr<OpRequest>)+0x666) [0x5cec66]
7: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0x11b) [0x5d312b]
8: (OSD::_dispatch(Message*)+0x173) [0x5dc273]
9: (OSD::ms_dispatch(Message*)+0x1e7) [0x5dcba7]
10: (SimpleMessenger::dispatch_entry()+0x979) [0x7d60a9]
11: (SimpleMessenger::DispatchThread::entry()+0xd) [0x72781d]
12: (()+0x68ca) [0x7f75036338ca]
13: (clone()+0x6d) [0x7f7501cb792d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- end dump of recent events ---
2012-06-03 13:36:48.487021 7f74f58b6700 -1 *** Caught signal (Aborted) **
in thread 7f74f58b6700
ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
1: /usr/bin/ceph-osd() [0x708ea9]
2: (()+0xeff0) [0x7f750363bff0]
3: (gsignal()+0x35) [0x7f7501c1a1b5]
4: (abort()+0x180) [0x7f7501c1cfc0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f75024aedc5]
6: (()+0xcb166) [0x7f75024ad166]
7: (()+0xcb193) [0x7f75024ad193]
8: (()+0xcb28e) [0x7f75024ad28e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x940) [0x77d550]
10: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&,
int)+0x1eae) [0x649cce]
11: (PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec
const&)+0x2b1) [0x649fc1]
12: (boost::statechart::simple_state<PG::RecoveryState::Stray,
PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0x203) [0x660343]
13:
(boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
PG::RecoveryState::Initial, std::allocator<void>,
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
const&)+0x6b) [0x6580eb]
14: (PG::RecoveryState::handle_log(int, MOSDPGLog*,
PG::RecoveryCtx*)+0x190) [0x6139d0]
15: (OSD::handle_pg_log(std::tr1::shared_ptr<OpRequest>)+0x666) [0x5cec66]
16: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0x11b) [0x5d312b]
17: (OSD::_dispatch(Message*)+0x173) [0x5dc273]
18: (OSD::ms_dispatch(Message*)+0x1e7) [0x5dcba7]
19: (SimpleMessenger::dispatch_entry()+0x979) [0x7d60a9]
20: (SimpleMessenger::DispatchThread::entry()+0xd) [0x72781d]
21: (()+0x68ca) [0x7f75036338ca]
22: (clone()+0x6d) [0x7f7501cb792d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
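[Editor's note: two of the checks visible in the traces above are worth unpacking. heartbeat_check fires when the newest reply from a peer is older than a cutoff (current time minus a grace period), and the failed assert in PG::merge_log requires the local and incoming PG logs to overlap. A minimal sketch of both tests — all names, and the (epoch, version) tuple encoding, are assumptions for illustration, not Ceph's actual types:]

```python
def heartbeat_stale(last_reply, now, grace):
    """heartbeat_check logic: a peer is suspect when its last reply
    is older than the cutoff (current time minus the grace period)."""
    cutoff = now - grace
    return last_reply < cutoff

def logs_overlap(log_tail, log_head, olog_tail, olog_head):
    """The condition asserted in PG::merge_log: the two log intervals
    must share at least one point, or they cannot be merged."""
    return log_head >= olog_tail and olog_head >= log_tail

# The heartbeat lines above: last reply at 13:31:18.46, cutoff at
# 13:31:19.39, i.e. a ~20-second grace window behind the current time.
print(heartbeat_stale(last_reply=18.46, now=39.39, grace=20.0))  # prints True
```

[A stray OSD whose log no longer overlaps the incoming one — for example after corrupted state was written and recovery proceeded elsewhere — trips exactly this assert, which matches the cascade described above.]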
The root cause could be btrfs... or maybe not. I don't see any btrfs
crash/oops, just:
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479173]
Pid: 16875, comm: kworker/7:0 Tainted: G W 3.4.0-dsiun-120521 #108
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479218]
Call Trace:
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479243]
[<ffffffff81039f1b>] ? warn_slowpath_common+0x7b/0xc0
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479278]
[<ffffffffa026fca5>] ? btrfs_orphan_commit_root+0x105/0x110 [btrfs]
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479328]
[<ffffffffa026965a>] ? commit_fs_roots.isra.22+0xaa/0x170 [btrfs]
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479379]
[<ffffffffa02bc9a0>] ? btrfs_scrub_pause+0xf0/0x100 [btrfs]
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479415]
[<ffffffffa026a6f1>] ? btrfs_commit_transaction+0x521/0x9d0 [btrfs]
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479460]
[<ffffffff8105a9f0>] ? add_wait_queue+0x60/0x60
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479493]
[<ffffffffa026aba0>] ? btrfs_commit_transaction+0x9d0/0x9d0 [btrfs]
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479543]
[<ffffffffa026abb1>] ? do_async_commit+0x11/0x20 [btrfs]
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479572]
[<ffffffff810548a7>] ? process_one_work+0x107/0x460
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479601]
[<ffffffff81055a8e>] ? worker_thread+0x14e/0x330
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479628]
[<ffffffff81055940>] ? manage_workers.isra.28+0x210/0x210
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479657]
[<ffffffff8105a005>] ? kthread+0x85/0x90
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479684]
[<ffffffff813be3e4>] ? kernel_thread_helper+0x4/0x10
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479712]
[<ffffffff81059f80>] ? kthread_freezable_should_stop+0x60/0x60
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479741]
[<ffffffff813be3e0>] ? gs_change+0x13/0x13
Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479767]
---[ end trace 303c47aab4b5d025 ]---
Jun 3 00:44:11 chichibu.u14.univ-nantes.prive kernel: [204497.711101]
------------[ cut here ]------------
But this is just a warning (though maybe it could lead to a kernel
oops/crash). It seems to have been fixed recently in git kernels.
I can give you all 8 OSD logs, plus the MDS & MON logs, if that helps.
Cheers,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: domino-style OSD crash
2012-06-04 8:44 domino-style OSD crash Yann Dupont
@ 2012-06-04 16:16 ` Tommi Virtanen
2012-06-04 17:40 ` Sam Just
0 siblings, 1 reply; 25+ messages in thread
From: Tommi Virtanen @ 2012-06-04 16:16 UTC (permalink / raw)
To: Yann Dupont; +Cc: ceph-devel
On Mon, Jun 4, 2012 at 1:44 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> Results: it worked like a charm for two days, apart from btrfs warning
> messages; then the OSDs began to crash one after another, 'domino style'.
Sorry to hear that. Reading through your message, there seem to be
several problems; whether they are because of the same root cause, I
can't tell.
Quick triage to benefit the other devs:
#1: kernel crash, no details available
> 1 of the physical machine was in kernel oops state - Nothing was remote
#2: leveldb corruption? It may be memory corruption that started
elsewhere. Sam, does this look like the leveldb issue you saw?
> [push] v 1438'9416 snapset=0=[]:[] snapc=0=[]) v6 currently started
> 0> 2012-06-03 12:55:33.088034 7ff1237f6700 -1 *** Caught signal
> (Aborted) **
...
> 13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
> leveldb::Slice const&) const+0x4d) [0x6ef69d]
> 14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice
> const&)+0x9f) [0x6fdd9f]
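[Editor's note on frames 13-14: FindShortestSeparator tries to replace a key with a shorter one that still sorts at or after the original but strictly before the next key; the std::string it builds in frame 12 is what throws length_error when a corrupted key length asks for an absurd allocation. A rough sketch of the separator idea — an approximation, not leveldb's actual C++:]

```python
def find_shortest_separator(start: bytes, limit: bytes) -> bytes:
    """Return a short key k with start <= k < limit, used between
    SSTable blocks to keep index keys small."""
    n = min(len(start), len(limit))
    i = 0
    while i < n and start[i] == limit[i]:
        i += 1  # skip the shared prefix
    if i >= n:
        return start  # one key is a prefix of the other; keep start
    c = start[i]
    if c < 0xFF and c + 1 < limit[i]:
        # bump the first differing byte and drop the tail
        return start[:i] + bytes([c + 1])
    return start

print(find_shortest_separator(b"abcdefgh", b"abzz"))  # prints b'abd'
```

[The crash is therefore not in this logic itself: the string constructor was handed a bogus length, which points at corrupted SSTable contents or memory rather than a leveldb bug.]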
#3: PG::merge_log assertion while recovering from the above; Sam, any ideas?
> 0> 2012-06-03 13:36:48.147020 7f74f58b6700 -1 osd/PG.cc: In function
> 'void PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, int)'
> thread 7f74f58b6700 time 2012-06-03 13:36:48.100157
> osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >=
> log.tail)
#4: unknown btrfs warnings; there should be an actual message above this
traceback. Believed fixed in the latest kernel.
> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479278]
> [<ffffffffa026fca5>] ? btrfs_orphan_commit_root+0x105/0x110 [btrfs]
> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479328]
> [<ffffffffa026965a>] ? commit_fs_roots.isra.22+0xaa/0x170 [btrfs]
> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479379]
> [<ffffffffa02bc9a0>] ? btrfs_scrub_pause+0xf0/0x100 [btrfs]
> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479415]
> [<ffffffffa026a6f1>] ? btrfs_commit_transaction+0x521/0x9d0 [btrfs]
> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479460]
> [<ffffffff8105a9f0>] ? add_wait_queue+0x60/0x60
> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479493]
> [<ffffffffa026aba0>] ? btrfs_commit_transaction+0x9d0/0x9d0 [btrfs]
> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479543]
> [<ffffffffa026abb1>] ? do_async_commit+0x11/0x20 [btrfs]
> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479572]
* Re: domino-style OSD crash
2012-06-04 16:16 ` Tommi Virtanen
@ 2012-06-04 17:40 ` Sam Just
2012-06-04 18:34 ` Greg Farnum
2012-07-03 8:40 ` Yann Dupont
0 siblings, 2 replies; 25+ messages in thread
From: Sam Just @ 2012-06-04 17:40 UTC (permalink / raw)
To: Tommi Virtanen; +Cc: Yann Dupont, ceph-devel
Can you send the osd logs? The merge_log crashes are probably fixable
if I can see the logs.
The leveldb crash is almost certainly a result of memory corruption.
Thanks
-Sam
On Mon, Jun 4, 2012 at 9:16 AM, Tommi Virtanen <tv@inktank.com> wrote:
> On Mon, Jun 4, 2012 at 1:44 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>> Results: it worked like a charm for two days, apart from btrfs warning
>> messages; then the OSDs began to crash one after another, 'domino style'.
>
> Sorry to hear that. Reading through your message, there seem to be
> several problems; whether they are because of the same root cause, I
> can't tell.
>
> Quick triage to benefit the other devs:
>
> #1: kernel crash, no details available
>> 1 of the physical machine was in kernel oops state - Nothing was remote
>
> #2: leveldb corruption? It may be memory corruption that started
> elsewhere. Sam, does this look like the leveldb issue you saw?
>> [push] v 1438'9416 snapset=0=[]:[] snapc=0=[]) v6 currently started
>> 0> 2012-06-03 12:55:33.088034 7ff1237f6700 -1 *** Caught signal
>> (Aborted) **
> ...
>> 13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
>> leveldb::Slice const&) const+0x4d) [0x6ef69d]
>> 14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice
>> const&)+0x9f) [0x6fdd9f]
>
> #3: PG::merge_log assertion while recovering from the above; Sam, any ideas?
>> 0> 2012-06-03 13:36:48.147020 7f74f58b6700 -1 osd/PG.cc: In function
>> 'void PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, int)'
>> thread 7f74f58b6700 time 2012-06-03 13:36:48.100157
>> osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >=
>> log.tail)
>
> #4: unknown btrfs warnings; there should be an actual message above this
> traceback. Believed fixed in the latest kernel.
>> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479278]
>> [<ffffffffa026fca5>] ? btrfs_orphan_commit_root+0x105/0x110 [btrfs]
>> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479328]
>> [<ffffffffa026965a>] ? commit_fs_roots.isra.22+0xaa/0x170 [btrfs]
>> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479379]
>> [<ffffffffa02bc9a0>] ? btrfs_scrub_pause+0xf0/0x100 [btrfs]
>> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479415]
>> [<ffffffffa026a6f1>] ? btrfs_commit_transaction+0x521/0x9d0 [btrfs]
>> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479460]
>> [<ffffffff8105a9f0>] ? add_wait_queue+0x60/0x60
>> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479493]
>> [<ffffffffa026aba0>] ? btrfs_commit_transaction+0x9d0/0x9d0 [btrfs]
>> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479543]
>> [<ffffffffa026abb1>] ? do_async_commit+0x11/0x20 [btrfs]
>> Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479572]
* Re: domino-style OSD crash
2012-06-04 17:40 ` Sam Just
@ 2012-06-04 18:34 ` Greg Farnum
2012-07-03 8:40 ` Yann Dupont
1 sibling, 0 replies; 25+ messages in thread
From: Greg Farnum @ 2012-06-04 18:34 UTC (permalink / raw)
To: Sam Just; +Cc: Tommi Virtanen, Yann Dupont, ceph-devel
This is probably the same/similar to http://tracker.newdream.net/issues/2462, no? There's a log there, though I've no idea how helpful it is.
On Monday, June 4, 2012 at 10:40 AM, Sam Just wrote:
> Can you send the osd logs? The merge_log crashes are probably fixable
> if I can see the logs.
>
> The leveldb crash is almost certainly a result of memory corruption.
>
> Thanks
> -Sam
>
On Mon, Jun 4, 2012 at 9:16 AM, Tommi Virtanen <tv@inktank.com> wrote:
> On Mon, Jun 4, 2012 at 1:44 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> > > Results: it worked like a charm for two days, apart from btrfs warning
> > > messages; then the OSDs began to crash one after another, 'domino style'.
> >
> >
> >
> > Sorry to hear that. Reading through your message, there seem to be
> > several problems; whether they are because of the same root cause, I
> > can't tell.
> >
> > Quick triage to benefit the other devs:
> >
> > #1: kernel crash, no details available
> > > 1 of the physical machine was in kernel oops state - Nothing was remote
> >
> >
> >
> > #2: leveldb corruption? It may be memory corruption that started
> > elsewhere. Sam, does this look like the leveldb issue you saw?
> > > [push] v 1438'9416 snapset=0=[]:[] snapc=0=[]) v6 currently started
> > > 0> 2012-06-03 12:55:33.088034 7ff1237f6700 -1 *** Caught signal
> > > (Aborted) **
> >
> >
> > ...
> > > 13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
> > > leveldb::Slice const&) const+0x4d) [0x6ef69d]
> > > 14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice
> > > const&)+0x9f) [0x6fdd9f]
> >
> >
> >
> > #3: PG::merge_log assertion while recovering from the above; Sam, any ideas?
> > > 0> 2012-06-03 13:36:48.147020 7f74f58b6700 -1 osd/PG.cc: In function
> > > 'void PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, int)'
> > > thread 7f74f58b6700 time 2012-06-03 13:36:48.100157
> > > osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >=
> > > log.tail)
> >
> >
> >
> > #4: unknown btrfs warnings; there should be an actual message above this
> > traceback. Believed fixed in the latest kernel.
> > > Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479278]
> > > [<ffffffffa026fca5>] ? btrfs_orphan_commit_root+0x105/0x110 [btrfs]
> > > Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479328]
> > > [<ffffffffa026965a>] ? commit_fs_roots.isra.22+0xaa/0x170 [btrfs]
> > > Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479379]
> > > [<ffffffffa02bc9a0>] ? btrfs_scrub_pause+0xf0/0x100 [btrfs]
> > > Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479415]
> > > [<ffffffffa026a6f1>] ? btrfs_commit_transaction+0x521/0x9d0 [btrfs]
> > > Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479460]
> > > [<ffffffff8105a9f0>] ? add_wait_queue+0x60/0x60
> > > Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479493]
> > > [<ffffffffa026aba0>] ? btrfs_commit_transaction+0x9d0/0x9d0 [btrfs]
> > > Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479543]
> > > [<ffffffffa026abb1>] ? do_async_commit+0x11/0x20 [btrfs]
> > > Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479572]
* Re: domino-style OSD crash
2012-06-04 17:40 ` Sam Just
2012-06-04 18:34 ` Greg Farnum
@ 2012-07-03 8:40 ` Yann Dupont
2012-07-03 19:42 ` Tommi Virtanen
1 sibling, 1 reply; 25+ messages in thread
From: Yann Dupont @ 2012-07-03 8:40 UTC (permalink / raw)
To: Sam Just; +Cc: Tommi Virtanen, ceph-devel
On 04/06/2012 19:40, Sam Just wrote:
> Can you send the osd logs? The merge_log crashes are probably fixable
> if I can see the logs.
>
Well, I'm sorry. As I said in a private mail, I was away from the computer
for a long time.
I can't send those logs anymore; they have been rotated away now...
Anyway, now that I'm back, I picked up where I stopped and tried
to restart the failed nodes.
I upgraded the kernel to 3.5.0-rc4 plus some patches; btrfs seems OK right
now.
I tried to restart the OSDs with 0.47.3, then the next branch, and today with 0.48.
4 of the 8 nodes fail with the same message:
ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
1: /usr/bin/ceph-osd() [0x701929]
2: (()+0xf030) [0x7fe5b4777030]
3: (gsignal()+0x35) [0x7fe5b33fc4f5]
4: (abort()+0x180) [0x7fe5b33ff770]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fe5b3c4f68d]
6: (()+0x63796) [0x7fe5b3c4d796]
7: (()+0x637c3) [0x7fe5b3c4d7c3]
8: (()+0x639ee) [0x7fe5b3c4d9ee]
9: (std::__throw_length_error(char const*)+0x5d) [0x7fe5b3c9f5ed]
10: (()+0xbfad2) [0x7fe5b3ca9ad2]
11: (char* std::string::_S_construct<char const*>(char const*, char
const*, std::allocator<char> const&, std::forward_iterator_tag)+0x35)
[0x7fe5b3cab4a5]
12: (std::basic_string<char, std::char_traits<char>,
std::allocator<char> >::basic_string(char const*, unsigned long,
std::allocator<char> const&)+0x1d) [0x7fe5b3cab5bd]
13:
(leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
leveldb::Slice const&) const+0x4d) [0x6e811d]
14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice
const&)+0x9f) [0x6f681f]
15:
(leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)+0x4d3)
[0x6e3643]
16: (leveldb::DBImpl::BackgroundCompaction()+0x222) [0x6e45a2]
17: (leveldb::DBImpl::BackgroundCall()+0x68) [0x6e4e18]
18: /usr/bin/ceph-osd() [0x6fd401]
19: (()+0x6b50) [0x7fe5b476eb50]
20: (clone()+0x6d) [0x7fe5b34a278d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
ceph-osd is from the Debian package (64-bit).
I have a core dump, but I'm afraid it won't help much:
gdb /usr/bin/ceph-osd core
GNU gdb (GDB) 7.0.1-debian
....
Core was generated by `/usr/bin/ceph-osd -i 2 --pid-file
/var/run/ceph/osd.2.pid -c /etc/ceph/ceph.con'.
Program terminated with signal 6, Aborted.
#0 0x00007fe5b4776efb in raise () from
/lib/x86_64-linux-gnu/libpthread.so.0
This time I REALLY CAN (knock on wood) furnish logs & core.
Granted, this crash was very probably caused by corruption on btrfs, but
it would be great if there were a way to recover the crashed OSD node.
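[Editor's note: to get more than "Program terminated with signal 6" out of that core, gdb can be driven in batch mode. A small helper to build the invocation — assuming gdb is installed and the binary matches the core, both assumptions here:]

```python
import shlex

def gdb_backtrace_cmd(binary, core):
    # --batch makes gdb run the -ex commands and exit instead of
    # dropping into the interactive (gdb) prompt.
    return ["gdb", "--batch",
            "-ex", "thread apply all bt full",
            binary, core]

print(shlex.join(gdb_backtrace_cmd("/usr/bin/ceph-osd", "core")))
```

[Running that against the core prints every thread's stack with locals, which should at least confirm whether all crashed nodes abort in the same leveldb compaction path.]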
Cheers,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
* Re: domino-style OSD crash
2012-07-03 8:40 ` Yann Dupont
@ 2012-07-03 19:42 ` Tommi Virtanen
2012-07-03 20:54 ` Yann Dupont
0 siblings, 1 reply; 25+ messages in thread
From: Tommi Virtanen @ 2012-07-03 19:42 UTC (permalink / raw)
To: Yann Dupont; +Cc: Sam Just, ceph-devel
On Tue, Jul 3, 2012 at 1:40 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> I upgraded the kernel to 3.5.0-rc4 plus some patches; btrfs seems OK right
> now.
>
> I tried to restart the OSDs with 0.47.3, then the next branch, and today with 0.48.
>
> 4 of the 8 nodes fail with the same message:
>
> ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
> 1: /usr/bin/ceph-osd() [0x701929]
...
> 13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
> leveldb::Slice const&) const+0x4d) [0x6e811d]
That looks like http://tracker.newdream.net/issues/2563 and the best
we have for that ticket is "looks like you have a corrupted leveldb
file". Is this reproducible with a freshly mkfs'ed data partition?
* Re: domino-style OSD crash
2012-07-03 19:42 ` Tommi Virtanen
@ 2012-07-03 20:54 ` Yann Dupont
2012-07-03 21:38 ` Tommi Virtanen
0 siblings, 1 reply; 25+ messages in thread
From: Yann Dupont @ 2012-07-03 20:54 UTC (permalink / raw)
To: Tommi Virtanen; +Cc: Sam Just, ceph-devel
On 03/07/2012 21:42, Tommi Virtanen wrote:
> On Tue, Jul 3, 2012 at 1:40 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>> I upgraded the kernel to 3.5.0-rc4 plus some patches; btrfs seems OK right
>> now.
>>
>> I tried to restart the OSDs with 0.47.3, then the next branch, and today with 0.48.
>>
>> 4 of the 8 nodes fail with the same message:
>>
>> ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
>> 1: /usr/bin/ceph-osd() [0x701929]
> ...
>> 13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
>> leveldb::Slice const&) const+0x4d) [0x6e811d]
> That looks like http://tracker.newdream.net/issues/2563 and the best
> we have for that ticket is "looks like you have a corrupted leveldb
> file". Is this reproducible with a freshly mkfs'ed data partition?
Probably not. I have multiple data volumes on each node (I was planning
xfs vs ext4 vs btrfs benchmarks before falling ill), and those nodes start
OK with another data partition.
It's very probable that there is corruption somewhere, due to a kernel bug,
probably triggered by btrfs.
Issue 2563 is probably the same.
I'd like to restart those nodes without reformatting them, not because the
data is valuable, but because if the same thing happens in production, a
method similar to "fsck" for a node could be of great value.
I saw the method to check the leveldb; I'll try it tomorrow, with no guarantees.
If I can repair it, do you think a crashed FS, as it is right now, would be
valuable to you for future reference, since I saw you can't reproduce the
problem? I can make an archive (or a btrfs dump?), but it will be quite
big.
Cheers,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
* Re: domino-style OSD crash
2012-07-03 20:54 ` Yann Dupont
@ 2012-07-03 21:38 ` Tommi Virtanen
2012-07-04 8:06 ` Yann Dupont
0 siblings, 1 reply; 25+ messages in thread
From: Tommi Virtanen @ 2012-07-03 21:38 UTC (permalink / raw)
To: Yann Dupont; +Cc: Sam Just, ceph-devel
On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> If I can repair it, do you think a crashed FS, as it is right now, would be
> valuable to you for future reference, since I saw you can't reproduce the
> problem? I can make an archive (or a btrfs dump?), but it will be quite
> big.
At this point, it's more about the upstream developers (of btrfs etc)
than us; we're on good terms with them but not experts on the on-disk
format(s). You might want to send an email to the relevant mailing
lists before wiping the disks.
* Re: domino-style OSD crash
2012-07-03 21:38 ` Tommi Virtanen
@ 2012-07-04 8:06 ` Yann Dupont
2012-07-04 16:21 ` Gregory Farnum
2012-07-09 17:43 ` Tommi Virtanen
0 siblings, 2 replies; 25+ messages in thread
From: Yann Dupont @ 2012-07-04 8:06 UTC (permalink / raw)
To: Tommi Virtanen; +Cc: Sam Just, ceph-devel
On 03/07/2012 23:38, Tommi Virtanen wrote:
> On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>> If I can repair it, do you think a crashed FS, as it is right now, would be
>> valuable to you for future reference, since I saw you can't reproduce the
>> problem? I can make an archive (or a btrfs dump?), but it will be quite
>> big.
> At this point, it's more about the upstream developers (of btrfs etc)
> than us; we're on good terms with them but not experts on the on-disk
> format(s). You might want to send an email to the relevant mailing
> lists before wiping the disks.
Well, I probably wasn't clear enough. I talked about a crashed FS, but I
was talking about ceph. The underlying FS (btrfs in that case) of one node
(and only one) PROBABLY crashed in the past, causing corruption of the
ceph data on this node, and then the subsequent crash of the other nodes.
RIGHT now, btrfs on this node is OK. I can access the filesystem without
errors.
For the moment, 4 of the 8 nodes refuse to restart.
One of the 4 nodes was the crashed node; the 3 others didn't have problems
with the underlying fs as far as I can tell.
So I think the scenario is:
One node had a problem with btrfs, leading first to a kernel problem,
probably corruption (on disk / in memory, maybe?), and ultimately to a
kernel oops. Before that final kernel oops, bad data was transmitted
to other (sane) nodes, leading to ceph-osd crashes on those
nodes.
If you think this scenario is highly improbable in real life (that is,
btrfs will probably be fixed for good, and then corruption can't
happen), that's OK.
But I wonder whether this scenario could be triggered by other problems,
with bad data transmitted to other sane nodes (a power outage, an
out-of-memory condition, a full disk... for example).
That's why I offered you a crashed ceph volume image (I shouldn't have
talked about a crashed fs, sorry for the confusion).
Talking about btrfs: there are a lot of fixes in btrfs between 3.4 and
3.5-rc. After the crash, I couldn't mount the btrfs volume. With 3.5-rc I
can, and there is no sign of a problem on it. That doesn't mean the data
there is safe, but I think it's a sign that at least some bugs have been
corrected in the btrfs code.
Cheers,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: domino-style OSD crash
2012-07-04 8:06 ` Yann Dupont
@ 2012-07-04 16:21 ` Gregory Farnum
2012-07-04 17:53 ` Yann Dupont
2012-07-09 17:43 ` Tommi Virtanen
1 sibling, 1 reply; 25+ messages in thread
From: Gregory Farnum @ 2012-07-04 16:21 UTC (permalink / raw)
To: Yann Dupont; +Cc: Tommi Virtanen, Sam Just, ceph-devel
On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:
> Le 03/07/2012 23:38, Tommi Virtanen a écrit :
> > > On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> > > In the case I could repair, do you think a crashed FS as it is right now is
> > > valuable for you, for future reference , as I saw you can't reproduce the
> > > problem ? I can make an archive (or a btrfs dump ?), but it will be quite
> > > big.
> >
> >
> > At this point, it's more about the upstream developers (of btrfs etc)
> > than us; we're on good terms with them but not experts on the on-disk
> > format(s). You might want to send an email to the relevant mailing
> > lists before wiping the disks.
>
>
> Well, I probably wasn't clear enough. I talked about crashed FS, but i
> was talking about ceph. The underlying FS (btrfs in that case) of 1 node
> (and only one) has PROBABLY crashed in the past, causing corruption in
> ceph data on this node, and then the subsequent crash of other nodes.
>
> RIGHT now btrfs on this node is OK. I can access the filesystem without
> errors.
>
> For the moment, on 8 nodes, 4 refuse to restart .
> 1 of the 4 nodes was the crashed node , the 3 others didn't had broblem
> with the underlying fs as far as I can tell.
>
> So I think the scenario is :
>
> One node had problem with btrfs, leading first to kernel problem ,
> probably corruption (in disk/ in memory maybe ?) ,and ultimately to a
> kernel oops. Before that ultimate kernel oops, bad data has been
> transmitted to other (sane) nodes, leading to ceph-osd crash on thoses
> nodes.
I don't think that's actually possible — the OSDs all do quite a lot of interpretation between what they get off the wire and what goes on disk. What you've got here are 4 corrupted LevelDB databases, and we pretty much can't do that through the interfaces we have. :/
>
> If you think this scenario is highly improbable in real life (that is,
> btrfs will probably be fixed for good, and then, corruption can't
> happen), it's ok.
>
> But I wonder if this scenario can be triggered with other problem, and
> bad data can be transmitted to other sane nodes (power outage, out of
> memory condition, disk full... for example)
>
> That's why I proposed you a crashed ceph volume image (I shouldn't have
> talked about a crashed fs, sorry for the confusion)
I appreciate the offer, but I don't think this will help much — it's a disk state managed by somebody else, not our logical state, which has broken. If we could figure out how that state got broken that'd be good, but a "ceph image" won't really help in doing so.
I wonder if maybe there's a confounding factor here — are all your nodes similar to each other, or are they running on different kinds of hardware? How did you do your Ceph upgrades? And what does ceph -s display when the cluster is running as well as it can?
-Greg
* Re: domino-style OSD crash
2012-07-04 16:21 ` Gregory Farnum
@ 2012-07-04 17:53 ` Yann Dupont
2012-07-05 21:32 ` Gregory Farnum
0 siblings, 1 reply; 25+ messages in thread
From: Yann Dupont @ 2012-07-04 17:53 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Tommi Virtanen, Sam Just, ceph-devel
Le 04/07/2012 18:21, Gregory Farnum a écrit :
> On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:
>> Le 03/07/2012 23:38, Tommi Virtanen a écrit :
>>> On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>>>> In the case I could repair, do you think a crashed FS as it is right now is
>>>> valuable for you, for future reference , as I saw you can't reproduce the
>>>> problem ? I can make an archive (or a btrfs dump ?), but it will be quite
>>>> big.
>>>
>>>
>>> At this point, it's more about the upstream developers (of btrfs etc)
>>> than us; we're on good terms with them but not experts on the on-disk
>>> format(s). You might want to send an email to the relevant mailing
>>> lists before wiping the disks.
>>
>>
>> Well, I probably wasn't clear enough. I talked about crashed FS, but i
>> was talking about ceph. The underlying FS (btrfs in that case) of 1 node
>> (and only one) has PROBABLY crashed in the past, causing corruption in
>> ceph data on this node, and then the subsequent crash of other nodes.
>>
>> RIGHT now btrfs on this node is OK. I can access the filesystem without
>> errors.
>>
>> For the moment, on 8 nodes, 4 refuse to restart .
>> 1 of the 4 nodes was the crashed node , the 3 others didn't had broblem
>> with the underlying fs as far as I can tell.
>>
>> So I think the scenario is :
>>
>> One node had problem with btrfs, leading first to kernel problem ,
>> probably corruption (in disk/ in memory maybe ?) ,and ultimately to a
>> kernel oops. Before that ultimate kernel oops, bad data has been
>> transmitted to other (sane) nodes, leading to ceph-osd crash on thoses
>> nodes.
> I don't think that's actually possible — the OSDs all do quite a lot of interpretation between what they get off the wire and what goes on disk. What you've got here are 4 corrupted LevelDB databases, and we pretty much can't do that through the interfaces we have. :/
ok, so as all nodes were identical, I probably hit a btrfs bug
(like an erroneous out-of-space condition) at more or less the same time.
And when 1 osd was out,
>
>>
>> If you think this scenario is highly improbable in real life (that is,
>> btrfs will probably be fixed for good, and then, corruption can't
>> happen), it's ok.
>>
>> But I wonder if this scenario can be triggered with other problem, and
>> bad data can be transmitted to other sane nodes (power outage, out of
>> memory condition, disk full... for example)
>>
>> That's why I proposed you a crashed ceph volume image (I shouldn't have
>> talked about a crashed fs, sorry for the confusion)
> I appreciate the offer, but I don't think this will help much — it's a disk state managed by somebody else, not our logical state, which has broken. If we could figure out how that state got broken that'd be good, but a "ceph image" won't really help in doing so.
ok, no problem. I'll restart from scratch, freshly formatted.
>
> I wonder if maybe there's a confounding factor here — are all your nodes similar to each other,
Yes. I designed the cluster that way. All nodes are identical hardware
(PowerEdge M610, 10G Intel Ethernet + Emulex Fibre Channel attached to
storage; 1 array for 2 OSD nodes, 1 controller dedicated to each OSD).
> or are they running on different kinds of hardware? How did you do your Ceph upgrades? What's ceph -s display when the cluster is running as best it can?
Ceph was running 0.47.2 at that time (Debian package for ceph). After
the crash I couldn't restart all the nodes. I tried 0.47.3 and now 0.48,
without success.
Nothing particular for upgrades; because Ceph is broken for the moment,
it was just apt-get upgrade with the new version.
ceph -s shows this:
root@label5:~# ceph -s
health HEALTH_WARN 260 pgs degraded; 793 pgs down; 785 pgs peering;
32 pgs recovering; 96 pgs stale; 793 pgs stuck inactive; 96 pgs stuck
stale; 1092 pgs stuck unclean; recovery 267286/2491140 degraded
(10.729%); 1814/1245570 unfound (0.146%)
monmap e1: 3 mons at
{chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0},
election epoch 12, quorum 0,1,2 chichibu,glenesk,karuizawa
osdmap e2404: 8 osds: 3 up, 3 in
pgmap v173701: 1728 pgs: 604 active+clean, 8 down, 5
active+recovering+remapped, 32 active+clean+replay, 11
active+recovering+degraded, 25 active+remapped, 710 down+peering, 222
active+degraded, 7 stale+active+recovering+degraded, 61
stale+down+peering, 20 stale+active+degraded, 6 down+remapped+peering, 8
stale+down+remapped+peering, 9 active+recovering; 4786 GB data, 7495 GB
used, 7280 GB / 15360 GB avail; 267286/2491140 degraded (10.729%);
1814/1245570 unfound (0.146%)
mdsmap e172: 1/1/1 up {0=karuizawa=up:replay}, 2 up:standby
BTW, after the 0.48 upgrade, there was a disk format conversion. 1 of
the 4 surviving OSDs didn't complete it:
2012-07-04 10:13:27.291541 7f8711099780 -1 filestore(/CEPH/data/osd.1)
FileStore::mount : stale version stamp detected: 2. Proceeding,
do_update is set, performing disk format upgrade.
2012-07-04 10:13:27.291618 7f8711099780 0 filestore(/CEPH/data/osd.1)
mount found snaps <3744666,3746725>
then nothing happens for hours; iotop shows constant disk usage:
6069 be/4 root 0.00 B/s 32.09 M/s 0.00 % 19.08 % ceph-osd -i
1 --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf
strace shows lots of syscalls like this:
[pid 6069] pread(25, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4101,
94950) = 4101
[pid 6069] pread(23, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4107,
49678) = 4107
[pid 6069] pread(36, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4110,
99797) = 4110
[pid 6069] pread(37, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4105,
8211) = 4105
[pid 6069] pread(25, "\0C\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4121,
99051) = 4121
[pid 6069] pread(36, "\0E\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4173,
103907) = 4173
[pid 6069] pread(37, "\0E\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4169,
12316) = 4169
[pid 6069] pread(37, "\0B\0_LEAF_\0002%e183%uTEMP.rb%e0%e1%"..., 4130,
16485) = 4130
[pid 6069] pread(36, "\0B\0_LEAF_\0002%e183%uTEMP.rb%e0%e1%"..., 4129,
108080) = 4129
Seems to loop indefinitely.
But it's another problem, I guess; maybe a consequence of the other problems.
Cheers.
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
* Re: domino-style OSD crash
2012-07-04 17:53 ` Yann Dupont
@ 2012-07-05 21:32 ` Gregory Farnum
2012-07-06 7:19 ` Yann Dupont
0 siblings, 1 reply; 25+ messages in thread
From: Gregory Farnum @ 2012-07-05 21:32 UTC (permalink / raw)
To: Yann Dupont, Sam Just; +Cc: ceph-devel
On Wed, Jul 4, 2012 at 10:53 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> Le 04/07/2012 18:21, Gregory Farnum a écrit :
>
>> On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:
>>>
>>> Le 03/07/2012 23:38, Tommi Virtanen a écrit :
>>>>
>>>> On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>>>>>
>>>>> In the case I could repair, do you think a crashed FS as it is right
>>>>> now is
>>>>> valuable for you, for future reference , as I saw you can't reproduce
>>>>> the
>>>>> problem ? I can make an archive (or a btrfs dump ?), but it will be
>>>>> quite
>>>>> big.
>>>>
>>>> At this point, it's more about the upstream developers (of btrfs
>>>> etc)
>>>> than us; we're on good terms with them but not experts on the on-disk
>>>> format(s). You might want to send an email to the relevant mailing
>>>> lists before wiping the disks.
>>>
>>> Well, I probably wasn't clear enough. I talked about crashed FS, but
>>> i
>>> was talking about ceph. The underlying FS (btrfs in that case) of 1 node
>>> (and only one) has PROBABLY crashed in the past, causing corruption in
>>> ceph data on this node, and then the subsequent crash of other nodes.
>>> RIGHT now btrfs on this node is OK. I can access the filesystem without
>>> errors.
>>> For the moment, on 8 nodes, 4 refuse to restart .
>>> 1 of the 4 nodes was the crashed node , the 3 others didn't had broblem
>>> with the underlying fs as far as I can tell.
>>> So I think the scenario is :
>>> One node had problem with btrfs, leading first to kernel problem ,
>>> probably corruption (in disk/ in memory maybe ?) ,and ultimately to a
>>> kernel oops. Before that ultimate kernel oops, bad data has been
>>> transmitted to other (sane) nodes, leading to ceph-osd crash on thoses
>>> nodes.
>>
>> I don't think that's actually possible — the OSDs all do quite a lot of
>> interpretation between what they get off the wire and what goes on disk.
>> What you've got here are 4 corrupted LevelDB databases, and we pretty much
>> can't do that through the interfaces we have. :/
>
>
> ok, so as all nodes were identical, I probably have hit a btrfs bug (like a
> erroneous out of space ) in more or less the same time. And when 1 osd was
> out,
>
>>
>>>
>>> If you think this scenario is highly improbable in real life (that is,
>>> btrfs will probably be fixed for good, and then, corruption can't
>>> happen), it's ok.
>>> But I wonder if this scenario can be triggered with other problem, and
>>> bad data can be transmitted to other sane nodes (power outage, out of
>>> memory condition, disk full... for example)
>>> That's why I proposed you a crashed ceph volume image (I shouldn't have
>>> talked about a crashed fs, sorry for the confusion)
>>
>> I appreciate the offer, but I don't think this will help much — it's a
>> disk state managed by somebody else, not our logical state, which has
>> broken. If we could figure out how that state got broken that'd be good, but
>> a "ceph image" won't really help in doing so.
>
> ok, no problem. I'll restart from scratch, freshly formated.
>
>>
>> I wonder if maybe there's a confounding factor here — are all your nodes
>> similar to each other,
>
>
> Yes. I designed the cluster that way. All nodes are identical hardware
> (powerEdge M610, 10G intel ethernet + emulex fibre channel attached to
> storage (1 Array for 2 OSD nodes, 1 controller dedicated for each OSD)
Oh, interesting. Are the broken nodes all on the same set of arrays?
>
>
>> or are they running on different kinds of hardware? How did you do your
>> Ceph upgrades? What's ceph -s display when the cluster is running as best it
>> can?
>
>
> Ceph was running 0.47.2 at that time - (debian package for ceph). After the
> crash I couldn't restart all the nodes. Tried 0.47.3 and now 0.48 without
> success.
>
> Nothing particular for upgrades, because for the moment ceph is broken, so
> just apt-get upgrade with new version.
>
>
> ceph -s show that :
>
> root@label5:~# ceph -s
> health HEALTH_WARN 260 pgs degraded; 793 pgs down; 785 pgs peering; 32
> pgs recovering; 96 pgs stale; 793 pgs stuck inactive; 96 pgs stuck stale;
> 1092 pgs stuck unclean; recovery 267286/2491140 degraded (10.729%);
> 1814/1245570 unfound (0.146%)
> monmap e1: 3 mons at
> {chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0},
> election epoch 12, quorum 0,1,2 chichibu,glenesk,karuizawa
> osdmap e2404: 8 osds: 3 up, 3 in
> pgmap v173701: 1728 pgs: 604 active+clean, 8 down, 5
> active+recovering+remapped, 32 active+clean+replay, 11
> active+recovering+degraded, 25 active+remapped, 710 down+peering, 222
> active+degraded, 7 stale+active+recovering+degraded, 61 stale+down+peering,
> 20 stale+active+degraded, 6 down+remapped+peering, 8
> stale+down+remapped+peering, 9 active+recovering; 4786 GB data, 7495 GB
> used, 7280 GB / 15360 GB avail; 267286/2491140 degraded (10.729%);
> 1814/1245570 unfound (0.146%)
> mdsmap e172: 1/1/1 up {0=karuizawa=up:replay}, 2 up:standby
Okay, that looks about how I'd expect if half your OSDs are down.
>
>
>
> BTW, After the 0.48 upgrade, there was a disk format conversion. 1 of the 4
> surviving OSD didn't complete :
>
> 2012-07-04 10:13:27.291541 7f8711099780 -1 filestore(/CEPH/data/osd.1)
> FileStore::mount : stale version stamp detected: 2. Proceeding, do_update is
> set, performing disk format upgrade.
> 2012-07-04 10:13:27.291618 7f8711099780 0 filestore(/CEPH/data/osd.1) mount
> found snaps <3744666,3746725>
>
> then , nothing happens for hours, iotop show constant disk usage :
> 6069 be/4 root 0.00 B/s 32.09 M/s 0.00 % 19.08 % ceph-osd -i 1
> --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf
>
> strace show lots of syscall like this :
>
> [pid 6069] pread(25, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4101,
> 94950) = 4101
> [pid 6069] pread(23, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4107,
> 49678) = 4107
> [pid 6069] pread(36, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4110,
> 99797) = 4110
> [pid 6069] pread(37, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4105, 8211)
> = 4105
> [pid 6069] pread(25, "\0C\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4121,
> 99051) = 4121
> [pid 6069] pread(36, "\0E\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4173,
> 103907) = 4173
> [pid 6069] pread(37, "\0E\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4169,
> 12316) = 4169
> [pid 6069] pread(37, "\0B\0_LEAF_\0002%e183%uTEMP.rb%e0%e1%"..., 4130,
> 16485) = 4130
> [pid 6069] pread(36, "\0B\0_LEAF_\0002%e183%uTEMP.rb%e0%e1%"..., 4129,
> 108080) = 4129
Sam, does this look like something of ours to you?
* Re: domino-style OSD crash
2012-07-05 21:32 ` Gregory Farnum
@ 2012-07-06 7:19 ` Yann Dupont
2012-07-06 17:01 ` Gregory Farnum
0 siblings, 1 reply; 25+ messages in thread
From: Yann Dupont @ 2012-07-06 7:19 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Sam Just, ceph-devel
Le 05/07/2012 23:32, Gregory Farnum a écrit :
[...]
>> ok, so as all nodes were identical, I probably have hit a btrfs bug (like a
>> erroneous out of space ) in more or less the same time. And when 1 osd was
>> out,
OH, I didn't finish the sentence... When 1 OSD was out, the missing data
was copied to other nodes, probably accelerating the btrfs problem on
those nodes (I suspect erroneous out-of-space conditions).
I've reformatted the OSDs with XFS. Performance is slightly worse for the
moment (well, it depends on the workload, and maybe the lack of syncfs is
to blame), but at least I hope to have a rock-solid storage layer. BTW,
I've managed to keep the faulty btrfs volumes.
[...]
>>> I wonder if maybe there's a confounding factor here — are all your nodes
>>> similar to each other,
>> Yes. I designed the cluster that way. All nodes are identical hardware
>> (powerEdge M610, 10G intel ethernet + emulex fibre channel attached to
>> storage (1 Array for 2 OSD nodes, 1 controller dedicated for each OSD)
> Oh, interesting. Are the broken nodes all on the same set of arrays?
No. There are 4 completely independent RAID arrays, in 4 different
locations. They are similar (same brand & model, but slightly different
disks, and 1 different firmware), and all arrays are multipathed. I don't
think the RAID array is the problem. We have used these particular models
for 2-3 years, and in the logs I don't see any problem that could be
caused by the storage itself (like SCSI or multipath errors).
Cheers,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
* Re: domino-style OSD crash
2012-07-06 7:19 ` Yann Dupont
@ 2012-07-06 17:01 ` Gregory Farnum
2012-07-07 8:19 ` Yann Dupont
0 siblings, 1 reply; 25+ messages in thread
From: Gregory Farnum @ 2012-07-06 17:01 UTC (permalink / raw)
To: Yann Dupont; +Cc: Sam Just, ceph-devel
On Fri, Jul 6, 2012 at 12:19 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> Le 05/07/2012 23:32, Gregory Farnum a écrit :
>
> [...]
>
>>> ok, so as all nodes were identical, I probably have hit a btrfs bug (like
>>> a
>>> erroneous out of space ) in more or less the same time. And when 1 osd
>>> was
>>> out,
>
>
> OH , I didn't finish the sentence... When 1 osd was out, missing data was
> copied on another nodes, probably speeding btrfs problem on those nodes (I
> suspect erroneous out of space conditions)
Ah. How full are/were the disks?
>
> I've reformatted OSD with xfs. Performance is slightly worse for the moment
> (well, depend on the workload, and maybe lack of syncfs is to blame), but at
> least I hope to have the storage layer rock-solid. BTW, I've managed to keep
> the faulty btrfs volumes .
>
> [...]
>
>
>>>> I wonder if maybe there's a confounding factor here — are all your nodes
>>>> similar to each other,
>>>
>>> Yes. I designed the cluster that way. All nodes are identical hardware
>>> (powerEdge M610, 10G intel ethernet + emulex fibre channel attached to
>>> storage (1 Array for 2 OSD nodes, 1 controller dedicated for each OSD)
>>
>> Oh, interesting. Are the broken nodes all on the same set of arrays?
>
>
> No. There are 4 completely independant raid arrays, in 4 different
> locations. They are similar (same brand & model, but slighltly different
> disks, and 1 different firmware), all arrays are multipathed. I don't think
> the raid array is the problem. We use those particular models since 2/3
> years, and in the logs I don't see any problem that can be caused by the
> storage itself (like scsi or multipath errors)
I must have misunderstood then. What did you mean by "1 Array for 2 OSD nodes"?
* Re: domino-style OSD crash
2012-07-06 17:01 ` Gregory Farnum
@ 2012-07-07 8:19 ` Yann Dupont
2012-07-09 17:14 ` Samuel Just
0 siblings, 1 reply; 25+ messages in thread
From: Yann Dupont @ 2012-07-07 8:19 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Sam Just, ceph-devel
Le 06/07/2012 19:01, Gregory Farnum a écrit :
> On Fri, Jul 6, 2012 at 12:19 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>> Le 05/07/2012 23:32, Gregory Farnum a écrit :
>>
>> [...]
>>
>>>> ok, so as all nodes were identical, I probably have hit a btrfs bug (like
>>>> a
>>>> erroneous out of space ) in more or less the same time. And when 1 osd
>>>> was
>>>> out,
>>
>> OH , I didn't finish the sentence... When 1 osd was out, missing data was
>> copied on another nodes, probably speeding btrfs problem on those nodes (I
>> suspect erroneous out of space conditions)
> Ah. How full are/were the disks?
The OSD nodes were mostly below 50% full (all are 5 TB volumes):
osd.0 : 31%
osd.1 : 31%
osd.2 : 39%
osd.3 : 65%
no osd.4 :)
osd.5 : 35%
osd.6 : 60%
osd.7 : 42%
osd.8 : 34%
All the volumes were using btrfs with lzo compression.
[...]
>
> Oh, interesting. Are the broken nodes all on the same set of arrays?
>>
>> No. There are 4 completely independant raid arrays, in 4 different
>> locations. They are similar (same brand & model, but slighltly different
>> disks, and 1 different firmware), all arrays are multipathed. I don't think
>> the raid array is the problem. We use those particular models since 2/3
>> years, and in the logs I don't see any problem that can be caused by the
>> storage itself (like scsi or multipath errors)
> I must have misunderstood then. What did you mean by "1 Array for 2 OSD nodes"?
I have 8 OSD nodes, in 4 different locations (several km apart). In each
location I have 2 nodes and 1 RAID array.
In each location, the RAID array has 16 × 2 TB disks and 2 controllers
with 4× 8 Gb FC channels each. The 16 disks are organized in RAID 5
(8 disks for one set, 7 disks for the other). Each RAID set is primarily
attached to 1 controller, and each OSD node in the location has access to
the controller via 2 distinct paths.
There was no correlation between the failed nodes & the RAID arrays.
Cheers,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
* Re: domino-style OSD crash
2012-07-07 8:19 ` Yann Dupont
@ 2012-07-09 17:14 ` Samuel Just
2012-07-10 9:46 ` Yann Dupont
0 siblings, 1 reply; 25+ messages in thread
From: Samuel Just @ 2012-07-09 17:14 UTC (permalink / raw)
To: Yann Dupont; +Cc: Gregory Farnum, ceph-devel
Can you restart the node that failed to complete the upgrade with
debug filestore = 20
debug osd = 20
and post the log after an hour or so of running? The upgrade process
might legitimately take a while.
-Sam
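[Archive note: a sketch of where these options would go. The section name osd.1 is just an example matching the stuck OSD discussed above; the exact section and restart command depend on the local setup.]

```ini
; Hypothetical ceph.conf fragment: verbose logging for the OSD that is
; stuck in the on-disk format conversion. 20 is the most verbose level.
[osd.1]
        debug filestore = 20
        debug osd = 20
```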
On Sat, Jul 7, 2012 at 1:19 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> Le 06/07/2012 19:01, Gregory Farnum a écrit :
>
>> On Fri, Jul 6, 2012 at 12:19 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr>
>> wrote:
>>>
>>> Le 05/07/2012 23:32, Gregory Farnum a écrit :
>>>
>>> [...]
>>>
>>>>> ok, so as all nodes were identical, I probably have hit a btrfs bug
>>>>> (like
>>>>> a
>>>>> erroneous out of space ) in more or less the same time. And when 1 osd
>>>>> was
>>>>> out,
>>>
>>>
>>> OH , I didn't finish the sentence... When 1 osd was out, missing data was
>>> copied on another nodes, probably speeding btrfs problem on those nodes
>>> (I
>>> suspect erroneous out of space conditions)
>>
>> Ah. How full are/were the disks?
>
>
> The OSD nodes were below 50 % (all are 5 To volumes):
>
> osd.0 : 31%
> osd.1 : 31%
> osd.2 : 39%
> osd.3 : 65%
> no osd.4 :)
> osd.5 : 35%
> osd.6 : 60%
> osd.7 : 42%
> osd.8 : 34%
>
> all the volumes were using btrfs with lzo compress.
>
> [...]
>
>>
>> Oh, interesting. Are the broken nodes all on the same set of arrays?
>>>
>>>
>>> No. There are 4 completely independant raid arrays, in 4 different
>>> locations. They are similar (same brand & model, but slighltly different
>>> disks, and 1 different firmware), all arrays are multipathed. I don't
>>> think
>>> the raid array is the problem. We use those particular models since 2/3
>>> years, and in the logs I don't see any problem that can be caused by the
>>> storage itself (like scsi or multipath errors)
>>
>> I must have misunderstood then. What did you mean by "1 Array for 2 OSD
>> nodes"?
>
>
> I have 8 osd nodes, in 4 different locations (several km away). In each
> location I have 2 nodes and 1 raid Array.
> On each location, each raid array has 16 2To disks, 2 controllers with 4x 8
> Gb FC channels each. The 16 disks are organized in Raid 5 (8 disks for one,
> 7 disks for the orher). Each raid set is primary attached to 1 controller,
> and each osd node on the location has acces to the controller with 2
> distinct paths.
>
> There were no correlation between failed nodes & raid array.
>
>
> Cheers,
>
> --
> Yann Dupont - Service IRTS, DSI Université de Nantes
> Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
>
* Re: domino-style OSD crash
2012-07-04 8:06 ` Yann Dupont
2012-07-04 16:21 ` Gregory Farnum
@ 2012-07-09 17:43 ` Tommi Virtanen
2012-07-09 19:05 ` Yann Dupont
1 sibling, 1 reply; 25+ messages in thread
From: Tommi Virtanen @ 2012-07-09 17:43 UTC (permalink / raw)
To: Yann Dupont; +Cc: Sam Just, ceph-devel
On Wed, Jul 4, 2012 at 1:06 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> Well, I probably wasn't clear enough. I talked about crashed FS, but i was
> talking about ceph. The underlying FS (btrfs in that case) of 1 node (and
> only one) has PROBABLY crashed in the past, causing corruption in ceph data
> on this node, and then the subsequent crash of other nodes.
>
> RIGHT now btrfs on this node is OK. I can access the filesystem without
> errors.
But the LevelDB isn't. Its contents got corrupted, somehow, somewhere,
and it really is up to the LevelDB library to tolerate those errors;
we have a simple get/put interface we use, and LevelDB is triggering
an internal error.
> One node had problem with btrfs, leading first to kernel problem , probably
> corruption (in disk/ in memory maybe ?) ,and ultimately to a kernel oops.
> Before that ultimate kernel oops, bad data has been transmitted to other
> (sane) nodes, leading to ceph-osd crash on thoses nodes.
The LevelDB binary contents are not transferred over to other nodes;
this kind of corruption would not spread over the Ceph clustering
mechanisms. It's more likely that you have 4 independently corrupted
LevelDBs. Something in the workload Ceph runs makes that corruption
quite likely.
The information here isn't enough to say whether the cause of the
corruption is btrfs or LevelDB, but the recovery needs to be handled by
LevelDB -- and upstream is working on making it more robust:
http://code.google.com/p/leveldb/issues/detail?id=97
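[Archive note: the narrow interface described above can be sketched like this. This is illustrative Python, not Ceph's actual code, and the class and keys are invented for the example. The point is that the caller only issues whole-value get/put operations, so the store's internal file format — where the corruption lived — is never exposed to the caller and cannot be replicated to peers through Ceph's own mechanisms.]

```python
# Illustrative sketch only -- not Ceph's real code. Callers see a narrow
# get/put interface; the library's internal on-disk format stays hidden,
# so corruption inside it is only visible when the library itself errors.
class NarrowKV:
    """A get/put key-value store, like the interface Ceph uses over LevelDB."""

    def __init__(self):
        self._data = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> bytes:
        # A real library (LevelDB) would raise an internal error here if
        # its on-disk files were corrupt; the caller can only observe that
        # failure, not repair the underlying files.
        return self._data[key]


kv = NarrowKV()
kv.put(b"pglog/2.59/head", b"v 1438'9416")
print(kv.get(b"pglog/2.59/head").decode())
```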
* Re: domino-style OSD crash
2012-07-09 17:43 ` Tommi Virtanen
@ 2012-07-09 19:05 ` Yann Dupont
2012-07-09 19:48 ` Tommi Virtanen
0 siblings, 1 reply; 25+ messages in thread
From: Yann Dupont @ 2012-07-09 19:05 UTC (permalink / raw)
To: Tommi Virtanen; +Cc: Sam Just, ceph-devel
Le 09/07/2012 19:43, Tommi Virtanen a écrit :
> On Wed, Jul 4, 2012 at 1:06 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>> Well, I probably wasn't clear enough. I talked about crashed FS, but i was
>> talking about ceph. The underlying FS (btrfs in that case) of 1 node (and
>> only one) has PROBABLY crashed in the past, causing corruption in ceph data
>> on this node, and then the subsequent crash of other nodes.
>>
>> RIGHT now btrfs on this node is OK. I can access the filesystem without
>> errors.
> But the LevelDB isn't. It's contents got corrupted, somehow somewhere,
> and it really is up to the LevelDB library to tolerate those errors;
> we have a simple get/put interface we use, and LevelDB is triggering
> an internal error.
Yes, understood.
>> One node had problem with btrfs, leading first to kernel problem , probably
>> corruption (in disk/ in memory maybe ?) ,and ultimately to a kernel oops.
>> Before that ultimate kernel oops, bad data has been transmitted to other
>> (sane) nodes, leading to ceph-osd crash on thoses nodes.
> The LevelDB binary contents are not transferred over to other nodes;
OK, thanks for the clarification.
> this kind of corruption would not spread over the Ceph clustering
> mechanisms. It's more likely that you have 4 independently corrupted
> LevelDBs. Something in the workload Ceph runs makes that corruption
> quite likely.
Very likely: since I reformatted my nodes with XFS, I haven't had any
problems so far.
>
> The information here isn't enough to say whether the cause of the
> corruption is btrfs or LevelDB, but the recovery needs to be handled by
> LevelDB -- and upstream is working on making it more robust:
> http://code.google.com/p/leveldb/issues/detail?id=97
Yes, I saw this. It's very important. Sometimes, s... happens. Given
the size ceph volumes can reach, having a tool to restart damaged
nodes (for whatever reason) is a must.
Thanks for the time you took to answer. It's much clearer for me now.
Cheers,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: domino-style OSD crash
2012-07-09 19:05 ` Yann Dupont
@ 2012-07-09 19:48 ` Tommi Virtanen
0 siblings, 0 replies; 25+ messages in thread
From: Tommi Virtanen @ 2012-07-09 19:48 UTC (permalink / raw)
To: Yann Dupont; +Cc: Sam Just, ceph-devel
On Mon, Jul 9, 2012 at 12:05 PM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>> The information here isn't enough to say whether the cause of the
>> corruption is btrfs or LevelDB, but the recovery needs to be handled by
>> LevelDB -- and upstream is working on making it more robust:
>> http://code.google.com/p/leveldb/issues/detail?id=97
>
> Yes, I saw this. It's very important. Sometimes, s... happens. Given the
> size ceph volumes can reach, having a tool to restart damaged nodes (for
> whatever reason) is a must.
>
> Thanks for the time you took to answer. It's much clearer for me now.
If it doesn't recover, you re-format the disk and thereby throw away
the contents. Not really all that different from handling hardware
failure. That's why we have replication.
* Re: domino-style OSD crash
2012-07-09 17:14 ` Samuel Just
@ 2012-07-10 9:46 ` Yann Dupont
2012-07-10 15:56 ` Tommi Virtanen
0 siblings, 1 reply; 25+ messages in thread
From: Yann Dupont @ 2012-07-10 9:46 UTC (permalink / raw)
To: Samuel Just; +Cc: Gregory Farnum, ceph-devel
On 09/07/2012 19:14, Samuel Just wrote:
> Can you restart the node that failed to complete the upgrade with
Well, it's a little bit complicated; I now run those nodes with XFS,
and I have long-running jobs on them right now, so I can't stop the ceph
cluster at the moment.
As I've kept the original broken btrfs volumes, I tried this morning
to run the old osds in parallel, using the $cluster variable. I only
have partial success.
I tried using different ports for the mons, but ceph wants to use the old
mon map. I can edit it (epoch 1) but it seems to use 'latest' instead,
the format isn't compatible with monmaptool, and I don't know how to
"inject" the modified one on a non-running cluster.
Anyway, the osd seems to start fine, and I can reproduce the bug:
> debug filestore = 20
> debug osd = 20
>
I've put them in [global]; is that sufficient?
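For reference, a minimal ceph.conf sketch with those two debug settings; putting them in [global] applies them to every daemon, while an [osd] section would scope them to the OSDs only (the section choice here is an illustration, not from the thread):

```ini
; sketch: the requested debug settings in [global]; an [osd] section
; would be a narrower alternative scoped to the OSD daemons
[global]
    debug filestore = 20
    debug osd = 20
```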
>
> and post the log after an hour or so of running? The upgrade process
> might legitimately take a while.
> -Sam
Only 15 minutes running, but ceph-osd is consuming lots of CPU, and a
strace shows lots of pread calls.
Here is the log:
[..]
2012-07-10 11:33:29.560052 7f3e615ac780 0 filestore(/CEPH-PROD/data/osd.1) mount syncfs(2) syscall not support by glibc
2012-07-10 11:33:29.560062 7f3e615ac780 0 filestore(/CEPH-PROD/data/osd.1) mount no syncfs(2), but the btrfs SYNC ioctl will suffice
2012-07-10 11:33:29.560172 7f3e615ac780 -1 filestore(/CEPH-PROD/data/osd.1) FileStore::mount : stale version stamp detected: 2. Proceeding, do_update is set, performing disk format upgrade.
2012-07-10 11:33:29.560233 7f3e615ac780 0 filestore(/CEPH-PROD/data/osd.1) mount found snaps <3744666,3746725>
2012-07-10 11:33:29.560263 7f3e615ac780 10 filestore(/CEPH-PROD/data/osd.1) current/ seq was 3746725
2012-07-10 11:33:29.560267 7f3e615ac780 10 filestore(/CEPH-PROD/data/osd.1) most recent snap from <3744666,3746725> is 3746725
2012-07-10 11:33:29.560280 7f3e615ac780 10 filestore(/CEPH-PROD/data/osd.1) mount rolling back to consistent snap 3746725
2012-07-10 11:33:29.839281 7f3e615ac780 5 filestore(/CEPH-PROD/data/osd.1) mount op_seq is 3746725
... and nothing more.
I'll let it run for 3 hours. If I get another message, I'll let
you know.
Cheers,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
* Re: domino-style OSD crash
2012-07-10 9:46 ` Yann Dupont
@ 2012-07-10 15:56 ` Tommi Virtanen
2012-07-10 16:39 ` Yann Dupont
0 siblings, 1 reply; 25+ messages in thread
From: Tommi Virtanen @ 2012-07-10 15:56 UTC (permalink / raw)
To: Yann Dupont; +Cc: Samuel Just, Gregory Farnum, ceph-devel
On Tue, Jul 10, 2012 at 2:46 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> As I've kept the original broken btrfs volumes, I tried this morning to
> run the old osds in parallel, using the $cluster variable. I only have
> partial success.
The cluster mechanism was never intended for moving existing osds to
other clusters. Trying that might not be a good idea.
* Re: domino-style OSD crash
2012-07-10 15:56 ` Tommi Virtanen
@ 2012-07-10 16:39 ` Yann Dupont
2012-07-10 17:11 ` Tommi Virtanen
0 siblings, 1 reply; 25+ messages in thread
From: Yann Dupont @ 2012-07-10 16:39 UTC (permalink / raw)
To: Tommi Virtanen; +Cc: Samuel Just, Gregory Farnum, ceph-devel
On 10/07/2012 17:56, Tommi Virtanen wrote:
> On Tue, Jul 10, 2012 at 2:46 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>> As I've kept the original broken btrfs volumes, I tried this morning to
>> run the old osds in parallel, using the $cluster variable. I only have
>> partial success.
> The cluster mechanism was never intended for moving existing osds to
> other clusters. Trying that might not be a good idea.
OK, good to know. I saw that the remaining maps could lead to problems,
but in two words, what are the other associated risks? Basically, if I use
2 distinct config files, with different & non-overlapping paths, and
different ports for the OSDs, MDSes & MONs, do we basically have 2 distinct
and independent instances?
By the way, is using 2 mon instances with different ports supported?
Cheers,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
* Re: domino-style OSD crash
2012-07-10 16:39 ` Yann Dupont
@ 2012-07-10 17:11 ` Tommi Virtanen
2012-07-10 17:36 ` Yann Dupont
0 siblings, 1 reply; 25+ messages in thread
From: Tommi Virtanen @ 2012-07-10 17:11 UTC (permalink / raw)
To: Yann Dupont; +Cc: Samuel Just, Gregory Farnum, ceph-devel
On Tue, Jul 10, 2012 at 9:39 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>> The cluster mechanism was never intended for moving existing osds to
>> other clusters. Trying that might not be a good idea.
> OK, good to know. I saw that the remaining maps could lead to problems, but
> in two words, what are the other associated risks? Basically, if I use 2
> distinct config files, with different & non-overlapping paths, and different
> ports for the OSDs, MDSes & MONs, do we basically have 2 distinct and
> independent instances?
Fundamentally, it comes down to this: the two clusters will still have
the same fsid, and you won't be isolated from configuration errors or
leftover state (such as the monmap) in any way. There's a high chance
that your "let's poke around and debug" cluster wrecks your healthy
cluster.
> By the way, is using 2 mon instances with different ports supported?
Monitors are identified by ip:port. You can have multiple monitors bind
to the same IP address, as long as they get separate ports.
Naturally, this practically means giving up on high availability.
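As a sketch of the port arrangement described above, a second-cluster config could give each extra mon instance a non-default port on the same host (the hostnames, IPs, and file name here are assumptions for illustration, not from the thread):

```ini
; hypothetical /etc/ceph/old.conf for the second ("old") cluster;
; the live cluster keeps /etc/ceph/ceph.conf and the default port 6789
[mon.a]
    host = mon1
    mon addr = 192.168.0.1:6790   ; non-default port, same IP as the live mon
[mon.b]
    host = mon2
    mon addr = 192.168.0.2:6790
[mon.c]
    host = mon3
    mon addr = 192.168.0.3:6790
```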
* Re: domino-style OSD crash
2012-07-10 17:11 ` Tommi Virtanen
@ 2012-07-10 17:36 ` Yann Dupont
2012-07-10 18:16 ` Tommi Virtanen
0 siblings, 1 reply; 25+ messages in thread
From: Yann Dupont @ 2012-07-10 17:36 UTC (permalink / raw)
To: Tommi Virtanen; +Cc: Samuel Just, Gregory Farnum, ceph-devel
On 10/07/2012 19:11, Tommi Virtanen wrote:
> On Tue, Jul 10, 2012 at 9:39 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>>> The cluster mechanism was never intended for moving existing osds to
>>> other clusters. Trying that might not be a good idea.
>> OK, good to know. I saw that the remaining maps could lead to problems, but
>> in two words, what are the other associated risks? Basically, if I use 2
>> distinct config files, with different & non-overlapping paths, and different
>> ports for the OSDs, MDSes & MONs, do we basically have 2 distinct and
>> independent instances?
> Fundamentally, it comes down to this: the two clusters will still have
> the same fsid, and you won't be isolated from configuration errors or
Ah, I understand. This is not the case; see:
root@chichibu:~# cat /CEPH/data/osd.0/fsid
f00139fe-478e-4c50-80e2-f7cb359100d4
root@chichibu:~# cat /CEPH-PROD/data/osd.0/fsid
43afd025-330e-4aa8-9324-3e9b0afce794
(/CEPH-PROD is the old btrfs volume; /CEPH is the new xfs volume,
completely redone & reformatted with mkcephfs.) The volumes are totally
independent.
If you want the gory details:
root@chichibu:~# lvs
  LV        VG             Attr   LSize   Origin Snap% Move Log Copy% Convert
  ceph-osd  LocalDisk      -wi-a- 225,00g
  mon-btrfs LocalDisk      -wi-ao  10,00g
  mon-xfs   LocalDisk      -wi-ao  10,00g
  data      ceph-chichibu  -wi-ao   5,00t   <- OLD btrfs, mounted on /CEPH-PROD
  data      xceph-chichibu -wi-ao   4,50t   <- NEW xfs, mounted on /CEPH
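The fsid check above can be scripted as a quick sanity test. A sketch (the UUIDs are the ones shown above; the temp dirs stand in for the real OSD data dirs such as /CEPH-PROD/data/osd.0):

```shell
# sketch: each OSD data dir carries an fsid file naming its cluster;
# differing fsids mean the stores belong to independent clusters
old=$(mktemp -d)
new=$(mktemp -d)
echo 43afd025-330e-4aa8-9324-3e9b0afce794 > "$old/fsid"   # e.g. /CEPH-PROD/data/osd.0/fsid
echo f00139fe-478e-4c50-80e2-f7cb359100d4 > "$new/fsid"   # e.g. /CEPH/data/osd.0/fsid
if [ "$(cat "$old/fsid")" != "$(cat "$new/fsid")" ]; then
    echo "independent clusters"
else
    echo "same cluster: do not run both!"
fi
```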
> leftover state (such as the monmap) in any way. There's a high chance
> that your "let's poke around and debug" cluster wrecks your healthy
> cluster.
Yes I understand the risk.
>> By the way, is using 2 mon instances with different ports supported?
> Monitors are identified by ip:port. You can have multiple monitors bind
> to the same IP address, as long as they get separate ports.
>
> Naturally, this practically means giving up on high availability.
The idea is not just having 2 mons. I'll still use 3 different machines
for the mons, but with 2 mon instances on each: one for the current ceph,
the other for the old ceph.
2x3 mons.
Cheers,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
* Re: domino-style OSD crash
2012-07-10 17:36 ` Yann Dupont
@ 2012-07-10 18:16 ` Tommi Virtanen
0 siblings, 0 replies; 25+ messages in thread
From: Tommi Virtanen @ 2012-07-10 18:16 UTC (permalink / raw)
To: Yann Dupont; +Cc: Samuel Just, Gregory Farnum, ceph-devel
On Tue, Jul 10, 2012 at 10:36 AM, Yann Dupont
<Yann.Dupont@univ-nantes.fr> wrote:
>> Fundamentally, it comes down to this: the two clusters will still have
>> the same fsid, and you won't be isolated from configuration errors or
> (/CEPH-PROD is the old btrfs volume; /CEPH is the new xfs volume, completely
> redone & reformatted with mkcephfs.) The volumes are totally independent.
Ahh, you re-created the monitors too. That changes things: then you
have a new random fsid. I had understood you only re-mkfsed the osds.
Done like that, your real worry is just the remembered state of
monmaps, osdmaps, etc. If the daemons accidentally talk to the wrong
cluster, the fsid *should* protect you from damage; they should get
rejected. Similarly, if you use cephx authentication, the keys won't
match either.
>> Naturally, this practically means giving up on high availability.
> The idea is not just having 2 mons. I'll still use 3 different machines for
> the mons, but with 2 mon instances on each: one for the current ceph, the
> other for the old ceph.
> 2x3 mons.
That should be perfectly doable.
end of thread, other threads:[~2012-07-10 18:16 UTC | newest]
Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-04 8:44 domino-style OSD crash Yann Dupont
2012-06-04 16:16 ` Tommi Virtanen
2012-06-04 17:40 ` Sam Just
2012-06-04 18:34 ` Greg Farnum
2012-07-03 8:40 ` Yann Dupont
2012-07-03 19:42 ` Tommi Virtanen
2012-07-03 20:54 ` Yann Dupont
2012-07-03 21:38 ` Tommi Virtanen
2012-07-04 8:06 ` Yann Dupont
2012-07-04 16:21 ` Gregory Farnum
2012-07-04 17:53 ` Yann Dupont
2012-07-05 21:32 ` Gregory Farnum
2012-07-06 7:19 ` Yann Dupont
2012-07-06 17:01 ` Gregory Farnum
2012-07-07 8:19 ` Yann Dupont
2012-07-09 17:14 ` Samuel Just
2012-07-10 9:46 ` Yann Dupont
2012-07-10 15:56 ` Tommi Virtanen
2012-07-10 16:39 ` Yann Dupont
2012-07-10 17:11 ` Tommi Virtanen
2012-07-10 17:36 ` Yann Dupont
2012-07-10 18:16 ` Tommi Virtanen
2012-07-09 17:43 ` Tommi Virtanen
2012-07-09 19:05 ` Yann Dupont
2012-07-09 19:48 ` Tommi Virtanen