* domino-style OSD crash
@ 2012-06-04  8:44 Yann Dupont
  2012-06-04 16:16 ` Tommi Virtanen
  0 siblings, 1 reply; 25+ messages in thread
From: Yann Dupont @ 2012-06-04  8:44 UTC (permalink / raw)
  To: ceph-devel

Hello,
Besides the performance inconsistency (see the other thread titled "poor
OSD performance using kernel 3.4"), for which I promised some tests (I
will run them this afternoon), we tried this weekend to stress-test Ceph
by making backups with Bacula onto a 15 TB RBD volume (8 OSDs, using 8
physical machines).

Results: it worked like a charm for two days, apart from btrfs warning
messages; then the OSDs began to crash one after another, 'domino style'.

This morning, only 2 of the 8 OSDs are left.

One of the physical machines was in a kernel-oops state. Nothing was
logged remotely, so I don't know what happened; there was no clear stack
trace. I suspect btrfs, but I have no proof.

This node (osd.7) seems to have been the first one to crash; it triggered
recovery between the OSDs and then led to the cascading OSD crashes.

The other physical machines are still up, but with no OSD running. Here
are some traces found in the OSD logs:

    -3> 2012-06-03 12:43:32.524671 7ff1352b8700  0 log [WRN] : slow request 30.506952 seconds old, received at 2012-06-03 12:43:01.997386: osd_sub_op(osd.0.0:1842628 2.57 ea8d5657/label5_17606_object7068/head [push] v 191'628 snapset=0=[]:[] snapc=0=[]) v6 currently queued for pg
     -2> 2012-06-03 12:44:32.869852 7ff1352b8700  0 log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.073136 secs
     -1> 2012-06-03 12:44:32.869886 7ff1352b8700  0 log [WRN] : slow request 30.073136 seconds old, received at 2012-06-03 12:44:02.796651: osd_sub_op(osd.6.0:1837430 2.59 97e62059/rb.0.1.0000000a2cdf/head [push] v 1438'9416 snapset=0=[]:[] snapc=0=[]) v6 currently started
      0> 2012-06-03 12:55:33.088034 7ff1237f6700 -1 *** Caught signal (Aborted) **
  in thread 7ff1237f6700

  ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
  1: /usr/bin/ceph-osd() [0x708ea9]
  2: (()+0xeff0) [0x7ff13af2cff0]
  3: (gsignal()+0x35) [0x7ff13950b1b5]
  4: (abort()+0x180) [0x7ff13950dfc0]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7ff139d9fdc5]
  6: (()+0xcb166) [0x7ff139d9e166]
  7: (()+0xcb193) [0x7ff139d9e193]
  8: (()+0xcb28e) [0x7ff139d9e28e]
  9: (std::__throw_length_error(char const*)+0x67) [0x7ff139d39307]
  10: (std::string::_Rep::_S_create(unsigned long, unsigned long, 
std::allocator<char> const&)+0x72) [0x7ff139d7ab42]
  11: (()+0xa8565) [0x7ff139d7b565]
  12: (std::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::basic_string(char const*, unsigned long, 
std::allocator<char> const&)+0x1b) [0x7ff139d7b7ab]
  13: 
(leveldb::InternalKeyComparator::FindShortestSeparator(std::string*, 
leveldb::Slice const&) const+0x4d) [0x6ef69d]
  14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice 
const&)+0x9f) [0x6fdd9f]
  15: 
(leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)+0x4d3) 
[0x6eaba3]
  16: (leveldb::DBImpl::BackgroundCompaction()+0x222) [0x6ebb02]
  17: (leveldb::DBImpl::BackgroundCall()+0x68) [0x6ec378]
  18: /usr/bin/ceph-osd() [0x704981]
  19: (()+0x68ca) [0x7ff13af248ca]
  20: (clone()+0x6d) [0x7ff1395a892d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

Two OSDs exhibit similar traces.

---

Four others had traces like this one:

     -5> 2012-06-03 13:31:39.393489 7f74fd9c7700 -1 osd.3 1513 heartbeat_check: no reply from osd.5 since 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:19.393488)
     -4> 2012-06-03 13:31:40.393689 7f74fd9c7700 -1 osd.3 1513 heartbeat_check: no reply from osd.5 since 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:20.393687)
     -3> 2012-06-03 13:31:41.402873 7f74fd9c7700 -1 osd.3 1513 heartbeat_check: no reply from osd.5 since 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:21.402872)
     -2> 2012-06-03 13:31:42.363270 7f74f08ac700 -1 osd.3 1513 heartbeat_check: no reply from osd.5 since 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:22.363269)
     -1> 2012-06-03 13:31:42.416968 7f74fd9c7700 -1 osd.3 1513 heartbeat_check: no reply from osd.5 since 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:22.416966)
      0> 2012-06-03 13:36:48.147020 7f74f58b6700 -1 osd/PG.cc: In function 'void PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, int)' thread 7f74f58b6700 time 2012-06-03 13:36:48.100157
osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

  ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
  1: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, 
int)+0x1eae) [0x649cce]
  2: (PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec 
const&)+0x2b1) [0x649fc1]
  3: (boost::statechart::simple_state<PG::RecoveryState::Stray, 
PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na>, 
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base 
const&, void const*)+0x203) [0x660343]
  4: 
(boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, 
PG::RecoveryState::Initial, std::allocator<void>, 
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base 
const&)+0x6b) [0x6580eb]
  5: (PG::RecoveryState::handle_log(int, MOSDPGLog*, 
PG::RecoveryCtx*)+0x190) [0x6139d0]
  6: (OSD::handle_pg_log(std::tr1::shared_ptr<OpRequest>)+0x666) [0x5cec66]
  7: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0x11b) [0x5d312b]
  8: (OSD::_dispatch(Message*)+0x173) [0x5dc273]
  9: (OSD::ms_dispatch(Message*)+0x1e7) [0x5dcba7]
  10: (SimpleMessenger::dispatch_entry()+0x979) [0x7d60a9]
  11: (SimpleMessenger::DispatchThread::entry()+0xd) [0x72781d]
  12: (()+0x68ca) [0x7f75036338ca]
  13: (clone()+0x6d) [0x7f7501cb792d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

--- end dump of recent events ---
2012-06-03 13:36:48.487021 7f74f58b6700 -1 *** Caught signal (Aborted) **
  in thread 7f74f58b6700

  ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
  1: /usr/bin/ceph-osd() [0x708ea9]
  2: (()+0xeff0) [0x7f750363bff0]
  3: (gsignal()+0x35) [0x7f7501c1a1b5]
  4: (abort()+0x180) [0x7f7501c1cfc0]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f75024aedc5]
  6: (()+0xcb166) [0x7f75024ad166]
  7: (()+0xcb193) [0x7f75024ad193]
  8: (()+0xcb28e) [0x7f75024ad28e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x940) [0x77d550]
  10: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, 
int)+0x1eae) [0x649cce]
  11: (PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec 
const&)+0x2b1) [0x649fc1]
  12: (boost::statechart::simple_state<PG::RecoveryState::Stray, 
PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na>, 
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base 
const&, void const*)+0x203) [0x660343]
  13: 
(boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, 
PG::RecoveryState::Initial, std::allocator<void>, 
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base 
const&)+0x6b) [0x6580eb]
  14: (PG::RecoveryState::handle_log(int, MOSDPGLog*, 
PG::RecoveryCtx*)+0x190) [0x6139d0]
  15: (OSD::handle_pg_log(std::tr1::shared_ptr<OpRequest>)+0x666) [0x5cec66]
  16: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0x11b) [0x5d312b]
  17: (OSD::_dispatch(Message*)+0x173) [0x5dc273]
  18: (OSD::ms_dispatch(Message*)+0x1e7) [0x5dcba7]
  19: (SimpleMessenger::dispatch_entry()+0x979) [0x7d60a9]
  20: (SimpleMessenger::DispatchThread::entry()+0xd) [0x72781d]
  21: (()+0x68ca) [0x7f75036338ca]
  22: (clone()+0x6d) [0x7f7501cb792d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.
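
If I read the failed assert correctly, PG::merge_log is checking that the
local PG log and the incoming PG log overlap before merging them. A
simplified sketch of that invariant (hypothetical types, not the actual
Ceph code, which uses eversion_t rather than plain integers):

  // Each PG log covers a contiguous range of versions [tail, head].
  struct SimpleLog {
    long tail;  // oldest version still present in the log
    long head;  // newest version in the log
  };

  // Merging is only possible when the two ranges overlap; a gap between
  // them means one side is missing part of the history.
  bool logs_overlap(const SimpleLog& log, const SimpleLog& olog) {
    return log.head >= olog.tail && olog.head >= log.tail;
  }

So if two OSDs end up with logs that no longer overlap (for instance after
corruption), the assert fires during recovery.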

The root cause could be btrfs... or maybe not. I don't see any btrfs
crash or oops, just this:

Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479173] 
Pid: 16875, comm: kworker/7:0 Tainted: G        W    3.4.0-dsiun-120521 #108
Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479218] 
Call Trace:
Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479243] 
[<ffffffff81039f1b>] ? warn_slowpath_common+0x7b/0xc0
Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479278] 
[<ffffffffa026fca5>] ? btrfs_orphan_commit_root+0x105/0x110 [btrfs]
Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479328] 
[<ffffffffa026965a>] ? commit_fs_roots.isra.22+0xaa/0x170 [btrfs]
Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479379] 
[<ffffffffa02bc9a0>] ? btrfs_scrub_pause+0xf0/0x100 [btrfs]
Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479415] 
[<ffffffffa026a6f1>] ? btrfs_commit_transaction+0x521/0x9d0 [btrfs]
Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479460] 
[<ffffffff8105a9f0>] ? add_wait_queue+0x60/0x60
Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479493] 
[<ffffffffa026aba0>] ? btrfs_commit_transaction+0x9d0/0x9d0 [btrfs]
Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479543] 
[<ffffffffa026abb1>] ? do_async_commit+0x11/0x20 [btrfs]
Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479572] 
[<ffffffff810548a7>] ? process_one_work+0x107/0x460
Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479601] 
[<ffffffff81055a8e>] ? worker_thread+0x14e/0x330
Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479628] 
[<ffffffff81055940>] ? manage_workers.isra.28+0x210/0x210
Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479657] 
[<ffffffff8105a005>] ? kthread+0x85/0x90
Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479684] 
[<ffffffff813be3e4>] ? kernel_thread_helper+0x4/0x10
Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479712] 
[<ffffffff81059f80>] ? kthread_freezable_should_stop+0x60/0x60
Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479741] 
[<ffffffff813be3e0>] ? gs_change+0x13/0x13
Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479767] 
---[ end trace 303c47aab4b5d025 ]---
Jun  3 00:44:11 chichibu.u14.univ-nantes.prive kernel: [204497.711101] 
------------[ cut here ]------------

But this is just a warning (though maybe it could lead to a kernel
oops/crash later). It seems to have been fixed recently in git kernels.


I can provide all 8 OSD logs, plus the MDS and MON logs, if that helps.


Cheers,
--

Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr


* Re: domino-style OSD crash
  2012-06-04  8:44 domino-style OSD crash Yann Dupont
@ 2012-06-04 16:16 ` Tommi Virtanen
  2012-06-04 17:40   ` Sam Just
  0 siblings, 1 reply; 25+ messages in thread
From: Tommi Virtanen @ 2012-06-04 16:16 UTC (permalink / raw)
  To: Yann Dupont; +Cc: ceph-devel

On Mon, Jun 4, 2012 at 1:44 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> Results : Worked like a charm during two days, apart btrfs warn messages
> then OSD begin to crash 1 after all 'domino style'.

Sorry to hear that. Reading through your message, there seem to be
several problems; whether they are because of the same root cause, I
can't tell.

Quick triage to benefit the other devs:

#1: kernel crash, no details available
> 1 of the physical machine was in kernel oops state - Nothing was remote

#2: leveldb corruption? may be memory corruption that started
elsewhere.. Sam, does this look like the leveldb issue you saw?
>  [push] v 1438'9416 snapset=0=[]:[] snapc=0=[]) v6 currently started
>     0> 2012-06-03 12:55:33.088034 7ff1237f6700 -1 *** Caught signal
> (Aborted) **
...
>  13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
> leveldb::Slice const&) const+0x4d) [0x6ef69d]
>  14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice
> const&)+0x9f) [0x6fdd9f]

#3: PG::merge_log assertion while recovering from the above; Sam, any ideas?
>     0> 2012-06-03 13:36:48.147020 7f74f58b6700 -1 osd/PG.cc: In function
> 'void PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, int)'
> thread 7f74f58b6700 time 2012-06-03 13:36:48.100157
> osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >=
> log.tail)

#4: unknown btrfs warnings; there should be an actual message above this
traceback; believed fixed in the latest kernel
> Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479278]
> [<ffffffffa026fca5>] ? btrfs_orphan_commit_root+0x105/0x110 [btrfs]
> Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479328]
> [<ffffffffa026965a>] ? commit_fs_roots.isra.22+0xaa/0x170 [btrfs]
> Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479379]
> [<ffffffffa02bc9a0>] ? btrfs_scrub_pause+0xf0/0x100 [btrfs]
> Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479415]
> [<ffffffffa026a6f1>] ? btrfs_commit_transaction+0x521/0x9d0 [btrfs]
> Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479460]
> [<ffffffff8105a9f0>] ? add_wait_queue+0x60/0x60
> Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479493]
> [<ffffffffa026aba0>] ? btrfs_commit_transaction+0x9d0/0x9d0 [btrfs]
> Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479543]
> [<ffffffffa026abb1>] ? do_async_commit+0x11/0x20 [btrfs]
> Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479572]

* Re: domino-style OSD crash
  2012-06-04 16:16 ` Tommi Virtanen
@ 2012-06-04 17:40   ` Sam Just
  2012-06-04 18:34     ` Greg Farnum
  2012-07-03  8:40     ` Yann Dupont
  0 siblings, 2 replies; 25+ messages in thread
From: Sam Just @ 2012-06-04 17:40 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Yann Dupont, ceph-devel

Can you send the osd logs?  The merge_log crashes are probably fixable
if I can see the logs.

The leveldb crash is almost certainly a result of memory corruption.
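
The top frames of that first trace (std::__throw_length_error out of
std::string::_Rep::_S_create, reached from FindShortestSeparator) are what
you get when basic_string is handed a garbage length. A minimal standalone
illustration, plain C++ with nothing Ceph- or leveldb-specific:

  #include <cstddef>
  #include <iostream>
  #include <stdexcept>
  #include <string>

  int main() {
    const char* data = "key";          // pretend this came from an sstable block
    size_t corrupt_len = (size_t)-1;   // a corrupted length field
    try {
      // same constructor as frame 12 of the trace; lengths beyond
      // max_size() make libstdc++ throw std::length_error
      std::string s(data, corrupt_len);
    } catch (const std::length_error& e) {
      std::cerr << "length_error: " << e.what() << "\n";
    }
    return 0;
  }

In the OSD nothing catches that exception on leveldb's background
compaction thread, so terminate()/abort() runs instead, which is the
__verbose_terminate_handler and "Caught signal (Aborted)" in the log.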

Thanks
-Sam

On Mon, Jun 4, 2012 at 9:16 AM, Tommi Virtanen <tv@inktank.com> wrote:
> On Mon, Jun 4, 2012 at 1:44 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>> Results : Worked like a charm during two days, apart btrfs warn messages
>> then OSD begin to crash 1 after all 'domino style'.
>
> Sorry to hear that. Reading through your message, there seem to be
> several problems; whether they are because of the same root cause, I
> can't tell.
>
> Quick triage to benefit the other devs:
>
> #1: kernel crash, no details available
>> 1 of the physical machine was in kernel oops state - Nothing was remote
>
> #2: leveldb corruption? may be memory corruption that started
> elsewhere.. Sam, does this look like the leveldb issue you saw?
>>  [push] v 1438'9416 snapset=0=[]:[] snapc=0=[]) v6 currently started
>>     0> 2012-06-03 12:55:33.088034 7ff1237f6700 -1 *** Caught signal
>> (Aborted) **
> ...
>>  13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
>> leveldb::Slice const&) const+0x4d) [0x6ef69d]
>>  14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice
>> const&)+0x9f) [0x6fdd9f]
>
> #3: PG::merge_log assertion while recovering from the above; Sam, any ideas?
>>     0> 2012-06-03 13:36:48.147020 7f74f58b6700 -1 osd/PG.cc: In function
>> 'void PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, int)'
>> thread 7f74f58b6700 time 2012-06-03 13:36:48.100157
>> osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >=
>> log.tail)
>
> #4: unknown btrfs warnings, there should an actual message above this
> traceback; believed fixed in latest kernel
>> Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479278]
>> [<ffffffffa026fca5>] ? btrfs_orphan_commit_root+0x105/0x110 [btrfs]
>> Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479328]
>> [<ffffffffa026965a>] ? commit_fs_roots.isra.22+0xaa/0x170 [btrfs]
>> Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479379]
>> [<ffffffffa02bc9a0>] ? btrfs_scrub_pause+0xf0/0x100 [btrfs]
>> Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479415]
>> [<ffffffffa026a6f1>] ? btrfs_commit_transaction+0x521/0x9d0 [btrfs]
>> Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479460]
>> [<ffffffff8105a9f0>] ? add_wait_queue+0x60/0x60
>> Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479493]
>> [<ffffffffa026aba0>] ? btrfs_commit_transaction+0x9d0/0x9d0 [btrfs]
>> Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479543]
>> [<ffffffffa026abb1>] ? do_async_commit+0x11/0x20 [btrfs]
>> Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479572]

* Re: domino-style OSD crash
  2012-06-04 17:40   ` Sam Just
@ 2012-06-04 18:34     ` Greg Farnum
  2012-07-03  8:40     ` Yann Dupont
  1 sibling, 0 replies; 25+ messages in thread
From: Greg Farnum @ 2012-06-04 18:34 UTC (permalink / raw)
  To: Sam Just; +Cc: Tommi Virtanen, Yann Dupont, ceph-devel

This is probably the same/similar to http://tracker.newdream.net/issues/2462, no? There's a log there, though I've no idea how helpful it is.


On Monday, June 4, 2012 at 10:40 AM, Sam Just wrote:

> Can you send the osd logs? The merge_log crashes are probably fixable
> if I can see the logs.
> 
> The leveldb crash is almost certainly a result of memory corruption.
> 
> Thanks
> -Sam
> 
> On Mon, Jun 4, 2012 at 9:16 AM, Tommi Virtanen <tv@inktank.com (mailto:tv@inktank.com)> wrote:
> > On Mon, Jun 4, 2012 at 1:44 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr (mailto:Yann.Dupont@univ-nantes.fr)> wrote:
> > > Results : Worked like a charm during two days, apart btrfs warn messages
> > > then OSD begin to crash 1 after all 'domino style'.
> > 
> > 
> > 
> > Sorry to hear that. Reading through your message, there seem to be
> > several problems; whether they are because of the same root cause, I
> > can't tell.
> > 
> > Quick triage to benefit the other devs:
> > 
> > #1: kernel crash, no details available
> > > 1 of the physical machine was in kernel oops state - Nothing was remote
> > 
> > 
> > 
> > #2: leveldb corruption? may be memory corruption that started
> > elsewhere.. Sam, does this look like the leveldb issue you saw?
> > > [push] v 1438'9416 snapset=0=[]:[] snapc=0=[]) v6 currently started
> > > 0> 2012-06-03 12:55:33.088034 7ff1237f6700 -1 *** Caught signal
> > > (Aborted) **
> > 
> > 
> > ...
> > > 13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
> > > leveldb::Slice const&) const+0x4d) [0x6ef69d]
> > > 14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice
> > > const&)+0x9f) [0x6fdd9f]
> > 
> > 
> > 
> > #3: PG::merge_log assertion while recovering from the above; Sam, any ideas?
> > > 0> 2012-06-03 13:36:48.147020 7f74f58b6700 -1 osd/PG.cc (http://PG.cc): In function
> > > 'void PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, int)'
> > > thread 7f74f58b6700 time 2012-06-03 13:36:48.100157
> > > osd/PG.cc (http://PG.cc): 402: FAILED assert(log.head >= olog.tail && olog.head >=
> > > log.tail)
> > 
> > 
> > 
> > #4: unknown btrfs warnings, there should an actual message above this
> > traceback; believed fixed in latest kernel
> > > Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479278]
> > > [<ffffffffa026fca5>] ? btrfs_orphan_commit_root+0x105/0x110 [btrfs]
> > > Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479328]
> > > [<ffffffffa026965a>] ? commit_fs_roots.isra.22+0xaa/0x170 [btrfs]
> > > Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479379]
> > > [<ffffffffa02bc9a0>] ? btrfs_scrub_pause+0xf0/0x100 [btrfs]
> > > Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479415]
> > > [<ffffffffa026a6f1>] ? btrfs_commit_transaction+0x521/0x9d0 [btrfs]
> > > Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479460]
> > > [<ffffffff8105a9f0>] ? add_wait_queue+0x60/0x60
> > > Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479493]
> > > [<ffffffffa026aba0>] ? btrfs_commit_transaction+0x9d0/0x9d0 [btrfs]
> > > Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479543]
> > > [<ffffffffa026abb1>] ? do_async_commit+0x11/0x20 [btrfs]
> > > Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479572]
> > 
> > 

* Re: domino-style OSD crash
  2012-06-04 17:40   ` Sam Just
  2012-06-04 18:34     ` Greg Farnum
@ 2012-07-03  8:40     ` Yann Dupont
  2012-07-03 19:42       ` Tommi Virtanen
  1 sibling, 1 reply; 25+ messages in thread
From: Yann Dupont @ 2012-07-03  8:40 UTC (permalink / raw)
  To: Sam Just; +Cc: Tommi Virtanen, ceph-devel

On 04/06/2012 19:40, Sam Just wrote:
> Can you send the osd logs?  The merge_log crashes are probably fixable
> if I can see the logs.
>

Well, I'm sorry - as I said in a private mail, I was away from my
computer for a long time.
I can't send those logs anymore; they have been rotated away now...

Anyway. Now that I'm back, I'm trying to pick up where I stopped, and
have tried to restart the failed nodes.

I upgraded the kernel to 3.5.0-rc4 + some patches; btrfs seems OK right
now.

I tried to restart the OSDs with 0.47.3, then the next branch, and today
with 0.48.

4 of the 8 nodes fail with the same message:

ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
  1: /usr/bin/ceph-osd() [0x701929]
  2: (()+0xf030) [0x7fe5b4777030]
  3: (gsignal()+0x35) [0x7fe5b33fc4f5]
  4: (abort()+0x180) [0x7fe5b33ff770]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fe5b3c4f68d]
  6: (()+0x63796) [0x7fe5b3c4d796]
  7: (()+0x637c3) [0x7fe5b3c4d7c3]
  8: (()+0x639ee) [0x7fe5b3c4d9ee]
  9: (std::__throw_length_error(char const*)+0x5d) [0x7fe5b3c9f5ed]
  10: (()+0xbfad2) [0x7fe5b3ca9ad2]
  11: (char* std::string::_S_construct<char const*>(char const*, char 
const*, std::allocator<char> const&, std::forward_iterator_tag)+0x35) 
[0x7fe5b3cab4a5]
  12: (std::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::basic_string(char const*, unsigned long, 
std::allocator<char> const&)+0x1d) [0x7fe5b3cab5bd]
  13: 
(leveldb::InternalKeyComparator::FindShortestSeparator(std::string*, 
leveldb::Slice const&) const+0x4d) [0x6e811d]
  14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice 
const&)+0x9f) [0x6f681f]
  15: 
(leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)+0x4d3) 
[0x6e3643]
  16: (leveldb::DBImpl::BackgroundCompaction()+0x222) [0x6e45a2]
  17: (leveldb::DBImpl::BackgroundCall()+0x68) [0x6e4e18]
  18: /usr/bin/ceph-osd() [0x6fd401]
  19: (()+0x6b50) [0x7fe5b476eb50]
  20: (clone()+0x6d) [0x7fe5b34a278d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

ceph-osd is from the Debian package (64-bit).
I have a core dump, but I'm afraid it won't help much:

gdb /usr/bin/ceph-osd core
GNU gdb (GDB) 7.0.1-debian

....

Core was generated by `/usr/bin/ceph-osd -i 2 --pid-file 
/var/run/ceph/osd.2.pid -c /etc/ceph/ceph.con'.
Program terminated with signal 6, Aborted.
---Type <return> to continue, or q <return> to quit---
#0  0x00007fe5b4776efb in raise () from 
/lib/x86_64-linux-gnu/libpthread.so.0

This time I REALLY CAN (knock on wood) provide the logs & the core dump.

Granted, this crash was very probably caused by corruption on btrfs, but
it would be great if there were a way to recover the crashed OSD nodes.

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr


* Re: domino-style OSD crash
  2012-07-03  8:40     ` Yann Dupont
@ 2012-07-03 19:42       ` Tommi Virtanen
  2012-07-03 20:54         ` Yann Dupont
  0 siblings, 1 reply; 25+ messages in thread
From: Tommi Virtanen @ 2012-07-03 19:42 UTC (permalink / raw)
  To: Yann Dupont; +Cc: Sam Just, ceph-devel

On Tue, Jul 3, 2012 at 1:40 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> Upgraded the kernel to 3.5.0-rc4 + some patches, seems btrfs is OK right
> now.
>
> Tried to restart osd with 0.47.3, then next branch, and today with 0.48.
>
> 4 of 8 nodes fails with the same message :
>
> ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
>  1: /usr/bin/ceph-osd() [0x701929]
...
>  13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
> leveldb::Slice const&) const+0x4d) [0x6e811d]

That looks like http://tracker.newdream.net/issues/2563 and the best
we have for that ticket is "looks like you have a corrupted leveldb
file". Is this reproducible with a freshly mkfs'ed data partition?

* Re: domino-style OSD crash
  2012-07-03 19:42       ` Tommi Virtanen
@ 2012-07-03 20:54         ` Yann Dupont
  2012-07-03 21:38           ` Tommi Virtanen
  0 siblings, 1 reply; 25+ messages in thread
From: Yann Dupont @ 2012-07-03 20:54 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Sam Just, ceph-devel

On 03/07/2012 21:42, Tommi Virtanen wrote:
> On Tue, Jul 3, 2012 at 1:40 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>> Upgraded the kernel to 3.5.0-rc4 + some patches, seems btrfs is OK right
>> now.
>>
>> Tried to restart osd with 0.47.3, then next branch, and today with 0.48.
>>
>> 4 of 8 nodes fails with the same message :
>>
>> ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
>>   1: /usr/bin/ceph-osd() [0x701929]
> ...
>>   13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
>> leveldb::Slice const&) const+0x4d) [0x6e811d]
> That looks like http://tracker.newdream.net/issues/2563 and the best
> we have for that ticket is "looks like you have a corrupted leveldb
> file". Is this reproducible with a freshly mkfs'ed data partition?
Probably not. I have multiple data volumes on each node (I was planning
xfs vs ext4 vs btrfs benchmarks before falling ill), and those nodes
start OK with another data partition.

It's very probable that there is corruption somewhere, due to a kernel
bug, probably triggered by btrfs.

Issue 2563 is probably the same.

I'd like to restart those nodes without reformatting them, not because
the data is valuable, but because if the same thing happens in production,
a method similar to an "fsck" of the node would be of great value.

I saw the method to check the leveldb; I will try it tomorrow, no guarantees.
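
For reference, what I have in mind is roughly the stock leveldb API, which
ships its own repair entry point. An untested sketch (the path would be
whatever leveldb directory the OSD keeps under its data dir):

  // leveldb-check.cc -- untested sketch: open a leveldb directory with
  // paranoid checks, and fall back to RepairDB if it will not open.
  #include <iostream>
  #include <string>
  #include "leveldb/db.h"

  int main(int argc, char** argv) {
    if (argc != 2) {
      std::cerr << "usage: " << argv[0] << " <leveldb-dir>" << std::endl;
      return 1;
    }
    leveldb::Options opts;
    opts.paranoid_checks = true;            // fail loudly on any corruption

    leveldb::DB* db = 0;
    leveldb::Status s = leveldb::DB::Open(opts, argv[1], &db);
    if (s.ok()) {
      std::cout << "db opens cleanly" << std::endl;
      delete db;
      return 0;
    }
    std::cerr << "open failed: " << s.ToString() << std::endl;

    // RepairDB salvages what it can; losing some records is possible.
    s = leveldb::RepairDB(argv[1], leveldb::Options());
    std::cerr << (s.ok() ? std::string("repair completed")
                         : "repair failed: " + s.ToString()) << std::endl;
    return s.ok() ? 0 : 2;
  }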

In case I manage to repair it, do you think the crashed FS as it is right
now would be valuable to you, for future reference, since I saw you can't
reproduce the problem? I can make an archive (or a btrfs dump?), but it
will be quite big.

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr


* Re: domino-style OSD crash
  2012-07-03 20:54         ` Yann Dupont
@ 2012-07-03 21:38           ` Tommi Virtanen
  2012-07-04  8:06             ` Yann Dupont
  0 siblings, 1 reply; 25+ messages in thread
From: Tommi Virtanen @ 2012-07-03 21:38 UTC (permalink / raw)
  To: Yann Dupont; +Cc: Sam Just, ceph-devel

On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> In the case I could repair, do you think a crashed FS as it is right now is
> valuable for you, for future reference , as I saw you can't reproduce the
> problem ? I can make an archive (or a btrfs dump ?), but it will be quite
> big.

At this point, it's more about the upstream developers (of btrfs etc)
than us; we're on good terms with them but not experts on the on-disk
format(s). You might want to send an email to the relevant mailing
lists before wiping the disks.


* Re: domino-style OSD crash
  2012-07-03 21:38           ` Tommi Virtanen
@ 2012-07-04  8:06             ` Yann Dupont
  2012-07-04 16:21               ` Gregory Farnum
  2012-07-09 17:43               ` Tommi Virtanen
  0 siblings, 2 replies; 25+ messages in thread
From: Yann Dupont @ 2012-07-04  8:06 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Sam Just, ceph-devel

On 03/07/2012 23:38, Tommi Virtanen wrote:
> On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>> In the case I could repair, do you think a crashed FS as it is right now is
>> valuable for you, for future reference , as I saw you can't reproduce the
>> problem ? I can make an archive (or a btrfs dump ?), but it will be quite
>> big.
> At this point, it's more about the upstream developers (of btrfs etc)
> than us; we're on good terms with them but not experts on the on-disk
> format(s). You might want to send an email to the relevant mailing
> lists before wiping the disks.
>
>
Well, I probably wasn't clear enough. I said "crashed FS", but I was
talking about Ceph. The underlying FS (btrfs in this case) of one node
(and only one) has PROBABLY crashed in the past, causing corruption in
the Ceph data on that node, and then the subsequent crash of the other
nodes.

RIGHT now, btrfs on this node is OK. I can access the filesystem without
errors.

At the moment, 4 of the 8 nodes refuse to restart.
One of the 4 was the crashed node; the 3 others didn't have problems with
the underlying FS as far as I can tell.

So I think the scenario is:

One node had a problem with btrfs, leading first to a kernel problem,
probably corruption (on disk, or in memory maybe?), and ultimately to a
kernel oops. Before that final kernel oops, bad data was transmitted to
the other (sane) nodes, leading to ceph-osd crashes on those nodes.

If you think this scenario is highly improbable in real life (that is,
btrfs will probably be fixed for good, and then corruption can't happen),
that's OK.

But I wonder whether this scenario could be triggered by other problems
(power outage, out-of-memory condition, disk full, for example), with bad
data transmitted to other sane nodes.

That's why I offered you a crashed Ceph volume image (I shouldn't have
said "crashed FS", sorry for the confusion).

Talking about btrfs, there are a lot of fixes in btrfs between 3.4 and
3.5-rc. After the crash, I couldn't mount the btrfs volume. With 3.5-rc I
can, and there is no sign of a problem on it. That doesn't mean the data
there is safe, but I think it's a sign that at least some bugs have been
fixed in the btrfs code.

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr


* Re: domino-style OSD crash
  2012-07-04  8:06             ` Yann Dupont
@ 2012-07-04 16:21               ` Gregory Farnum
  2012-07-04 17:53                 ` Yann Dupont
  2012-07-09 17:43               ` Tommi Virtanen
  1 sibling, 1 reply; 25+ messages in thread
From: Gregory Farnum @ 2012-07-04 16:21 UTC (permalink / raw)
  To: Yann Dupont; +Cc: Tommi Virtanen, Sam Just, ceph-devel

On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:
> Le 03/07/2012 23:38, Tommi Virtanen a écrit :
> > On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont <Yann.Dupont@univ-nantes.fr (mailto:Yann.Dupont@univ-nantes.fr)> wrote:
> > > In the case I could repair, do you think a crashed FS as it is right now is
> > > valuable for you, for future reference , as I saw you can't reproduce the
> > > problem ? I can make an archive (or a btrfs dump ?), but it will be quite
> > > big.
> >  
> >  
> > At this point, it's more about the upstream developers (of btrfs etc)
> > than us; we're on good terms with them but not experts on the on-disk
> > format(s). You might want to send an email to the relevant mailing
> > lists before wiping the disks.
>  
>  
> Well, I probably wasn't clear enough. I talked about crashed FS, but i  
> was talking about ceph. The underlying FS (btrfs in that case) of 1 node  
> (and only one) has PROBABLY crashed in the past, causing corruption in  
> ceph data on this node, and then the subsequent crash of other nodes.
>  
> RIGHT now btrfs on this node is OK. I can access the filesystem without  
> errors.
>  
> For the moment, on 8 nodes, 4 refuse to restart .
> 1 of the 4 nodes was the crashed node , the 3 others didn't had broblem  
> with the underlying fs as far as I can tell.
>  
> So I think the scenario is :
>  
> One node had problem with btrfs, leading first to kernel problem ,  
> probably corruption (in disk/ in memory maybe ?) ,and ultimately to a  
> kernel oops. Before that ultimate kernel oops, bad data has been  
> transmitted to other (sane) nodes, leading to ceph-osd crash on thoses  
> nodes.

I don't think that's actually possible — the OSDs all do quite a lot of interpretation between what they get off the wire and what goes on disk. What you've got here are 4 corrupted LevelDB databases, and we pretty much can't do that through the interfaces we have. :/
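
To illustrate the point with a schematic sketch (not the real OSD code,
which goes through ceph::bufferlist and per-type decode() methods, but the
shape of the protection is the same): everything that arrives over the
wire is decoded into typed structures and validated before anything is
handed to the object store, so garbage from a sick peer tends to fail
decoding rather than end up inside leveldb.

  // Schematic only: reject malformed frames before they reach storage.
  #include <cstdint>
  #include <cstring>
  #include <stdexcept>
  #include <vector>

  struct WireOp {
    uint32_t len;
    std::vector<uint8_t> payload;
  };

  WireOp decode_op(const uint8_t* buf, size_t buflen) {
    WireOp op;
    if (buflen < sizeof(op.len))
      throw std::runtime_error("short message");         // dropped, never persisted
    std::memcpy(&op.len, buf, sizeof(op.len));
    if (op.len > buflen - sizeof(op.len))
      throw std::runtime_error("bad length in header");  // corrupted frame
    op.payload.assign(buf + sizeof(op.len),
                      buf + sizeof(op.len) + op.len);
    return op;  // only a fully decoded op goes on to be applied/stored
  }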
  
>  
> If you think this scenario is highly improbable in real life (that is,  
> btrfs will probably be fixed for good, and then, corruption can't  
> happen), it's ok.
>  
> But I wonder if this scenario can be triggered with other problem, and  
> bad data can be transmitted to other sane nodes (power outage, out of  
> memory condition, disk full... for example)
>  
> That's why I proposed you a crashed ceph volume image (I shouldn't have  
> talked about a crashed fs, sorry for the confusion)

I appreciate the offer, but I don't think this will help much — it's a disk state managed by somebody else, not our logical state, which has broken. If we could figure out how that state got broken that'd be good, but a "ceph image" won't really help in doing so.

I wonder if maybe there's a confounding factor here — are all your nodes similar to each other, or are they running on different kinds of hardware? How did you do your Ceph upgrades? What does ceph -s display when the cluster is running as best it can?
-Greg


* Re: domino-style OSD crash
  2012-07-04 16:21               ` Gregory Farnum
@ 2012-07-04 17:53                 ` Yann Dupont
  2012-07-05 21:32                   ` Gregory Farnum
  0 siblings, 1 reply; 25+ messages in thread
From: Yann Dupont @ 2012-07-04 17:53 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Tommi Virtanen, Sam Just, ceph-devel

On 04/07/2012 18:21, Gregory Farnum wrote:
> On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:
>> Le 03/07/2012 23:38, Tommi Virtanen a écrit :
>>> On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont <Yann.Dupont@univ-nantes.fr (mailto:Yann.Dupont@univ-nantes.fr)> wrote:
>>>> In the case I could repair, do you think a crashed FS as it is right now is
>>>> valuable for you, for future reference , as I saw you can't reproduce the
>>>> problem ? I can make an archive (or a btrfs dump ?), but it will be quite
>>>> big.
>>>   
>>>   
>>> At this point, it's more about the upstream developers (of btrfs etc)
>>> than us; we're on good terms with them but not experts on the on-disk
>>> format(s). You might want to send an email to the relevant mailing
>>> lists before wiping the disks.
>>   
>>   
>> Well, I probably wasn't clear enough. I talked about crashed FS, but i
>> was talking about ceph. The underlying FS (btrfs in that case) of 1 node
>> (and only one) has PROBABLY crashed in the past, causing corruption in
>> ceph data on this node, and then the subsequent crash of other nodes.
>>   
>> RIGHT now btrfs on this node is OK. I can access the filesystem without
>> errors.
>>   
>> For the moment, on 8 nodes, 4 refuse to restart .
>> 1 of the 4 nodes was the crashed node , the 3 others didn't had broblem
>> with the underlying fs as far as I can tell.
>>   
>> So I think the scenario is :
>>   
>> One node had problem with btrfs, leading first to kernel problem ,
>> probably corruption (in disk/ in memory maybe ?) ,and ultimately to a
>> kernel oops. Before that ultimate kernel oops, bad data has been
>> transmitted to other (sane) nodes, leading to ceph-osd crash on thoses
>> nodes.
> I don't think that's actually possible — the OSDs all do quite a lot of interpretation between what they get off the wire and what goes on disk. What you've got here are 4 corrupted LevelDB databases, and we pretty much can't do that through the interfaces we have. :/

OK, so as all the nodes were identical, I probably hit a btrfs bug (like
an erroneous out-of-space condition) at more or less the same time. And
when 1 osd was out,
>    
>>   
>> If you think this scenario is highly improbable in real life (that is,
>> btrfs will probably be fixed for good, and then, corruption can't
>> happen), it's ok.
>>   
>> But I wonder if this scenario can be triggered with other problem, and
>> bad data can be transmitted to other sane nodes (power outage, out of
>> memory condition, disk full... for example)
>>   
>> That's why I proposed you a crashed ceph volume image (I shouldn't have
>> talked about a crashed fs, sorry for the confusion)
> I appreciate the offer, but I don't think this will help much — it's a disk state managed by somebody else, not our logical state, which has broken. If we could figure out how that state got broken that'd be good, but a "ceph image" won't really help in doing so.
OK, no problem. I'll restart from scratch, freshly formatted.
>
> I wonder if maybe there's a confounding factor here — are all your nodes similar to each other,

Yes. I designed the cluster that way. All nodes are identical hardware
(PowerEdge M610, 10G Intel Ethernet + Emulex Fibre Channel attached to
storage; 1 array for 2 OSD nodes, 1 controller dedicated to each OSD).

>   or are they running on different kinds of hardware? How did you do your Ceph upgrades? What's ceph -s display when the cluster is running as best it can?

Ceph was running 0.47.2 at that time (the Debian package for Ceph). After
the crash I couldn't restart all the nodes. I tried 0.47.3 and now 0.48,
without success.

Nothing particular for the upgrades: since Ceph is broken at the moment,
it was just an apt-get upgrade to the new version.


ceph -s shows this:

root@label5:~# ceph -s
    health HEALTH_WARN 260 pgs degraded; 793 pgs down; 785 pgs peering; 
32 pgs recovering; 96 pgs stale; 793 pgs stuck inactive; 96 pgs stuck 
stale; 1092 pgs stuck unclean; recovery 267286/2491140 degraded 
(10.729%); 1814/1245570 unfound (0.146%)
    monmap e1: 3 mons at 
{chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0}, 
election epoch 12, quorum 0,1,2 chichibu,glenesk,karuizawa
    osdmap e2404: 8 osds: 3 up, 3 in
     pgmap v173701: 1728 pgs: 604 active+clean, 8 down, 5 
active+recovering+remapped, 32 active+clean+replay, 11 
active+recovering+degraded, 25 active+remapped, 710 down+peering, 222 
active+degraded, 7 stale+active+recovering+degraded, 61 
stale+down+peering, 20 stale+active+degraded, 6 down+remapped+peering, 8 
stale+down+remapped+peering, 9 active+recovering; 4786 GB data, 7495 GB 
used, 7280 GB / 15360 GB avail; 267286/2491140 degraded (10.729%); 
1814/1245570 unfound (0.146%)
    mdsmap e172: 1/1/1 up {0=karuizawa=up:replay}, 2 up:standby



BTW, after the 0.48 upgrade there was a disk format conversion. One of
the 4 surviving OSDs didn't complete it:

2012-07-04 10:13:27.291541 7f8711099780 -1 filestore(/CEPH/data/osd.1) 
FileStore::mount : stale version stamp detected: 2. Proceeding, 
do_update is set, performing disk format upgrade.
2012-07-04 10:13:27.291618 7f8711099780  0 filestore(/CEPH/data/osd.1) 
mount found snaps <3744666,3746725>

Then nothing happens for hours; iotop shows constant disk usage:

  6069 be/4 root        0.00 B/s   32.09 M/s  0.00 % 19.08 % ceph-osd -i 
1 --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf

strace shows lots of syscalls like this:

[pid  6069] pread(25, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4101, 
94950) = 4101
[pid  6069] pread(23, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4107, 
49678) = 4107
[pid  6069] pread(36, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4110, 
99797) = 4110
[pid  6069] pread(37, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4105, 
8211) = 4105
[pid  6069] pread(25, "\0C\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4121, 
99051) = 4121
[pid  6069] pread(36, "\0E\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4173, 
103907) = 4173
[pid  6069] pread(37, "\0E\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4169, 
12316) = 4169
[pid  6069] pread(37, "\0B\0_LEAF_\0002%e183%uTEMP.rb%e0%e1%"..., 4130, 
16485) = 4130
[pid  6069] pread(36, "\0B\0_LEAF_\0002%e183%uTEMP.rb%e0%e1%"..., 4129, 
108080) = 4129


It seems to loop indefinitely.

But that's another problem, I guess, maybe a consequence of the other problems.

Cheers.

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr


* Re: domino-style OSD crash
  2012-07-04 17:53                 ` Yann Dupont
@ 2012-07-05 21:32                   ` Gregory Farnum
  2012-07-06  7:19                     ` Yann Dupont
  0 siblings, 1 reply; 25+ messages in thread
From: Gregory Farnum @ 2012-07-05 21:32 UTC (permalink / raw)
  To: Yann Dupont, Sam Just; +Cc: ceph-devel

On Wed, Jul 4, 2012 at 10:53 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> Le 04/07/2012 18:21, Gregory Farnum a écrit :
>
>> On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:
>>>
>>> Le 03/07/2012 23:38, Tommi Virtanen a écrit :
>>>>
>>>> On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont <Yann.Dupont@univ-nantes.fr
>>>> (mailto:Yann.Dupont@univ-nantes.fr)> wrote:
>>>>>
>>>>> In the case I could repair, do you think a crashed FS as it is right
>>>>> now is
>>>>> valuable for you, for future reference , as I saw you can't reproduce
>>>>> the
>>>>> problem ? I can make an archive (or a btrfs dump ?), but it will be
>>>>> quite
>>>>> big.
>>>>
>>>>     At this point, it's more about the upstream developers (of btrfs
>>>> etc)
>>>> than us; we're on good terms with them but not experts on the on-disk
>>>> format(s). You might want to send an email to the relevant mailing
>>>> lists before wiping the disks.
>>>
>>>     Well, I probably wasn't clear enough. I talked about crashed FS, but
>>> i
>>> was talking about ceph. The underlying FS (btrfs in that case) of 1 node
>>> (and only one) has PROBABLY crashed in the past, causing corruption in
>>> ceph data on this node, and then the subsequent crash of other nodes.
>>>   RIGHT now btrfs on this node is OK. I can access the filesystem without
>>> errors.
>>>   For the moment, on 8 nodes, 4 refuse to restart .
>>> 1 of the 4 nodes was the crashed node , the 3 others didn't had broblem
>>> with the underlying fs as far as I can tell.
>>>   So I think the scenario is :
>>>   One node had problem with btrfs, leading first to kernel problem ,
>>> probably corruption (in disk/ in memory maybe ?) ,and ultimately to a
>>> kernel oops. Before that ultimate kernel oops, bad data has been
>>> transmitted to other (sane) nodes, leading to ceph-osd crash on thoses
>>> nodes.
>>
>> I don't think that's actually possible — the OSDs all do quite a lot of
>> interpretation between what they get off the wire and what goes on disk.
>> What you've got here are 4 corrupted LevelDB databases, and we pretty much
>> can't do that through the interfaces we have. :/
>
>
> ok, so as all nodes were identical, I probably have hit a btrfs bug (like a
> erroneous out of space ) in more or less the same time. And when 1 osd was
> out,
>
>>
>>>
>>>   If you think this scenario is highly improbable in real life (that is,
>>> btrfs will probably be fixed for good, and then, corruption can't
>>> happen), it's ok.
>>>   But I wonder if this scenario can be triggered with other problem, and
>>> bad data can be transmitted to other sane nodes (power outage, out of
>>> memory condition, disk full... for example)
>>>   That's why I proposed you a crashed ceph volume image (I shouldn't have
>>> talked about a crashed fs, sorry for the confusion)
>>
>> I appreciate the offer, but I don't think this will help much — it's a
>> disk state managed by somebody else, not our logical state, which has
>> broken. If we could figure out how that state got broken that'd be good, but
>> a "ceph image" won't really help in doing so.
>
> ok, no problem. I'll restart from scratch, freshly formated.
>
>>
>> I wonder if maybe there's a confounding factor here — are all your nodes
>> similar to each other,
>
>
> Yes. I designed the cluster that way. All nodes are identical hardware
> (powerEdge M610, 10G intel ethernet + emulex fibre channel attached to
> storage (1 Array for 2 OSD nodes, 1 controller dedicated for each OSD)

Oh, interesting. Are the broken nodes all on the same set of arrays?


>
>
>>   or are they running on different kinds of hardware? How did you do your
>> Ceph upgrades? What's ceph -s display when the cluster is running as best it
>> can?
>
>
> Ceph was running 0.47.2 at that time - (debian package for ceph). After the
> crash I couldn't restart all the nodes. Tried 0.47.3 and now 0.48 without
> success.
>
> Nothing particular for upgrades, because for the moment ceph is broken, so
> just apt-get upgrade with new version.
>
>
> ceph -s show that :
>
> root@label5:~# ceph -s
>    health HEALTH_WARN 260 pgs degraded; 793 pgs down; 785 pgs peering; 32
> pgs recovering; 96 pgs stale; 793 pgs stuck inactive; 96 pgs stuck stale;
> 1092 pgs stuck unclean; recovery 267286/2491140 degraded (10.729%);
> 1814/1245570 unfound (0.146%)
>    monmap e1: 3 mons at
> {chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0},
> election epoch 12, quorum 0,1,2 chichibu,glenesk,karuizawa
>    osdmap e2404: 8 osds: 3 up, 3 in
>     pgmap v173701: 1728 pgs: 604 active+clean, 8 down, 5
> active+recovering+remapped, 32 active+clean+replay, 11
> active+recovering+degraded, 25 active+remapped, 710 down+peering, 222
> active+degraded, 7 stale+active+recovering+degraded, 61 stale+down+peering,
> 20 stale+active+degraded, 6 down+remapped+peering, 8
> stale+down+remapped+peering, 9 active+recovering; 4786 GB data, 7495 GB
> used, 7280 GB / 15360 GB avail; 267286/2491140 degraded (10.729%);
> 1814/1245570 unfound (0.146%)
>    mdsmap e172: 1/1/1 up {0=karuizawa=up:replay}, 2 up:standby

Okay, that looks about how I'd expect if half your OSDs are down.

>
>
>
> BTW, After the 0.48 upgrade, there was a disk format conversion. 1 of the 4
> surviving OSD didn't complete :
>
> 2012-07-04 10:13:27.291541 7f8711099780 -1 filestore(/CEPH/data/osd.1)
> FileStore::mount : stale version stamp detected: 2. Proceeding, do_update is
> set, performing disk format upgrade.
> 2012-07-04 10:13:27.291618 7f8711099780  0 filestore(/CEPH/data/osd.1) mount
> found snaps <3744666,3746725>
>
> then , nothing happens for hours, iotop show constant disk usage :

>  6069 be/4 root        0.00 B/s   32.09 M/s  0.00 % 19.08 % ceph-osd -i 1
> --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf
>
> strace show lots of syscall like this :
>
> [pid  6069] pread(25, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4101,
> 94950) = 4101
> [pid  6069] pread(23, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4107,
> 49678) = 4107
> [pid  6069] pread(36, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4110,
> 99797) = 4110
> [pid  6069] pread(37, "\0EB_LEAF_\0002%e183%uTEMP.label5%u2"..., 4105, 8211)
> = 4105
> [pid  6069] pread(25, "\0C\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4121,
> 99051) = 4121
> [pid  6069] pread(36, "\0E\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4173,
> 103907) = 4173
> [pid  6069] pread(37, "\0E\0_LEAF_\0002%e183%uTEMP.label5%u3"..., 4169,
> 12316) = 4169
> [pid  6069] pread(37, "\0B\0_LEAF_\0002%e183%uTEMP.rb%e0%e1%"..., 4130,
> 16485) = 4130
> [pid  6069] pread(36, "\0B\0_LEAF_\0002%e183%uTEMP.rb%e0%e1%"..., 4129,
> 108080) = 4129

Sam, does this look like something of ours to you?

* Re: domino-style OSD crash
  2012-07-05 21:32                   ` Gregory Farnum
@ 2012-07-06  7:19                     ` Yann Dupont
  2012-07-06 17:01                       ` Gregory Farnum
  0 siblings, 1 reply; 25+ messages in thread
From: Yann Dupont @ 2012-07-06  7:19 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sam Just, ceph-devel

On 05/07/2012 23:32, Gregory Farnum wrote:

[...]
>> ok, so as all nodes were identical, I probably have hit a btrfs bug (like a
>> erroneous out of space ) in more or less the same time. And when 1 osd was
>> out,

Oh, I didn't finish the sentence... When 1 OSD was out, the missing data
was copied onto other nodes, probably accelerating the btrfs problem on
those nodes (I suspect erroneous out-of-space conditions).

I've reformatted the OSDs with xfs. Performance is slightly worse for the
moment (well, it depends on the workload, and maybe the lack of syncfs is
to blame), but at least I hope to have the storage layer rock-solid. BTW,
I've managed to keep the faulty btrfs volumes.

[...]

>>> I wonder if maybe there's a confounding factor here — are all your nodes
>>> similar to each other,
>> Yes. I designed the cluster that way. All nodes are identical hardware
>> (powerEdge M610, 10G intel ethernet + emulex fibre channel attached to
>> storage (1 Array for 2 OSD nodes, 1 controller dedicated for each OSD)
> Oh, interesting. Are the broken nodes all on the same set of arrays?

No. There are 4 completely independent RAID arrays, in 4 different
locations. They are similar (same brand & model, but slightly different
disks, and 1 different firmware), and all the arrays are multipathed. I
don't think the RAID arrays are the problem. We have been using those
particular models for 2-3 years, and in the logs I don't see any problem
that could be caused by the storage itself (like SCSI or multipath errors).

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr


* Re: domino-style OSD crash
  2012-07-06  7:19                     ` Yann Dupont
@ 2012-07-06 17:01                       ` Gregory Farnum
  2012-07-07  8:19                         ` Yann Dupont
  0 siblings, 1 reply; 25+ messages in thread
From: Gregory Farnum @ 2012-07-06 17:01 UTC (permalink / raw)
  To: Yann Dupont; +Cc: Sam Just, ceph-devel

On Fri, Jul 6, 2012 at 12:19 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> Le 05/07/2012 23:32, Gregory Farnum a écrit :
>
> [...]
>
>>> ok, so as all nodes were identical, I probably have hit a btrfs bug (like
>>> a
>>> erroneous out of space ) in more or less the same time. And when 1 osd
>>> was
>>> out,
>
>
> OH , I didn't finish the sentence... When 1 osd was out, missing data was
> copied on another nodes, probably speeding btrfs problem on those nodes (I
> suspect erroneous out of space conditions)

Ah. How full are/were the disks?

>
> I've reformatted OSD with xfs. Performance is slightly worse for the moment
> (well, depend on the workload, and maybe lack of syncfs is to blame), but at
> least I hope to have the storage layer rock-solid. BTW, I've managed to keep
> the faulty btrfs volumes .
>
> [...]
>
>
>>>> I wonder if maybe there's a confounding factor here — are all your nodes
>>>> similar to each other,
>>>
>>> Yes. I designed the cluster that way. All nodes are identical hardware
>>> (powerEdge M610, 10G intel ethernet + emulex fibre channel attached to
>>> storage (1 Array for 2 OSD nodes, 1 controller dedicated for each OSD)
>>
>> Oh, interesting. Are the broken nodes all on the same set of arrays?
>
>
> No. There are 4 completely independant raid arrays, in 4 different
> locations. They are similar (same brand & model, but slighltly different
> disks, and 1 different firmware), all arrays are multipathed. I don't think
> the raid array is the problem. We use those particular models since 2/3
> years, and in the logs I don't see any problem that can be caused by the
> storage itself (like scsi or multipath errors)

I must have misunderstood then. What did you mean by "1 Array for 2 OSD nodes"?

* Re: domino-style OSD crash
  2012-07-06 17:01                       ` Gregory Farnum
@ 2012-07-07  8:19                         ` Yann Dupont
  2012-07-09 17:14                           ` Samuel Just
  0 siblings, 1 reply; 25+ messages in thread
From: Yann Dupont @ 2012-07-07  8:19 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sam Just, ceph-devel

On 06/07/2012 19:01, Gregory Farnum wrote:
> On Fri, Jul 6, 2012 at 12:19 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>> On 05/07/2012 23:32, Gregory Farnum wrote:
>>
>> [...]
>>
>>>> ok, so as all nodes were identical, I have probably hit a btrfs bug
>>>> (like an erroneous out-of-space condition) at more or less the same
>>>> time. And when 1 osd was out,
>>
>> Oh, I didn't finish the sentence... When 1 osd was out, missing data was
>> copied to other nodes, probably accelerating the btrfs problems on those
>> nodes (I suspect erroneous out-of-space conditions).
> Ah. How full are/were the disks?

The OSD nodes were mostly below 50% full (all are 5 TB volumes):

osd.0 : 31%
osd.1 : 31%
osd.2 : 39%
osd.3 : 65%
no osd.4 :)
osd.5 : 35%
osd.6 : 60%
osd.7 : 42%
osd.8 : 34%

All the volumes were using btrfs with lzo compression.
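
For reference, lzo compression on btrfs is a mount-time option; a minimal
sketch, with a hypothetical device path and mount point:

    # mount a btrfs OSD volume with LZO compression (names are examples)
    mount -t btrfs -o compress=lzo /dev/vg0/osd0 /CEPH/data/osd.0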

[...]
>
> Oh, interesting. Are the broken nodes all on the same set of arrays?
>>
>> No. There are 4 completely independent RAID arrays, in 4 different
>> locations. They are similar (same brand and model, but slightly different
>> disks and one different firmware), and all arrays are multipathed. I don't
>> think the RAID arrays are the problem. We have been using these particular
>> models for 2-3 years, and in the logs I don't see any problem that could be
>> caused by the storage itself (like SCSI or multipath errors).
> I must have misunderstood then. What did you mean by "1 Array for 2 OSD nodes"?

I have 8 osd nodes, in 4 different locations (several km apart). In each 
location I have 2 nodes and 1 RAID array.
In each location, the RAID array has 16 2 TB disks and 2 controllers with 
4x 8 Gb FC channels each. The 16 disks are organized in RAID 5 (8 disks 
for one set, 7 disks for the other). Each RAID set is primarily attached 
to 1 controller, and each osd node at the location has access to the 
controller over 2 distinct paths.

There was no correlation between failed nodes and RAID arrays.

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: domino-style OSD crash
  2012-07-07  8:19                         ` Yann Dupont
@ 2012-07-09 17:14                           ` Samuel Just
  2012-07-10  9:46                             ` Yann Dupont
  0 siblings, 1 reply; 25+ messages in thread
From: Samuel Just @ 2012-07-09 17:14 UTC (permalink / raw)
  To: Yann Dupont; +Cc: Gregory Farnum, ceph-devel

Can you restart the node that failed to complete the upgrade with

debug filestore = 20
debug osd = 20

and post the log after an hour or so of running?  The upgrade process
might legitimately take a while.
-Sam
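
For reference, a minimal sketch of where these settings could live in
ceph.conf; placing them under [osd] is an assumption, [global] or a
per-daemon [osd.N] section should be picked up as well:

    [osd]
        # verbose logging for the FileStore and OSD subsystems
        debug filestore = 20
        debug osd = 20

The same options can normally also be passed on the daemon command line
(e.g. ceph-osd -i 1 --debug-filestore 20 --debug-osd 20) if editing the
config file is inconvenient; the osd id "1" here is only an example.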

On Sat, Jul 7, 2012 at 1:19 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> On 06/07/2012 19:01, Gregory Farnum wrote:
>
>> On Fri, Jul 6, 2012 at 12:19 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr>
>> wrote:
>>>
>>> On 05/07/2012 23:32, Gregory Farnum wrote:
>>>
>>> [...]
>>>
>>>>> ok, so as all nodes were identical, I have probably hit a btrfs bug
>>>>> (like an erroneous out-of-space condition) at more or less the same
>>>>> time. And when 1 osd was out,
>>>
>>>
>>> Oh, I didn't finish the sentence... When 1 osd was out, missing data was
>>> copied to other nodes, probably accelerating the btrfs problems on those
>>> nodes (I suspect erroneous out-of-space conditions).
>>
>> Ah. How full are/were the disks?
>
>
> The OSD nodes were mostly below 50% full (all are 5 TB volumes):
>
> osd.0 : 31%
> osd.1 : 31%
> osd.2 : 39%
> osd.3 : 65%
> no osd.4 :)
> osd.5 : 35%
> osd.6 : 60%
> osd.7 : 42%
> osd.8 : 34%
>
> All the volumes were using btrfs with lzo compression.
>
> [...]
>
>>
>> Oh, interesting. Are the broken nodes all on the same set of arrays?
>>>
>>>
>>> No. There are 4 completely independent RAID arrays, in 4 different
>>> locations. They are similar (same brand and model, but slightly different
>>> disks and one different firmware), and all arrays are multipathed. I don't
>>> think the RAID arrays are the problem. We have been using these particular
>>> models for 2-3 years, and in the logs I don't see any problem that could be
>>> caused by the storage itself (like SCSI or multipath errors).
>>
>> I must have misunderstood then. What did you mean by "1 Array for 2 OSD
>> nodes"?
>
>
> I have 8 osd nodes, in 4 different locations (several km apart). In each
> location I have 2 nodes and 1 RAID array.
> In each location, the RAID array has 16 2 TB disks and 2 controllers with
> 4x 8 Gb FC channels each. The 16 disks are organized in RAID 5 (8 disks for
> one set, 7 disks for the other). Each RAID set is primarily attached to 1
> controller, and each osd node at the location has access to the controller
> over 2 distinct paths.
>
> There was no correlation between failed nodes and RAID arrays.
>
>
> Cheers,
>
> --
> Yann Dupont - Service IRTS, DSI Université de Nantes
> Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: domino-style OSD crash
  2012-07-04  8:06             ` Yann Dupont
  2012-07-04 16:21               ` Gregory Farnum
@ 2012-07-09 17:43               ` Tommi Virtanen
  2012-07-09 19:05                 ` Yann Dupont
  1 sibling, 1 reply; 25+ messages in thread
From: Tommi Virtanen @ 2012-07-09 17:43 UTC (permalink / raw)
  To: Yann Dupont; +Cc: Sam Just, ceph-devel

On Wed, Jul 4, 2012 at 1:06 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> Well, I probably wasn't clear enough. I talked about a crashed FS, but I was
> talking about ceph. The underlying FS (btrfs in that case) of 1 node (and
> only one) has PROBABLY crashed in the past, causing corruption in ceph data
> on this node, and then the subsequent crash of other nodes.
>
> RIGHT now btrfs on this node is OK. I can access the filesystem without
> errors.

But the LevelDB isn't. Its contents got corrupted, somehow, somewhere,
and it really is up to the LevelDB library to tolerate those errors;
we have a simple get/put interface we use, and LevelDB is triggering
an internal error.

> One node had a problem with btrfs, leading first to a kernel problem, probably
> corruption (on disk / in memory, maybe?), and ultimately to a kernel oops.
> Before that ultimate kernel oops, bad data had been transmitted to other
> (sane) nodes, leading to ceph-osd crashes on those nodes.

The LevelDB binary contents are not transferred over to other nodes;
this kind of corruption would not spread over the Ceph clustering
mechanisms. It's more likely that you have 4 independently corrupted
LevelDBs. Something in the workload Ceph runs makes that corruption
quite likely.

The information here isn't enough to say whether the cause of the
corruption is btrfs or LevelDB, but the recovery needs to be handled by
LevelDB -- and upstream is working on making it more robust:
http://code.google.com/p/leveldb/issues/detail?id=97
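
To make the "simple get/put interface" point concrete, here is a minimal,
hypothetical sketch of how a LevelDB store is opened and used, and where a
repair attempt would hook in; the path and keys are invented for
illustration, and this is not the actual FileStore code:

    #include <string>
    #include "leveldb/db.h"

    int main() {
      leveldb::Options options;
      options.create_if_missing = true;
      options.paranoid_checks = true;   // report corruption as early as possible

      leveldb::DB* db = NULL;
      leveldb::Status s = leveldb::DB::Open(options, "/tmp/example-omap", &db);

      // The "simple get/put interface": write a value, read it back.
      if (s.ok()) s = db->Put(leveldb::WriteOptions(), "some_key", "some_value");
      std::string value;
      if (s.ok()) s = db->Get(leveldb::ReadOptions(), "some_key", &value);
      delete db;

      // If Open()/Get() ever reports Corruption, the store itself has to be
      // salvaged; leveldb::RepairDB() is the upstream hook for that.
      if (s.IsCorruption())
        leveldb::RepairDB("/tmp/example-omap", options);

      return s.ok() ? 0 : 1;
    }

paranoid_checks and RepairDB() are the existing knobs on the LevelDB side;
making that recovery path robust is, as far as I understand it, what the
issue linked above is about.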

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: domino-style OSD crash
  2012-07-09 17:43               ` Tommi Virtanen
@ 2012-07-09 19:05                 ` Yann Dupont
  2012-07-09 19:48                   ` Tommi Virtanen
  0 siblings, 1 reply; 25+ messages in thread
From: Yann Dupont @ 2012-07-09 19:05 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Sam Just, ceph-devel

On 09/07/2012 19:43, Tommi Virtanen wrote:
> On Wed, Jul 4, 2012 at 1:06 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>> Well, I probably wasn't clear enough. I talked about a crashed FS, but I was
>> talking about ceph. The underlying FS (btrfs in that case) of 1 node (and
>> only one) has PROBABLY crashed in the past, causing corruption in ceph data
>> on this node, and then the subsequent crash of other nodes.
>>
>> RIGHT now btrfs on this node is OK. I can access the filesystem without
>> errors.
> But the LevelDB isn't. Its contents got corrupted, somehow, somewhere,
> and it really is up to the LevelDB library to tolerate those errors;
> we have a simple get/put interface we use, and LevelDB is triggering
> an internal error.
Yes, understood.

>> One node had a problem with btrfs, leading first to a kernel problem, probably
>> corruption (on disk / in memory, maybe?), and ultimately to a kernel oops.
>> Before that ultimate kernel oops, bad data had been transmitted to other
>> (sane) nodes, leading to ceph-osd crashes on those nodes.
> The LevelDB binary contents are not transferred over to other nodes;
OK, thanks for the clarification.
> this kind of corruption would not spread over the Ceph clustering
> mechanisms. It's more likely that you have 4 independently corrupted
> LevelDBs. Something in the workload Ceph runs makes that corruption
> quite likely.
Very likely: since I reformatted my nodes with XFS, I haven't had 
problems so far.
>
> The information here isn't enough to say whether the cause of the
> corruption is btrfs or LevelDB, but the recovery needs to be handled by
> LevelDB -- and upstream is working on making it more robust:
> http://code.google.com/p/leveldb/issues/detail?id=97
Yes, I saw this. It's very important. Sometimes, s... happens. Given 
the size ceph volumes can reach, having a tool to restart damaged 
nodes (for whatever reason) is a must.

Thanks for the time you took to answer. It's much clearer for me now.

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: domino-style OSD crash
  2012-07-09 19:05                 ` Yann Dupont
@ 2012-07-09 19:48                   ` Tommi Virtanen
  0 siblings, 0 replies; 25+ messages in thread
From: Tommi Virtanen @ 2012-07-09 19:48 UTC (permalink / raw)
  To: Yann Dupont; +Cc: Sam Just, ceph-devel

On Mon, Jul 9, 2012 at 12:05 PM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>> The information here isn't enough to say whether the cause of the
>> corruption is btrfs or LevelDB, but the recovery needs to be handled by
>> LevelDB -- and upstream is working on making it more robust:
>> http://code.google.com/p/leveldb/issues/detail?id=97
>
> Yes, I saw this. It's very important. Sometimes, s... happens. Given the
> size ceph volumes can reach, having a tool to restart damaged nodes (for
> whatever reason) is a must.
>
> Thanks for the time you took to answer. It's much clearer for me now.

If it doesn't recover, you re-format the disk and thereby throw away
the contents. Not really all that different from handling hardware
failure. That's why we have replication.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: domino-style OSD crash
  2012-07-09 17:14                           ` Samuel Just
@ 2012-07-10  9:46                             ` Yann Dupont
  2012-07-10 15:56                               ` Tommi Virtanen
  0 siblings, 1 reply; 25+ messages in thread
From: Yann Dupont @ 2012-07-10  9:46 UTC (permalink / raw)
  To: Samuel Just; +Cc: Gregory Farnum, ceph-devel

On 09/07/2012 19:14, Samuel Just wrote:
> Can you restart the node that failed to complete the upgrade with

Well, it's a little bit complicated; I now run those nodes with XFS, 
and I have long-running jobs on them right now, so I can't stop the ceph 
cluster at the moment.

As I've kept the original broken btrfs volumes, I tried this morning 
to run the old osds in parallel, using the $cluster variable. I only 
had partial success.
I tried using different ports for the mons, but ceph wants to use the old 
mon map. I can edit it (epoch 1), but it seems to use 'latest' instead; 
the format isn't compatible with monmaptool and I don't know how to 
"inject" the modified one into a non-running cluster.

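For what it's worth, the usual monmap round trip looks roughly like the
sketch below; the monitor id, address and file paths are hypothetical, and
the --extract-monmap/--inject-monmap options may not exist in a release
this old, so treat it as an assumption rather than a recipe:

    # with the monitor stopped, pull the current map out of its store
    ceph-mon -i a --extract-monmap /tmp/monmap   # flag may be missing on 0.4x
    monmaptool --print /tmp/monmap

    # or build a fresh map for the debug cluster from scratch
    monmaptool --create --clobber --add a 192.168.0.10:6790 /tmp/monmap

    # push the edited map back into the (still stopped) monitor's store
    ceph-mon -i a --inject-monmap /tmp/monmap
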
Anyway, the osd seems to start fine, and I can reproduce the bug:
> debug filestore = 20
> debug osd = 20
>

I've put it in [global]; is that sufficient?

>
> and post the log after an hour or so of running?  The upgrade process
> might legitimately take a while.
> -Sam
Only 15 minutes running, but ceph-osd is consuming lots of CPU, and an 
strace shows lots of pread calls.
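
A sketch of the strace invocation that shows this, assuming a single
running ceph-osd process (syscall names can differ slightly per arch):

    # follow all ceph-osd threads, log read/write syscalls with timings
    strace -f -T -e trace=read,write,pread64,pwrite64 -p "$(pidof ceph-osd)"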

Here is the log:

[..]
2012-07-10 11:33:29.560052 7f3e615ac780  0 
filestore(/CEPH-PROD/data/osd.1) mount syncfs(2) syscall not support by 
glibc
2012-07-10 11:33:29.560062 7f3e615ac780  0 
filestore(/CEPH-PROD/data/osd.1) mount no syncfs(2), but the btrfs SYNC 
ioctl will suffice
2012-07-10 11:33:29.560172 7f3e615ac780 -1 
filestore(/CEPH-PROD/data/osd.1) FileStore::mount : stale version stamp 
detected: 2. Proceeding, do_update is set, performing disk format upgrade.
2012-07-10 11:33:29.560233 7f3e615ac780  0 
filestore(/CEPH-PROD/data/osd.1) mount found snaps <3744666,3746725>
2012-07-10 11:33:29.560263 7f3e615ac780 10 
filestore(/CEPH-PROD/data/osd.1)  current/ seq was 3746725
2012-07-10 11:33:29.560267 7f3e615ac780 10 
filestore(/CEPH-PROD/data/osd.1)  most recent snap from 
<3744666,3746725> is 3746725
2012-07-10 11:33:29.560280 7f3e615ac780 10 
filestore(/CEPH-PROD/data/osd.1) mount rolling back to consistent snap 
3746725
2012-07-10 11:33:29.839281 7f3e615ac780  5 
filestore(/CEPH-PROD/data/osd.1) mount op_seq is 3746725


... and nothing more.

I'll let it run for 3 hours. If I get another message, I'll let 
you know.

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: domino-style OSD crash
  2012-07-10  9:46                             ` Yann Dupont
@ 2012-07-10 15:56                               ` Tommi Virtanen
  2012-07-10 16:39                                 ` Yann Dupont
  0 siblings, 1 reply; 25+ messages in thread
From: Tommi Virtanen @ 2012-07-10 15:56 UTC (permalink / raw)
  To: Yann Dupont; +Cc: Samuel Just, Gregory Farnum, ceph-devel

On Tue, Jul 10, 2012 at 2:46 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> As I've kept the original broken btrfs volumes, I tried this morning to
> run the old osds in parallel, using the $cluster variable. I only had
> partial success.

The cluster mechanism was never intended for moving existing osds to
other clusters. Trying that might not be a good idea.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: domino-style OSD crash
  2012-07-10 15:56                               ` Tommi Virtanen
@ 2012-07-10 16:39                                 ` Yann Dupont
  2012-07-10 17:11                                   ` Tommi Virtanen
  0 siblings, 1 reply; 25+ messages in thread
From: Yann Dupont @ 2012-07-10 16:39 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Samuel Just, Gregory Farnum, ceph-devel

On 10/07/2012 17:56, Tommi Virtanen wrote:
> On Tue, Jul 10, 2012 at 2:46 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>> As I've kept the original broken btrfs volumes, I tried this morning to
>> run the old osds in parallel, using the $cluster variable. I only had
>> partial success.
> The cluster mechanism was never intended for moving existing osds to
> other clusters. Trying that might not be a good idea.
OK, good to know. I saw that the remaining maps could lead to problems, 
but in a few words, what are the other associated risks? Basically, if I 
use 2 distinct config files, with different and non-overlapping paths, and 
different ports for the OSDs, MDSes and MONs, do we basically have 2 
distinct and independent instances?

By the way, is using 2 mon instances with different ports supported?

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: domino-style OSD crash
  2012-07-10 16:39                                 ` Yann Dupont
@ 2012-07-10 17:11                                   ` Tommi Virtanen
  2012-07-10 17:36                                     ` Yann Dupont
  0 siblings, 1 reply; 25+ messages in thread
From: Tommi Virtanen @ 2012-07-10 17:11 UTC (permalink / raw)
  To: Yann Dupont; +Cc: Samuel Just, Gregory Farnum, ceph-devel

On Tue, Jul 10, 2012 at 9:39 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>> The cluster mechanism was never intended for moving existing osds to
>> other clusters. Trying that might not be a good idea.
> OK, good to know. I saw that the remaining maps could lead to problems, but
> in a few words, what are the other associated risks? Basically, if I use 2
> distinct config files, with different and non-overlapping paths, and
> different ports for the OSDs, MDSes and MONs, do we basically have 2
> distinct and independent instances?

Fundamentally, it comes down to this: the two clusters will still have
the same fsid, and you won't be isolated from configuration errors or
leftover state (such as the monmap) in any way. There's a high chance
that your "let's poke around and debug" cluster wrecks your healthy
cluster.

> By the way, is using 2 mon instances with different ports supported?

Monitors are identified by ip:port. You can have multiple monitors bound
to the same IP address, as long as they get separate ports.

Naturally, this practically means giving up on high availability.
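
As an illustration only, the second cluster's monitor section could look
something like this; the cluster name "old", the hostname, address, port
and paths below are all assumptions, not anything taken from this setup:

    # /etc/ceph/old.conf -- hypothetical config for the debug cluster
    [global]
        # a different fsid than the healthy cluster's
        fsid = <uuid of the old cluster>
    [mon.a]
        host = mon1
        # the healthy cluster's monitor keeps the default port 6789 here
        mon addr = 192.168.0.10:6790
        mon data = /CEPH-PROD/mon/$cluster-$id

Daemons for that cluster would then be started with --cluster old (or
-c /etc/ceph/old.conf) so they read this file instead of the default
ceph.conf.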

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: domino-style OSD crash
  2012-07-10 17:11                                   ` Tommi Virtanen
@ 2012-07-10 17:36                                     ` Yann Dupont
  2012-07-10 18:16                                       ` Tommi Virtanen
  0 siblings, 1 reply; 25+ messages in thread
From: Yann Dupont @ 2012-07-10 17:36 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Samuel Just, Gregory Farnum, ceph-devel

On 10/07/2012 19:11, Tommi Virtanen wrote:
> On Tue, Jul 10, 2012 at 9:39 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>>> The cluster mechanism was never intended for moving existing osds to
>>> other clusters. Trying that might not be a good idea.
>> OK, good to know. I saw that the remaining maps could lead to problems, but
>> in a few words, what are the other associated risks? Basically, if I use 2
>> distinct config files, with different and non-overlapping paths, and
>> different ports for the OSDs, MDSes and MONs, do we basically have 2
>> distinct and independent instances?
> Fundamentally, it comes down to this: the two clusters will still have
> the same fsid, and you won't be isolated from configuration errors or

Ah, I understand. This is not the case; see:

root@chichibu:~# cat /CEPH/data/osd.0/fsid
f00139fe-478e-4c50-80e2-f7cb359100d4
root@chichibu:~# cat /CEPH-PROD/data/osd.0/fsid
43afd025-330e-4aa8-9324-3e9b0afce794

(CEPH-PROD is the old btrfs volume). /CEPH is the new XFS volume, 
completely redone and reformatted with mkcephfs. The volumes are totally 
independent:

If you want the gory details:

root@chichibu:~# lvs
   LV              VG             Attr   LSize   Origin Snap%  Move Log Copy%  Convert
   ceph-osd        LocalDisk      -wi-a- 225,00g
   mon-btrfs       LocalDisk      -wi-ao 10,00g
   mon-xfs         LocalDisk      -wi-ao 10,00g
   data            ceph-chichibu  -wi-ao   5,00t    <- OLD btrfs, mounted on /CEPH-PROD
   data            xceph-chichibu -wi-ao   4,50t    <- NEW xfs, mounted on /CEPH

> leftover state (such as the monmap) in any way. There's a high chance
> that your "let's poke around and debug" cluster wrecks your healthy
> cluster.

Yes I understand the risk.

>> By the way, is using 2 mon instances with different ports supported?
> Monitors are identified by ip:port. You can have multiple monitors bound
> to the same IP address, as long as they get separate ports.
>
> Naturally, this practically means giving up on high availability.

The idea is not just having 2 mons. I'll still use 3 different machines 
for the mons, but with 2 mon instances on each: one for the current ceph, 
the other for the old ceph.
2x3 mons.

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: domino-style OSD crash
  2012-07-10 17:36                                     ` Yann Dupont
@ 2012-07-10 18:16                                       ` Tommi Virtanen
  0 siblings, 0 replies; 25+ messages in thread
From: Tommi Virtanen @ 2012-07-10 18:16 UTC (permalink / raw)
  To: Yann Dupont; +Cc: Samuel Just, Gregory Farnum, ceph-devel

On Tue, Jul 10, 2012 at 10:36 AM, Yann Dupont
<Yann.Dupont@univ-nantes.fr> wrote:
>> Fundamentally, it comes down to this: the two clusters will still have
>> the same fsid, and you won't be isolated from configuration errors or
> (CEPH-PROD is the old btrfs volume). /CEPH is the new XFS volume, completely
> redone and reformatted with mkcephfs. The volumes are totally independent:

Ah, you re-created the monitors too. That changes things: then you
have a new random fsid. I had understood you only re-mkfsed the osds.

Doing it like that, your real worry is just the remembered state of
monmaps, osdmaps etc. If the daemons accidentally talk to the wrong
cluster, the fsid *should* protect you from damage; they should get
rejected. Similarly, if you use cephx authentication, the keys won't
match either.

>> Naturally, this practically means giving up on high availability.
> The idea is not just having 2 mons. I'll still use 3 different machines for
> the mons, but with 2 mon instances on each: one for the current ceph, the
> other for the old ceph.
> 2x3 mons.

That should be perfectly doable.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2012-07-10 18:16 UTC | newest]

Thread overview: 25+ messages
2012-06-04  8:44 domino-style OSD crash Yann Dupont
2012-06-04 16:16 ` Tommi Virtanen
2012-06-04 17:40   ` Sam Just
2012-06-04 18:34     ` Greg Farnum
2012-07-03  8:40     ` Yann Dupont
2012-07-03 19:42       ` Tommi Virtanen
2012-07-03 20:54         ` Yann Dupont
2012-07-03 21:38           ` Tommi Virtanen
2012-07-04  8:06             ` Yann Dupont
2012-07-04 16:21               ` Gregory Farnum
2012-07-04 17:53                 ` Yann Dupont
2012-07-05 21:32                   ` Gregory Farnum
2012-07-06  7:19                     ` Yann Dupont
2012-07-06 17:01                       ` Gregory Farnum
2012-07-07  8:19                         ` Yann Dupont
2012-07-09 17:14                           ` Samuel Just
2012-07-10  9:46                             ` Yann Dupont
2012-07-10 15:56                               ` Tommi Virtanen
2012-07-10 16:39                                 ` Yann Dupont
2012-07-10 17:11                                   ` Tommi Virtanen
2012-07-10 17:36                                     ` Yann Dupont
2012-07-10 18:16                                       ` Tommi Virtanen
2012-07-09 17:43               ` Tommi Virtanen
2012-07-09 19:05                 ` Yann Dupont
2012-07-09 19:48                   ` Tommi Virtanen
