* Upgrading from 0.61.5 to 0.61.6 ended in disaster
@ 2013-07-24  7:05 Stefan Priebe - Profihost AG
  2013-07-24  7:37 ` Stefan Priebe - Profihost AG
  2013-07-24 23:19 ` Sage Weil
  0 siblings, 2 replies; 12+ messages in thread
From: Stefan Priebe - Profihost AG @ 2013-07-24  7:05 UTC (permalink / raw)
  To: ceph-devel

Hi,

Today I wanted to upgrade from 0.61.5 to 0.61.6 to get rid of the mon bug.

But this ended in a complete disaster.

What I've done:
1.) recompiled ceph tagged with 0.61.6
2.) installed new ceph version on all machines
3.) JUST tried to restart ONE mon
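
Roughly, what those three steps looked like (commands reconstructed from
memory, so the exact tag name, build invocation and init-script usage may
differ slightly from what was actually run):

    # build from the release tag (standard autotools build; packaging omitted)
    git checkout v0.61.6
    ./autogen.sh && ./configure && make

    # after installing the new binaries on every node, restart ONLY mon.a
    /etc/init.d/ceph restart mon.a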

This failed with:
[1774]: (33) Numerical argument out of domain
failed: 'ulimit -n 8192;  /usr/bin/ceph-mon -i a --pid-file
/var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf '

2013-07-24 08:41:43.086951 7f53c185d700 -1 mon.a@0(leader) e1 *** Got
Signal Terminated ***
2013-07-24 08:41:43.088090 7f53c185d700  0 quorum service shutdown
2013-07-24 08:41:43.088094 7f53c185d700  0 mon.a@0(???).health(3840)
HealthMonitor::service_shutdown 1 services
2013-07-24 08:41:43.088097 7f53c185d700  0 quorum service shutdown
2013-07-24 08:41:44.224104 7fae6384a780  0 ceph version
0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3), process
ceph-mon, pid 29871
2013-07-24 08:41:56.097385 7fae6384a780 -1 mon/OSDMonitor.cc: In
function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread
7fae6384a780 time 2013-07-24 08:41:56.096683
mon/OSDMonitor.cc: 156: FAILED assert(latest_full > 0)

 ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3)
 1: (OSDMonitor::update_from_paxos(bool*)+0x2413) [0x50f5a3]
 2: (PaxosService::refresh(bool*)+0xe6) [0x4f2c66]
 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48f7b7]
 4: (Monitor::init_paxos()+0xe5) [0x48f955]
 5: (Monitor::preinit()+0x679) [0x4bba79]
 6: (main()+0x36b0) [0x484bb0]
 7: (__libc_start_main()+0xfd) [0x7fae619a6c8d]
 8: /usr/bin/ceph-mon() [0x4801e9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

--- begin dump of recent events ---
   -13> 2013-07-24 08:41:44.222821 7fae6384a780  5 asok(0x2698000)
register_command perfcounters_dump hook 0x2682010
   -12> 2013-07-24 08:41:44.222835 7fae6384a780  5 asok(0x2698000)
register_command 1 hook 0x2682010
   -11> 2013-07-24 08:41:44.222837 7fae6384a780  5 asok(0x2698000)
register_command perf dump hook 0x2682010
   -10> 2013-07-24 08:41:44.222842 7fae6384a780  5 asok(0x2698000)
register_command perfcounters_schema hook 0x2682010
    -9> 2013-07-24 08:41:44.222845 7fae6384a780  5 asok(0x2698000)
register_command 2 hook 0x2682010
    -8> 2013-07-24 08:41:44.222847 7fae6384a780  5 asok(0x2698000)
register_command perf schema hook 0x2682010
    -7> 2013-07-24 08:41:44.222849 7fae6384a780  5 asok(0x2698000)
register_command config show hook 0x2682010
    -6> 2013-07-24 08:41:44.222852 7fae6384a780  5 asok(0x2698000)
register_command config set hook 0x2682010
    -5> 2013-07-24 08:41:44.222854 7fae6384a780  5 asok(0x2698000)
register_command log flush hook 0x2682010
    -4> 2013-07-24 08:41:44.222856 7fae6384a780  5 asok(0x2698000)
register_command log dump hook 0x2682010
    -3> 2013-07-24 08:41:44.222859 7fae6384a780  5 asok(0x2698000)
register_command log reopen hook 0x2682010
    -2> 2013-07-24 08:41:44.224104 7fae6384a780  0 ceph version
0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3), process
ceph-mon, pid 29871
    -1> 2013-07-24 08:41:44.224397 7fae6384a780  1 finished
global_init_daemonize
     0> 2013-07-24 08:41:56.097385 7fae6384a780 -1 mon/OSDMonitor.cc: In
function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread
7fae6384a780 time 2013-07-24 08:41:56.096683
mon/OSDMonitor.cc: 156: FAILED assert(latest_full > 0)

 ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3)
 1: (OSDMonitor::update_from_paxos(bool*)+0x2413) [0x50f5a3]
 2: (PaxosService::refresh(bool*)+0xe6) [0x4f2c66]
 3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48f7b7]
 4: (Monitor::init_paxos()+0xe5) [0x48f955]
 5: (Monitor::preinit()+0x679) [0x4bba79]
 6: (main()+0x36b0) [0x484bb0]
 7: (__libc_start_main()+0xfd) [0x7fae619a6c8d]
 8: /usr/bin/ceph-mon() [0x4801e9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

4.) I thought this was no problem since mon.b and mon.c were still running,
BUT all OSDs were still trying to reach mon.a:

2013-07-24 08:41:43.088997 7f011268f700  0 monclient: hunting for new mon
2013-07-24 08:41:56.792449 7f0109e7e700  0 -- 10.255.0.82:6802/29397 >>
10.255.0.100:6789/0 pipe(0x489e000 sd=286 :0 s=1 pgs=0 cs=0 l=1).fault
2013-07-24 08:42:02.792990 7f0116b6c700  0 -- 10.255.0.82:6802/29397 >>
10.255.0.100:6789/0 pipe(0x3c02780 sd=256 :0 s=1 pgs=0 cs=0 l=1).fault
2013-07-24 08:42:11.793525 7f0109d7d700  0 -- 10.255.0.82:6802/29397 >>
10.255.0.100:6789/0 pipe(0x84ec280 sd=256 :0 s=1 pgs=0 cs=0 l=1).fault
2013-07-24 08:42:23.794315 7f0109e7e700  0 -- 10.255.0.82:6802/29397 >>
10.255.0.100:6789/0 pipe(0x44c7b80 sd=286 :0 s=1 pgs=0 cs=0 l=1).fault
2013-07-24 08:42:27.621336 7f0122d2e700  0 log [WRN] : 5 slow requests,
5 included below; oldest blocked for > 30.378391 secs
2013-07-24 08:42:27.621344 7f0122d2e700  0 log [WRN] : slow request
30.378391 seconds old, received at 2013-07-24 08:41:57.242902:
osd_op(client.14727601.0:3839848
rbd_data.e0b5b26b8b4567.0000000000005b5a [write 684032~4096] 5.816d89d1
snapc bef=[bef] e142137) v4 currently wait for new map
2013-07-24 08:42:27.621348 7f0122d2e700  0 log [WRN] : slow request
30.195074 seconds old, received at 2013-07-24 08:41:57.426219:
osd_op(client.14828945.0:1088870
rbd_data.e245696b8b4567.000000000000140e [write 988160~7168] 5.ed959c36
snapc b80=[b80] e142137) v4 currently wait for new map
2013-07-24 08:42:27.621350 7f0122d2e700  0 log [WRN] : slow request
30.148871 seconds old, received at 2013-07-24 08:41:57.472422:
osd_op(client.14667314.0:2818172
rbd_data.dfcaa86b8b4567.0000000000000a13 [write 1654784~4096] 5.6972a67e
snapc baa=[baa] e142137) v4 currently wait for new map
2013-07-24 08:42:27.621351 7f0122d2e700  0 log [WRN] : slow request
30.148829 seconds old, received at 2013-07-24 08:41:57.472464:
osd_op(client.14667314.0:2818173
rbd_data.dfcaa86b8b4567.0000000000000a13 [write 1957888~4096] 5.6972a67e
snapc baa=[baa] e142137) v4 currently wait for new map
2013-07-24 08:42:27.621352 7f0122d2e700  0 log [WRN] : slow request
30.148784 seconds old, received at 2013-07-24 08:41:57.472509:
osd_op(client.14667314.0:2818174
rbd_data.dfcaa86b8b4567.0000000000000a13 [write 1966080~4096] 5.6972a67e
snapc baa=[baa] e142137) v4 currently wait for new map

...

2013-07-24 08:50:20.826687 7f00ee6d9700  0 -- 10.255.0.82:6802/29397 >>
10.255.0.100:6789/0 pipe(0xdf02280 sd=288 :0 s=1 pgs=0 cs=0 l=1).fault
2013-07-24 08:50:26.826914 7f00f1697700  0 -- 10.255.0.82:6802/29397 >>
10.255.0.100:6789/0 pipe(0x465a000 sd=229 :0 s=1 pgs=0 cs=0 l=1).fault
2013-07-24 08:50:40.713100 7f00ee6d9700  0 -- 10.255.0.82:6802/29397 >>
10.255.0.100:6789/0 pipe(0x4383680 sd=281 :0 s=1 pgs=0 cs=0 l=1).fault
2013-07-24 08:50:44.828164 7f011392a700  0 -- 10.255.0.82:6802/29397 >>
10.255.0.100:6789/0 pipe(0x41ecf00 sd=281 :0 s=1 pgs=0 cs=0 l=1).fault
2013-07-24 08:51:02.829357 7f00f1697700  0 -- 10.255.0.82:6802/29397 >>
10.255.0.100:6789/0 pipe(0x1d8b180 sd=281 :0 s=1 pgs=0 cs=0 l=1).fault

Stefan


* Re: Upgrading from 0.61.5 to 0.61.6 ended in disaster
  2013-07-24  7:05 Upgrading from 0.61.5 to 0.61.6 ended in disaster Stefan Priebe - Profihost AG
@ 2013-07-24  7:37 ` Stefan Priebe - Profihost AG
  2013-07-24 10:42   ` Joao Eduardo Luis
  2013-07-24 11:11   ` Joao Eduardo Luis
  2013-07-24 23:19 ` Sage Weil
  1 sibling, 2 replies; 12+ messages in thread
From: Stefan Priebe - Profihost AG @ 2013-07-24  7:37 UTC (permalink / raw)
  To: ceph-devel

Hi,

I uploaded my ceph mon store to cephdrop at
/home/cephdrop/ceph-mon-failed-assert-0.61.6/mon.tar.gz.

So hopefully someone can find the culprit soon.

It fails in OSDMonitor.cc here:

   // if we trigger this, then there's something else going with the store
    // state, and we shouldn't want to work around it without knowing what
    // exactly happened.
    assert(latest_full > 0);
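
Schematically, the code above that assert walks backwards through the
committed map versions looking for the newest full osdmap actually present
in the mon store, and aborts if none is found.  A much simplified,
self-contained sketch of that pattern (not the real cuttlefish source; the
helper names and the store model are made up for illustration):

    // Illustration only: the mon store is modelled as the set of versions
    // for which a full osdmap blob exists.
    #include <cassert>
    #include <cstdint>
    #include <set>

    using version_t = uint64_t;

    version_t find_latest_full(const std::set<version_t>& full_maps,
                               version_t first_committed,
                               version_t last_committed)
    {
      version_t latest_full = 0;
      for (version_t v = last_committed; v >= first_committed && v > 0; --v) {
        if (full_maps.count(v)) {     // a full osdmap is stored at version v
          latest_full = v;
          break;
        }
      }
      // If no stored full map is found in the committed range, this is the
      // assertion the mon trips at startup.
      assert(latest_full > 0);
      return latest_full;
    }

    int main()
    {
      find_latest_full({142130}, 142100, 142137);  // passes: a full map exists
      // find_latest_full({}, 142100, 142137);     // would abort like mon.a did
      return 0;
    }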

Stefan

Am 24.07.2013 09:05, schrieb Stefan Priebe - Profihost AG:
> [full original report quoted verbatim; trimmed, see the first message in this thread]


* Re: Upgrading from 0.61.5 to 0.61.6 ended in disaster
  2013-07-24  7:37 ` Stefan Priebe - Profihost AG
@ 2013-07-24 10:42   ` Joao Eduardo Luis
  2013-07-24 11:11   ` Joao Eduardo Luis
  1 sibling, 0 replies; 12+ messages in thread
From: Joao Eduardo Luis @ 2013-07-24 10:42 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel

On 07/24/2013 08:37 AM, Stefan Priebe - Profihost AG wrote:
> Hi,
>
> i uploaded my ceph mon store to cephdrop
> /home/cephdrop/ceph-mon-failed-assert-0.61.6/mon.tar.gz.
>
> So hopefully someone can find the culprit soon.
>
> It fails in OSDMonitor.cc here:
>
>     // if we trigger this, then there's something else going with the store
>      // state, and we shouldn't want to work around it without knowing what
>      // exactly happened.
>      assert(latest_full > 0);

Looking into it.  Will report back asap.

   -Joao

>
> Stefan
>
> Am 24.07.2013 09:05, schrieb Stefan Priebe - Profihost AG:
>> [full original report quoted verbatim; trimmed]


-- 
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com


* Re: Upgrading from 0.61.5 to 0.61.6 ended in disaster
  2013-07-24  7:37 ` Stefan Priebe - Profihost AG
  2013-07-24 10:42   ` Joao Eduardo Luis
@ 2013-07-24 11:11   ` Joao Eduardo Luis
  2013-07-24 11:54     ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 12+ messages in thread
From: Joao Eduardo Luis @ 2013-07-24 11:11 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel

On 07/24/2013 08:37 AM, Stefan Priebe - Profihost AG wrote:
> Hi,
>
> i uploaded my ceph mon store to cephdrop
> /home/cephdrop/ceph-mon-failed-assert-0.61.6/mon.tar.gz.
>
> So hopefully someone can find the culprit soon.
>
> It fails in OSDMonitor.cc here:
>
>     // if we trigger this, then there's something else going with the store
>      // state, and we shouldn't want to work around it without knowing what
>      // exactly happened.
>      assert(latest_full > 0);
>

A wrong variable was being used in a loop that is part of the workaround for issue 5704.

Opened a bug for this on http://tracker.ceph.com/issues/5737

A fix is available on wip-5737 (next) and wip-5737-cuttlefish.
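
To make the bug class concrete: the broken pattern is a backwards scan that
advances one variable while the body keeps testing another, so the candidate
never moves.  Schematically (made-up names, not the actual wip-5737 diff):

    #include <cstdint>
    #include <set>

    using version_t = uint64_t;

    // 'full' holds the versions for which a full osdmap is stored.
    version_t scan_buggy(const std::set<version_t>& full, version_t last)
    {
      version_t latest_full = 0;
      for (version_t v = last; v > 0; --v) {
        if (full.count(latest_full)) {  // wrong variable: should test 'v'
          latest_full = v;
          break;
        }
      }
      return latest_full;  // stays 0, so assert(latest_full > 0) fires later
    }

    version_t scan_fixed(const std::set<version_t>& full, version_t last)
    {
      version_t latest_full = 0;
      for (version_t v = last; v > 0; --v) {
        if (full.count(v)) {            // test the loop variable itself
          latest_full = v;
          break;
        }
      }
      return latest_full;
    }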

Tested the mon against your store and it worked flawlessly.  Also tested
it against the same stores used during the original fix, and they worked
just fine as well.

My question now is how the hell those stores worked fine although the
original fix was grabbing what should have been a non-existent version,
or how they did not trigger that assert.  That is what I'm going to
investigate next.

   -Joao


-- 
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com


* Re: Upgrading from 0.61.5 to 0.61.6 ended in disaster
  2013-07-24 11:11   ` Joao Eduardo Luis
@ 2013-07-24 11:54     ` Stefan Priebe - Profihost AG
  2013-07-24 15:29       ` Sage Weil
  0 siblings, 1 reply; 12+ messages in thread
From: Stefan Priebe - Profihost AG @ 2013-07-24 11:54 UTC (permalink / raw)
  To: Joao Eduardo Luis; +Cc: ceph-devel

Am 24.07.2013 13:11, schrieb Joao Eduardo Luis:
> On 07/24/2013 08:37 AM, Stefan Priebe - Profihost AG wrote:
>> Hi,
>>
>> i uploaded my ceph mon store to cephdrop
>> /home/cephdrop/ceph-mon-failed-assert-0.61.6/mon.tar.gz.
>>
>> So hopefully someone can find the culprit soon.
>>
>> It fails in OSDMonitor.cc here:
>>
>>     // if we trigger this, then there's something else going with the
>> store
>>      // state, and we shouldn't want to work around it without knowing
>> what
>>      // exactly happened.
>>      assert(latest_full > 0);
>>
> 
> Wrong variable being used in a loop as part of a workaround for 5704.
> 
> Opened a bug for this on http://tracker.ceph.com/issues/5737
> 
> A fix is available on wip-5737 (next) and wip-5737-cuttlefish.
> 
> Tested the mon against your store and it worked flawlessly.  Also tested
> it against the same stores used during the original fix and also they
> worked just fine.
> 
> My question now is how the hell those stores worked fine although the
> original fix was grabbing what should have been a non-existent version,
> or how did they not trigger that assert.  Which is what I'm going to
> investigate next.

What I don't understand is WHY the hell the OSDs didn't use the 2nd or
3rd monitor, which weren't restarted?

Greets,
Stefan


* Re: Upgrading from 0.61.5 to 0.61.6 ended in disaster
  2013-07-24 11:54     ` Stefan Priebe - Profihost AG
@ 2013-07-24 15:29       ` Sage Weil
  0 siblings, 0 replies; 12+ messages in thread
From: Sage Weil @ 2013-07-24 15:29 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: Joao Eduardo Luis, ceph-devel

On Wed, 24 Jul 2013, Stefan Priebe - Profihost AG wrote:
> What I don't understand is WHY the hell the OSDs didn't use the 2nd or
> 3rd monitor, which weren't restarted?

Double check the ceph.conf files on the OSD machines and make sure all mons 
are listed?  If the remaining mons have a quorum and the OSDs (or any 
client) are able to reach any one of them, they will discover the others.
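
For example, each OSD host's ceph.conf would normally carry something like
the following (the hostnames and the .101/.102 addresses below are
placeholders; only mon.a's 10.255.0.100 appears in your log):

    [mon.a]
        host = mon-a-host              # placeholder hostname
        mon addr = 10.255.0.100:6789
    [mon.b]
        host = mon-b-host              # placeholder hostname
        mon addr = 10.255.0.101:6789   # placeholder address
    [mon.c]
        host = mon-c-host              # placeholder hostname
        mon addr = 10.255.0.102:6789   # placeholder address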

sage


* Re: Upgrading from 0.61.5 to 0.61.6 ended in disaster
  2013-07-24  7:05 Upgrading from 0.61.5 to 0.61.6 ended in disaster Stefan Priebe - Profihost AG
  2013-07-24  7:37 ` Stefan Priebe - Profihost AG
@ 2013-07-24 23:19 ` Sage Weil
  2013-07-25  6:19   ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 12+ messages in thread
From: Sage Weil @ 2013-07-24 23:19 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: ceph-devel

On Wed, 24 Jul 2013, Stefan Priebe - Profihost AG wrote:
> 2013-07-24 08:41:56.097385 7fae6384a780 -1 mon/OSDMonitor.cc: In
> function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread
> 7fae6384a780 time 2013-07-24 08:41:56.096683
> mon/OSDMonitor.cc: 156: FAILED assert(latest_full > 0)
> 
>  ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3)
>  1: (OSDMonitor::update_from_paxos(bool*)+0x2413) [0x50f5a3]
>  2: (PaxosService::refresh(bool*)+0xe6) [0x4f2c66]
>  3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48f7b7]
>  4: (Monitor::init_paxos()+0xe5) [0x48f955]
>  5: (Monitor::preinit()+0x679) [0x4bba79]
>  6: (main()+0x36b0) [0x484bb0]
>  7: (__libc_start_main()+0xfd) [0x7fae619a6c8d]
>  8: /usr/bin/ceph-mon() [0x4801e9]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.

This is fixed in the cuttlefish branch as of earlier this afternoon.  I've 
spent most of the day expanding the automated test suite to include 
upgrade combinations to trigger this and *finally* figured out that this 
particular problem seems to surface on clusters that upgraded from bobtail 
-> cuttlefish but not clusters created on cuttlefish.

If you've run into this issue, please use the cuttlefish branch build for 
now.  We will have a release out in the next day or so that includes this 
and a few other pending fixes.

I'm sorry we missed this one!  The upgrade test matrix I've been working 
on today should catch this type of issue in the future.

Thanks!
sage



> [remainder of the original report quoted verbatim; trimmed]


* Re: Upgrading from 0.61.5 to 0.61.6 ended in disaster
  2013-07-24 23:19 ` Sage Weil
@ 2013-07-25  6:19   ` Stefan Priebe - Profihost AG
  0 siblings, 0 replies; 12+ messages in thread
From: Stefan Priebe - Profihost AG @ 2013-07-25  6:19 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Am 25.07.2013 01:19, schrieb Sage Weil:
> On Wed, 24 Jul 2013, Stefan Priebe - Profihost AG wrote:
>> 2013-07-24 08:41:56.097385 7fae6384a780 -1 mon/OSDMonitor.cc: In
>> function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread
>> 7fae6384a780 time 2013-07-24 08:41:56.096683
>> mon/OSDMonitor.cc: 156: FAILED assert(latest_full > 0)
>>
>>  ceph version 0.61.6-15-g85db066 (85db0667307ac803c753d16fa374dd2fc29d76f3)
>>  1: (OSDMonitor::update_from_paxos(bool*)+0x2413) [0x50f5a3]
>>  2: (PaxosService::refresh(bool*)+0xe6) [0x4f2c66]
>>  3: (Monitor::refresh_from_paxos(bool*)+0x57) [0x48f7b7]
>>  4: (Monitor::init_paxos()+0xe5) [0x48f955]
>>  5: (Monitor::preinit()+0x679) [0x4bba79]
>>  6: (main()+0x36b0) [0x484bb0]
>>  7: (__libc_start_main()+0xfd) [0x7fae619a6c8d]
>>  8: /usr/bin/ceph-mon() [0x4801e9]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> needed to interpret this.
> 
> This is fixed in the cuttlefish branch as of earlier this afternoon.  I've 
> spent most of the day expanding the automated test suite to include 
> upgrade combinations to trigger this and *finally* figured out that this 
> particular problem seems to surface on clusters that upgraded from bobtail 
> -> cuttlefish but not clusters created on cuttlefish.

Thanks!


> If you've run into this issue, please use the cuttlefish branch build for 
> now.  We will have a release out in the next day or so that includes this 
> and a few other pending fixes.
> 
> I'm sorry we missed this one!  The upgrade test matrix I've been working 
> on today should catch this type of issue in the future.

Thanks!

Stefan


* Re: Upgrading from 0.61.5 to 0.61.6 ended in disaster
  2013-07-25 15:46 ` Sage Weil
  2013-07-25 16:12   ` peter
@ 2013-07-29  9:40   ` peter
  1 sibling, 0 replies; 12+ messages in thread
From: peter @ 2013-07-29  9:40 UTC (permalink / raw)
  To: ceph-devel

On 2013-07-25 17:46, Sage Weil wrote:
> On Thu, 25 Jul 2013, peter@2force.nl wrote:
>> We did not upgrade from bobtail to cuttlefish and are still seeing 
>> this issue.
>> I posted this on the ceph-users mailinglist and I missed this thread 
>> (sorry!)
>> so I didn't know.
> 
> That's interesting; a bobtail upgraded cluster was the only way I was 
> able
> to reproduce it, but I'm also working with relatively short-lived 
> clusters
> in a test environment so there may very well be a possibility I 
> missed.
> Can you summarize what the lineage of your cluster is?  (What version 
> was
> it installed with, and when was it upgraded and to what versions?)
> 
>> Either way, I also have an osd crashing after upgrading to 0.61.6. As 
>> said on
>> the other list, I'm more than happy to share log files etc with you 
>> guys.
> 
> Will take a look.
> 
> Thanks!
> sage

Hi Sage,

Did you happen to find out what is causing the osd crash? I'm not sure 
what the best way is to recover from this.

Thanks,

Peter

> [remainder of quoted messages and list footer trimmed]


* Re: Upgrading from 0.61.5 to 0.61.6 ended in disaster
  2013-07-25 15:46 ` Sage Weil
@ 2013-07-25 16:12   ` peter
  2013-07-29  9:40   ` peter
  1 sibling, 0 replies; 12+ messages in thread
From: peter @ 2013-07-25 16:12 UTC (permalink / raw)
  To: ceph-devel

On 2013-07-25 17:46, Sage Weil wrote:
> On Thu, 25 Jul 2013, peter@2force.nl wrote:
>> We did not upgrade from bobtail to cuttlefish and are still seeing 
>> this issue.
>> I posted this on the ceph-users mailinglist and I missed this thread 
>> (sorry!)
>> so I didn't know.
> 
> That's interesting; a bobtail upgraded cluster was the only way I was 
> able
> to reproduce it, but I'm also working with relatively short-lived 
> clusters
> in a test environment so there may very well be a possibility I 
> missed.
> Can you summarize what the lineage of your cluster is?  (What version 
> was
> it installed with, and when was it upgraded and to what versions?)
> 

Here is what I found:

2013-06-17 17:12:57
===================
0.61.3-1precise

2013-06-20 17:27:50
===================
0.61.3-1precise -> 0.61.4-1precise

2013-07-19 14:04:30
===================
0.61.4-1precise -> 0.61.5-1precise

2013-07-23 17:51:40
===================
0.61.5-1precise -> 0.61.5-2-g7ab701a-1precise

2013-07-24 09:16:52
===================
0.61.5-2-g7ab701a-1precise -> 0.61.6-1precise

2013-07-25 16:35:59
===================
0.61.6-1-g28720b0-1precise -> 0.61.6-15-g24a56a9-1precise

Hope this helps...

>> Either way, I also have an osd crashing after upgrading to 0.61.6. As 
>> said on
>> the other list, I'm more than happy to share log files etc with you 
>> guys.
> 
> Will take a look.
> 
> Thanks!
> sage

Thanks a lot!

Peter

> [remainder of quoted messages and list footer trimmed]


* Re: Upgrading from 0.61.5 to 0.61.6 ended in disaster
  2013-07-25 11:19 peter
@ 2013-07-25 15:46 ` Sage Weil
  2013-07-25 16:12   ` peter
  2013-07-29  9:40   ` peter
  0 siblings, 2 replies; 12+ messages in thread
From: Sage Weil @ 2013-07-25 15:46 UTC (permalink / raw)
  To: peter; +Cc: ceph-devel

On Thu, 25 Jul 2013, peter@2force.nl wrote:
> We did not upgrade from bobtail to cuttlefish and are still seeing this issue.
> I posted this on the ceph-users mailinglist and I missed this thread (sorry!)
> so I didn't know.

That's interesting; a bobtail upgraded cluster was the only way I was able 
to reproduce it, but I'm also working with relatively short-lived clusters 
in a test environment so there may very well be a possibility I missed.  
Can you summarize what the lineage of your cluster is?  (What version was 
it installed with, and when was it upgraded and to what versions?)
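
(If the history isn't written down anywhere, the package logs usually have
it; on a Debian/Ubuntu node something along these lines reconstructs the
install/upgrade timeline, assuming the package is simply called "ceph":)

    # list ceph installs/upgrades with timestamps from the dpkg logs
    zgrep -h -E " (install|upgrade) ceph" /var/log/dpkg.log* | sort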

> Either way, I also have an osd crashing after upgrading to 0.61.6. As said on
> the other list, I'm more than happy to share log files etc with you guys.

Will take a look.

Thanks!
sage


> 
> Thanks,
> 
> Peter
> 
> > [quoted announcement and list footer trimmed]


* Re: Upgrading from 0.61.5 to 0.61.6 ended in disaster
@ 2013-07-25 11:19 peter
  2013-07-25 15:46 ` Sage Weil
  0 siblings, 1 reply; 12+ messages in thread
From: peter @ 2013-07-25 11:19 UTC (permalink / raw)
  To: ceph-devel

We did not upgrade from bobtail to cuttlefish and are still seeing this 
issue. I posted this on the ceph-users mailing list and missed this 
thread (sorry!), so I didn't know.

Either way, I also have an osd crashing after upgrading to 0.61.6. As 
said on the other list, I'm more than happy to share log files etc with 
you guys.

Thanks,

Peter

> This is fixed in the cuttlefish branch as of earlier this afternoon.  
> I've
> spent most of the day expanding the automated test suite to include
> upgrade combinations to trigger this and *finally* figured out that 
> this
> particular problem seems to surface on clusters that upgraded from 
> bobtail
> -> cuttlefish but not clusters created on cuttlefish.

> If you've run into this issue, please use the cuttlefish branch build 
> for
> now.  We will have a release out in the next day or so that includes 
> this
> and a few other pending fixes.

> I'm sorry we missed this one!  The upgrade test matrix I've been 
> working
> on today should catch this type of issue in the future.

> Thanks!
> sage

