* OSD doesn't start
@ 2012-07-04 15:31 Székelyi Szabolcs
  2012-07-04 16:34 ` Gregory Farnum
  0 siblings, 1 reply; 7+ messages in thread
From: Székelyi Szabolcs @ 2012-07-04 15:31 UTC (permalink / raw)
  To: ceph-devel

Hi,

after upgrading to 0.48 "Argonaut", my OSDs won't start up again. This problem
might not be related to the upgrade, since the cluster showed strange behavior
before, too: ceph-fuse was spinning the CPU at around 70%, and so were the OSDs.
This happened on both of my clusters. I thought that upgrading might solve the
problem, but it just got worse.

I've copied the log of the OSD run to http://pastebin.com/XYRtfFMU . I've 
rebooted all the nodes, but they still don't work.
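
In case a more verbose startup log helps: my understanding is that the OSD can
be run in the foreground with debugging turned up, roughly as below. The daemon
id and config path are just placeholders:

  # run osd.0 in the foreground, log to stderr, verbose osd/filestore debugging
  sudo ceph-osd -i 0 -d -c /etc/ceph/ceph.conf --debug-osd 20 --debug-filestore 20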

What should I do to resurrect my OSDs?

Thanks,
-- 
cc




* Re: OSD doesn't start
  2012-07-04 15:31 OSD doesn't start Székelyi Szabolcs
@ 2012-07-04 16:34 ` Gregory Farnum
  2012-07-05 14:12   ` Székelyi Szabolcs
  2012-07-08 18:53   ` Székelyi Szabolcs
  0 siblings, 2 replies; 7+ messages in thread
From: Gregory Farnum @ 2012-07-04 16:34 UTC (permalink / raw)
  To: Székelyi Szabolcs; +Cc: ceph-devel

Hrm, it looks like the OSD data directory got a little busted somehow. How did you perform your upgrade? (That is, how did you kill your daemons, in what order, and when did you bring them back up.)  
-Greg


On Wednesday, July 4, 2012 at 8:31 AM, Székelyi Szabolcs wrote:

> Hi,
>  
> after upgrading to 0.48 "Argonaut", my OSDs won't start up again. This problem  
> might not be related to the upgrade, since the cluster had strange behavior  
> before, too: ceph-fuse was spinning the CPU around 70%, so did the OSDs. This  
> happened to both of my clusters. Thought that upgrading might solve the  
> problem, but it just got worse.
>  
> I've copied the log of the OSD run to http://pastebin.com/XYRtfFMU . I've  
> rebooted all the nodes, but they still don't work.
>  
> What should I do to resurrect my OSDs?
>  
> Thanks,
> --  
> cc
>  
>  




* Re: OSD doesn't start
  2012-07-04 16:34 ` Gregory Farnum
@ 2012-07-05 14:12   ` Székelyi Szabolcs
  2012-07-05 23:33     ` Székelyi Szabolcs
  2012-07-08 18:53   ` Székelyi Szabolcs
  1 sibling, 1 reply; 7+ messages in thread
From: Székelyi Szabolcs @ 2012-07-05 14:12 UTC (permalink / raw)
  To: ceph-devel

On 2012. July 4. 09:34:04 Gregory Farnum wrote:
> Hrm, it looks like the OSD data directory got a little busted somehow. How
> did you perform your upgrade? (That is, how did you kill your daemons, in
> what order, and when did you bring them back up.)

Since it would take long to describe in words, I've collected the relevant
log entries, sorted by time, at http://pastebin.com/Ev3M4DQ9 . The short story
is that after seeing that the OSDs wouldn't start, I tried to bring down the
whole cluster and start it up from scratch. That didn't change anything, so I
rebooted the two machines (which run all three daemons) to see if that would
help. It didn't, and I gave up.

My ceph config is available at http://pastebin.com/KKNjmiWM .

Since this is my test cluster, I'm not very concerned about the data on it.
But the other one, with the same config, is dying, I think. ceph-fuse is eating
around 75% CPU on the sole monitor ("cc") node, and the monitor itself about
15%. On the other two nodes, the OSD eats around 50%, the MDS 15%, and the
monitor another 10%. No Ceph filesystem activity is going on at the moment.
Blktrace reports about 1 kB/s of disk traffic on the partition hosting the OSD
data dir. The data seems to be accessible for now, but I'm afraid that my
production cluster will end up in a similar situation after the upgrade, so I
don't dare touch it.
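
For reference, the numbers above come from something like the following; the
device name is a placeholder and the exact invocations may have differed:

  # per-daemon CPU usage, one batch-mode sample
  top -b -n 1 | grep -E 'ceph-(osd|mon|mds|fuse)'
  # block-level traffic on the partition holding the OSD data dir
  sudo blktrace -d /dev/sdb1 -o - | blkparse -i -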

Do you have any suggestion what I should check?

Thanks,
-- 
cc

> On Wednesday, July 4, 2012 at 8:31 AM, Székelyi Szabolcs wrote:
> > Hi,
> > 
> > after upgrading to 0.48 "Argonaut", my OSDs won't start up again. This
> > problem might not be related to the upgrade, since the cluster had
> > strange behavior before, too: ceph-fuse was spinning the CPU around 70%,
> > so did the OSDs. This happened to both of my clusters. Thought that
> > upgrading might solve the problem, but it just got worse.
> > 
> > I've copied the log of the OSD run to http://pastebin.com/XYRtfFMU . I've
> > rebooted all the nodes, but they still don't work.
> > 
> > What should I do to resurrect my OSDs?
> > 
> > Thanks,
> > --
> > cc
> > 
> > 

* Re: OSD doesn't start
  2012-07-05 14:12   ` Székelyi Szabolcs
@ 2012-07-05 23:33     ` Székelyi Szabolcs
  2012-07-08 18:51       ` Székelyi Szabolcs
  0 siblings, 1 reply; 7+ messages in thread
From: Székelyi Szabolcs @ 2012-07-05 23:33 UTC (permalink / raw)
  To: ceph-devel

On 2012. July 5. 16:12:42 Székelyi Szabolcs wrote:
> On 2012. July 4. 09:34:04 Gregory Farnum wrote:
> > Hrm, it looks like the OSD data directory got a little busted somehow. How
> > did you perform your upgrade? (That is, how did you kill your daemons, in
> > what order, and when did you bring them back up.)
> 
> Since it would be hard and long to describe in text, I've collected the
> relevant log entries, sorted by time at http://pastebin.com/Ev3M4DQ9 . The
> short story is that after seeing that the OSDs won't start, I tried to bring
> down the whole cluster and start it up from scratch. It didn't change
> anything, so I rebooted the two machines (running all three daemons), to
> see if it changes anything. It didn't and I gave up.
> 
> My ceph config is available at http://pastebin.com/KKNjmiWM .
> 
> Since this is my test cluster, I'm not very concerned about the data on it.
> But the other one, with the same config, is dying I think. ceph-fuse is
> eating around 75% CPU on the sole monitor ("cc") node. The monitor about
> 15%. On the other two nodes, the OSD eats around 50%, the MDS 15%, the
> monitor another 10%. No Ceph filesystem activity is going on at the moment.
> Blktrace reports about 1kB/s disk traffic on the partition hosting the OSD
> data dir. The data seems to be accessible at the moment, but I'm afraid
> that my production cluster will end up in a similar situation after
> upgrade, so I don't dare to touch it.
> 
> Do you have any suggestion what I should check?

Yes, it definitely looks like it's dying. Besides the above symptoms, all
clients' ceph-fuse processes burn the CPU, there are unreadable files on the fs
(tar blocks on them forever), and the FUSE clients emit messages like

ceph-fuse: 2012-07-05 23:21:41.583692 7f444dfd5700  0 -- client_ip:0/1181 
send_message dropped message ping v1 because of no pipe on con 0x1034000

every 5 seconds. I tried to back up the data on it, but the backup got stuck
partway through. Since then I've been unable to get any data out of it, not
even by killing ceph-fuse and remounting the fs.
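
For reference, by "remounting" I mean something along these lines; the mount
point and monitor address are placeholders:

  # unmount the FUSE client and mount it again
  fusermount -u /mnt/ceph
  ceph-fuse -m 192.168.0.1:6789 /mnt/ceph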

-- 
cc



* Re: OSD doesn't start
  2012-07-05 23:33     ` Székelyi Szabolcs
@ 2012-07-08 18:51       ` Székelyi Szabolcs
  0 siblings, 0 replies; 7+ messages in thread
From: Székelyi Szabolcs @ 2012-07-08 18:51 UTC (permalink / raw)
  To: ceph-devel

On 2012. July 6. 01:33:13 Székelyi Szabolcs wrote:
> On 2012. July 5. 16:12:42 Székelyi Szabolcs wrote:
> > On 2012. July 4. 09:34:04 Gregory Farnum wrote:
> > > Hrm, it looks like the OSD data directory got a little busted somehow.
> > > How did you perform your upgrade? (That is, how did you kill your
> > > daemons, in what order, and when did you bring them back up.)
> > 
> > Since it would be hard and long to describe in text, I've collected the
> > relevant log entries, sorted by time at http://pastebin.com/Ev3M4DQ9 . The
> > short story is that after seeing that the OSDs won't start, I tried to
> > bring down the whole cluster and start it up from scratch. It didn't
> > change anything, so I rebooted the two machines (running all three
> > daemons), to see if it changes anything. It didn't and I gave up.
> > 
> > My ceph config is available at http://pastebin.com/KKNjmiWM .
> > 
> > Since this is my test cluster, I'm not very concerned about the data
> > on it. But the other one, with the same config, is dying I think.
> > ceph-fuse is eating around 75% CPU on the sole monitor ("cc") node.
> > The monitor about 15%. On the other two nodes, the OSD eats around 50%,
> > the MDS 15%, the monitor another 10%. No Ceph filesystem activity is
> > going on at the moment.
> > Blktrace reports about 1kB/s disk traffic on the partition hosting the OSD
> > data dir. The data seems to be accessible at the moment, but I'm afraid
> > that my production cluster will end up in a similar situation after
> > upgrade, so I don't dare to touch it.
> > 
> > Do you have any suggestion what I should check?
> 
> Yes, it definitely looks like dying. Besides the above symptoms all clients'
> ceph-fuse burn the CPU, there are unreadable files on the fs (tar blocks on
> them infinitely), the FUSE clients emit messages like
> 
> ceph-fuse: 2012-07-05 23:21:41.583692 7f444dfd5700  0 -- client_ip:0/1181
> send_message dropped message ping v1 because of no pipe on con 0x1034000
> 
> every 5 seconds. I tried to backup the data on it, but it got blocked in the
> middle. Since then I'm unable to get any data out of it, not even by
> killing ceph-fuse and remounting the fs.

So it looks like the recent leap second caused all my troubles... After a
colleague applied the workaround described here[0], the load on the nodes went
back to normal, but the cluster was still sick. For example, after stopping one
of the monitors, the output of `ceph -s` still showed all the monitors as up &
running, whereas at least one of them should clearly have been marked down
(there was no ceph-mon process running there).
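
As far as I understand it, the workaround boils down to stopping ntpd and
resetting the system clock, something like the following (the init script name
varies by distribution):

  sudo /etc/init.d/ntp stop
  # setting the clock "to itself" clears the kernel's post-leap-second state
  sudo date -s "$(date)"
  # then check what the cluster thinks about its monitors
  ceph -s
  ceph mon stat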

Finally I stopped the whole cluster (BTW, the `ceph stop` command documented
here[1] no longer works; it replies with something like 'unrecognized
subsystem'), rebooted all the nodes, and everything came up as it should have.
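
In case anyone else misses `ceph stop`: the sysvinit script can stop and start
every daemon listed in ceph.conf instead. A sketch, assuming the stock init
script and passwordless ssh between the nodes for -a:

  sudo service ceph -a stop
  sudo service ceph -a start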

Cheers,
-- 
cc


[0] http://www.h-online.com/open/news/item/Leap-second-bug-in-Linux-wastes-electricity-1631462.html
[1] http://ceph.com/docs/master/control/

* Re: OSD doesn't start
  2012-07-04 16:34 ` Gregory Farnum
  2012-07-05 14:12   ` Székelyi Szabolcs
@ 2012-07-08 18:53   ` Székelyi Szabolcs
  2012-07-09 16:18     ` Gregory Farnum
  1 sibling, 1 reply; 7+ messages in thread
From: Székelyi Szabolcs @ 2012-07-08 18:53 UTC (permalink / raw)
  To: ceph-devel

On 2012. July 4. 09:34:04 Gregory Farnum wrote:
> Hrm, it looks like the OSD data directory got a little busted somehow. How
> did you perform your upgrade? (That is, how did you kill your daemons, in
> what order, and when did you bring them back up.) 

Just to make sure: what's the recommended upgrade process?

Thanks,
-- 
cc




* Re: OSD doesn't start
  2012-07-08 18:53   ` Székelyi Szabolcs
@ 2012-07-09 16:18     ` Gregory Farnum
  0 siblings, 0 replies; 7+ messages in thread
From: Gregory Farnum @ 2012-07-09 16:18 UTC (permalink / raw)
  To: Székelyi Szabolcs; +Cc: ceph-devel

On Sun, Jul 8, 2012 at 11:53 AM, Székelyi Szabolcs <szekelyi@niif.hu> wrote:
> On 2012. July 4. 09:34:04 Gregory Farnum wrote:
>> Hrm, it looks like the OSD data directory got a little busted somehow. How
>> did you perform your upgrade? (That is, how did you kill your daemons, in
>> what order, and when did you bring them back up.)
>
> Just to make sure: what's the recommended upgrade process?

Nominally, you should be able to upgrade however you like, but this
doesn't get much testing. We normally recommend doing the monitors,
and then doing OSDs all together or a rack at a time (depending on
cluster size).
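
With the stock init script, that might look roughly like the sketch below;
the daemon names and the package step are only examples:

  # on each node: upgrade the packages first (Debian/Ubuntu shown)
  sudo apt-get update && sudo apt-get install ceph
  # restart the monitors one at a time
  sudo service ceph restart mon.a
  # then restart the OSDs, all together or rack by rack
  sudo service ceph restart osd.0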

In any case, it sounds like you didn't break anything that way; I was
just looking around for clues. :)
-Greg
